Overview
AI transcription services typically output a flat sequence of word-level tokens. Paragrafs reconstructs natural paragraph boundaries by analyzing timing gaps, punctuation, filler words, and custom hints.
The Reconstruction Pipeline
Paragraph reconstruction happens in three stages:
- Mark tokens with dividers - Identify natural break points
- Group into segments - Combine tokens respecting duration limits
- Merge short segments - Avoid very short paragraphs
Stage 1: Marking Tokens with Dividers
The markTokensWithDividers function identifies break points based on:
Filler Words
Words like “uh”, “um”, and “hmm” often indicate hesitation or thought breaks. Filler words are removed from the output entirely and replaced with SEGMENT_BREAK markers.
Time Gaps
Significant pauses between words suggest natural breaks.
Punctuation
Tokens ending with sentence-ending punctuation trigger breaks:
., ?, !, ؟ (Arabic question mark), ؛ (Arabic semicolon), …
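The three signals described so far (filler words, time gaps, and sentence-ending punctuation) can be illustrated with a minimal, self-contained sketch. This is not the paragrafs implementation: the `markTokens` function, its parameters, the `Token` shape, and the choice to collapse consecutive breaks into one are all assumptions made for illustration.

```typescript
// Illustrative stand-in only; markTokens and its parameters are hypothetical.
type Token = { start: number; end: number; text: string };
const SEGMENT_BREAK = 'SEGMENT_BREAK' as const;
type Marked = Token | typeof SEGMENT_BREAK;

// Mirrors the punctuation list above; "…" is treated as the ellipsis character.
const SENTENCE_END = /[.?!؟؛…]$/;

function markTokens(tokens: Token[], fillers: string[], gapThreshold: number): Marked[] {
    const out: Marked[] = [];
    let prevEnd: number | null = null;

    // Add a break unless the output is empty or already ends with one (a sketch-level choice).
    const pushBreak = (): void => {
        if (out.length > 0 && out[out.length - 1] !== SEGMENT_BREAK) {
            out.push(SEGMENT_BREAK);
        }
    };

    for (const token of tokens) {
        // Time gap: a long pause since the previous token suggests a break.
        if (prevEnd !== null && token.start - prevEnd > gapThreshold) {
            pushBreak();
        }
        prevEnd = token.end;

        // Filler word: drop the token and leave a break in its place.
        if (fillers.includes(token.text.toLowerCase())) {
            pushBreak();
            continue;
        }

        out.push(token);

        // Punctuation: a sentence-ending character closes the current thought.
        if (SENTENCE_END.test(token.text)) {
            pushBreak();
        }
    }
    return out;
}

const marked = markTokens(
    [
        { start: 0, end: 0.5, text: 'Hello' },
        { start: 0.6, end: 1.0, text: 'world.' },
        { start: 1.1, end: 1.3, text: 'um' },
        { start: 1.4, end: 1.8, text: 'So' },
        { start: 4.0, end: 4.5, text: 'next' },
    ],
    ['uh', 'um', 'hmm'],
    1.5,
);
const rendered = marked.map((item) => (typeof item === 'string' ? '|' : item.text)).join(' ');
console.log(rendered); // Hello world. | So | next
```

In this run the filler "um" disappears, the period after "world." and the 2.2-second pause before "next" each leave one break.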
Custom Hints
Multi-word phrases can be marked with ALWAYS_BREAK to force paragraph boundaries.
See the Hints System documentation for details on normalization and matching.
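To make the idea of a multi-word hint concrete, here is a small sketch of phrase matching over a token stream. It is a sketch under stated assumptions, not the library's matcher: `markHints` and `normalize` are hypothetical names, the normalization (lowercasing and trimming trailing punctuation) is a placeholder for whatever the Hints System actually does, and placing the break before the matched phrase is also an assumption.

```typescript
// Illustrative stand-in only; markHints and normalize are hypothetical helpers.
type Token = { start: number; end: number; text: string };
const ALWAYS_BREAK = 'ALWAYS_BREAK' as const;
type Marked = Token | typeof ALWAYS_BREAK;

// Placeholder normalization (an assumption): lowercase and drop trailing punctuation.
const normalize = (word: string): string => word.toLowerCase().replace(/[.,:;?!]+$/, '');

function markHints(tokens: Token[], hints: string[]): Marked[] {
    // Each hint becomes a list of normalized words to match in sequence.
    const phrases = hints.map((hint) => hint.trim().split(/\s+/).map(normalize));
    const out: Marked[] = [];

    tokens.forEach((token, i) => {
        // Does any hint phrase start at this token?
        const startsPhrase = phrases.some((phrase) =>
            phrase.every((word, offset) => {
                const candidate = tokens[i + offset];
                return candidate !== undefined && normalize(candidate.text) === word;
            }),
        );
        // Assumption: the hard break goes before the phrase, never at the very start.
        if (startsPhrase && out.length > 0) {
            out.push(ALWAYS_BREAK);
        }
        out.push(token);
    });
    return out;
}

const words = ['Thanks', 'everyone.', 'Next', 'question,', 'please.'];
const hintTokens: Token[] = words.map((text, i) => ({ start: i, end: i + 0.5, text }));
const hinted = markHints(hintTokens, ['next question']);
const hintedText = hinted.map((item) => (typeof item === 'string' ? '||' : item.text)).join(' ');
console.log(hintedText); // Thanks everyone. || Next question, please.
```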
Stage 2: Grouping into Segments
The groupMarkedTokensIntoSegments function combines tokens into segments while respecting:
- Maximum duration - Segments won’t exceed maxSecondsPerSegment
- Break markers - ALWAYS_BREAK forces immediate segment boundaries
- Soft breaks - SEGMENT_BREAK suggests boundaries when duration is exceeded
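One plausible reading of these rules can be sketched as follows. The code is a simplified illustration rather than the library's algorithm: `groupIntoSegments` and the `Segment` shape are hypothetical, and the sketch assumes a soft break only takes effect once the accumulated duration has reached the limit.

```typescript
// Illustrative stand-in only; groupIntoSegments and Segment are hypothetical.
type Token = { start: number; end: number; text: string };
type Marker = 'SEGMENT_BREAK' | 'ALWAYS_BREAK';
type Marked = Token | Marker;
type Segment = { start: number; end: number; tokens: Token[] };

function groupIntoSegments(marked: Marked[], maxSecondsPerSegment: number): Segment[] {
    const segments: Segment[] = [];
    let current: Token[] = [];

    // Close the segment being built, if it has any tokens.
    const flush = (): void => {
        const first = current[0];
        const last = current[current.length - 1];
        if (!first || !last) return;
        segments.push({ start: first.start, end: last.end, tokens: current });
        current = [];
    };

    for (const item of marked) {
        if (item === 'ALWAYS_BREAK') {
            // Hard break: always start a new segment.
            flush();
        } else if (item === 'SEGMENT_BREAK') {
            // Soft break: only split once the segment has run long enough.
            const first = current[0];
            const last = current[current.length - 1];
            const duration = first && last ? last.end - first.start : 0;
            if (duration >= maxSecondsPerSegment) flush();
        } else {
            current.push(item);
        }
    }
    flush();
    return segments;
}

const grouped = groupIntoSegments(
    [
        { start: 0, end: 1, text: 'a' },
        { start: 1, end: 2, text: 'b' },
        'SEGMENT_BREAK', // ignored: only 2 seconds so far
        { start: 2, end: 3, text: 'c' },
        { start: 3, end: 12, text: 'd' },
        'SEGMENT_BREAK', // taken: 12 seconds exceeds the limit of 10
        { start: 12, end: 13, text: 'e' },
        'ALWAYS_BREAK', // taken regardless of duration
        { start: 13, end: 14, text: 'f' },
    ],
    10,
);
const groupedText = grouped.map((segment) => segment.tokens.map((t) => t.text).join(' '));
console.log(groupedText); // [ 'a b c d', 'e', 'f' ]
```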
Breaking Behavior
ALWAYS_BREAK creates hard boundaries: the current segment ends at the marker regardless of its duration.
Stage 3: Merging Short Segments
The mergeShortSegmentsWithPrevious function combines segments with fewer than minWordsPerSegment words with the segment before them.
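The merging step can be pictured with a short sketch. It is an illustration under assumptions, not the library's code: `mergeShortSegments`, the `Segment` shape, and the `seg` test helper are hypothetical, and the sketch leaves out any special handling of break markers.

```typescript
// Illustrative stand-in only; mergeShortSegments, Segment, and seg are hypothetical.
type Token = { start: number; end: number; text: string };
type Segment = { start: number; end: number; tokens: Token[] };

function mergeShortSegments(segments: Segment[], minWordsPerSegment: number): Segment[] {
    const result: Segment[] = [];
    for (const segment of segments) {
        const previous = result[result.length - 1];
        if (previous && segment.tokens.length < minWordsPerSegment) {
            // Too short: fold it into the segment before it.
            result[result.length - 1] = {
                start: previous.start,
                end: segment.end,
                tokens: [...previous.tokens, ...segment.tokens],
            };
        } else {
            // Long enough, or there is nothing before it to merge into.
            result.push(segment);
        }
    }
    return result;
}

// Helper for the example: one token per word, one second per word.
const seg = (text: string, start: number): Segment => {
    const tokens = text.split(' ').map((word, i) => ({ start: start + i, end: start + i + 1, text: word }));
    return { start, end: start + tokens.length, tokens };
};

const merged = mergeShortSegments(
    [seg('The first paragraph here.', 0), seg('Okay.', 4), seg('Another full paragraph follows.', 5)],
    2,
);
const mergedText = merged.map((segment) => segment.tokens.map((t) => t.text).join(' '));
console.log(mergedText); // [ 'The first paragraph here. Okay.', 'Another full paragraph follows.' ]
```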
Segments containing ALWAYS_BREAK markers are never merged, preserving intentional boundaries.
Complete Example
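A full reconstruction pipeline runs the three stages in order: markTokensWithDividers, then groupMarkedTokensIntoSegments, then mergeShortSegmentsWithPrevious.
Formatting Output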
Timestamped Transcript
Generate a timestamped transcript with one line per segment.
Custom Formatting
To control the output format, provide your own formatter.
Advanced: Cleanup Isolated Tokens
The cleanupIsolatedTokens function removes unnecessary breaks that would create single-word lines:
- SEGMENT_BREAK followed by ALWAYS_BREAK
- SEGMENT_BREAK followed by another SEGMENT_BREAK
- Breaks that would isolate a single word
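The patterns above could be handled roughly as in the sketch below. This is one interpretation, not the library's implementation: `cleanupBreaks` is a hypothetical name, only SEGMENT_BREAK markers are ever dropped, and a "single isolated word" is assumed to mean a soft break, exactly one token, and then another marker.

```typescript
// Illustrative stand-in only; cleanupBreaks is a hypothetical helper.
type Token = { start: number; end: number; text: string };
type Marker = 'SEGMENT_BREAK' | 'ALWAYS_BREAK';
type Marked = Token | Marker;

const isMarker = (item: Marked | undefined): item is Marker => typeof item === 'string';

function cleanupBreaks(marked: Marked[]): Marked[] {
    const out: Marked[] = [];
    marked.forEach((item, i) => {
        if (item === 'SEGMENT_BREAK') {
            const next = marked[i + 1];
            const afterNext = marked[i + 2];
            // Redundant: this soft break is immediately followed by another marker.
            if (isMarker(next)) return;
            // Isolating: break, one token, then another marker would strand that token.
            if (next !== undefined && isMarker(afterNext)) return;
        }
        out.push(item);
    });
    return out;
}

const tok = (text: string, start: number): Token => ({ start, end: start + 1, text });
const cleaned = cleanupBreaks([
    tok('a', 0),
    tok('b', 1),
    'SEGMENT_BREAK', // dropped: followed by another SEGMENT_BREAK
    'SEGMENT_BREAK', // kept: two tokens follow before the next marker
    tok('c', 2),
    tok('d', 3),
    'SEGMENT_BREAK', // dropped: would leave "e" on a line by itself
    tok('e', 4),
    'ALWAYS_BREAK', // hard breaks are never removed in this sketch
    tok('f', 5),
    tok('g', 6),
]);
const cleanedText = cleaned
    .map((item) => (item === 'SEGMENT_BREAK' ? '|' : item === 'ALWAYS_BREAK' ? '||' : item.text))
    .join(' ');
console.log(cleanedText); // a b | c d e || f g
```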
Next Steps
Hints System
Learn about hint normalization and matching
Ground Truth Alignment
Sync reconstructed paragraphs with human edits