Overview
Auto-hint generation mines frequent n-grams (repeated phrases) from your transcription data. This is especially useful for Arabic transcriptions, where common religious phrases or formulaic expressions appear frequently.
Why Use Hints?
Hints allow you to mark specific phrases that should trigger paragraph breaks, creating more natural segmentation. Common use cases include:
- Religious phrases (“أحسن الله إليكم”, “بارك الله فيكم”)
- Recurring formulaic expressions
- Speaker transitions or section markers
- Domain-specific terminology
Basic Hint Generation
Use generateHintsFromTokens to discover repeated phrases:
How It Works
Normalize tokens
Tokens are normalized using Arabic-first normalization (diacritics removal, alef/ya normalization).
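As a rough illustration of what this kind of normalization does, here is a minimal sketch. It is not the library's implementation: the helper name normalizeArabicToken, the exact character ranges, and the direction of the ya mapping (ى to ي) are all assumptions made for the example.

```typescript
// Illustrative only: a hypothetical helper, not the paragrafs implementation.
const DIACRITICS = /[\u064B-\u0652\u0670]/g; // tanween, harakat, shadda, sukun, superscript alef
const ALEF_VARIANTS = /[\u0622\u0623\u0625\u0671]/g; // آ أ إ ٱ
const ALEF_MAQSURA = /\u0649/g; // ى

function normalizeArabicToken(token: string): string {
    return token
        .replace(DIACRITICS, '') // diacritics removal
        .replace(ALEF_VARIANTS, '\u0627') // every alef variant becomes bare ا
        .replace(ALEF_MAQSURA, '\u064A'); // assumed direction: ى becomes ي
}

// A vocalized spelling and a plain spelling of the same phrase.
const vocalized = ['أَحْسَنَ', 'اللهُ', 'إِلَيْكُمْ'].map(normalizeArabicToken).join(' ');
const plain = ['احسن', 'الله', 'اليكم'].map(normalizeArabicToken).join(' ');
console.log(vocalized, '|', plain);
```

Normalizing first means that spelling variants of one phrase are counted together rather than as separate n-grams.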
Count n-grams
Frequent n-grams are counted from the normalized token stream based on the minN and maxN parameters.
GeneratedHint Object
The mining process returns GeneratedHint objects:
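To give a feel for the counting step, the sketch below mines n-grams from an already-normalized token list. It is a simplified, library-independent illustration: the function mineNgrams and the MinedPhrase type are hypothetical stand-ins and do not reproduce the fields of the library's GeneratedHint.

```typescript
// Simplified stand-in for a mined phrase; NOT the library's GeneratedHint type.
type MinedPhrase = { phrase: string; n: number; count: number };

// Count every n-gram with minN <= n <= maxN and keep those seen at least minCount times.
function mineNgrams(tokens: string[], minN: number, maxN: number, minCount: number): MinedPhrase[] {
    const counts = new Map<string, MinedPhrase>();
    for (let n = minN; n <= maxN; n++) {
        for (let i = 0; i + n <= tokens.length; i++) {
            const phrase = tokens.slice(i, i + n).join(' ');
            const entry = counts.get(phrase);
            if (entry) {
                entry.count += 1;
            } else {
                counts.set(phrase, { phrase, n, count: 1 });
            }
        }
    }
    return Array.from(counts.values())
        .filter((p) => p.count >= minCount)
        .sort((a, b) => b.count - a.count || b.n - a.n); // frequent first, longer phrases break ties
}

const tokens = 'بارك الله فيكم يا شيخ بارك الله فيكم'.split(' ');
const mined = mineNgrams(tokens, 2, 3, 2);
for (const p of mined) {
    console.log(p.n, p.count, p.phrase);
}
```

With minN 2, maxN 3 and minCount 2, only the repeated three-word phrase and its two repeated two-word subphrases survive the count filter.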
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| minN | number | 2 | Minimum phrase length in words |
| maxN | number | 6 | Maximum phrase length in words |
| minCount | number | 2 | Minimum occurrences required |
| topK | number | Infinity | Maximum number of hints to return |
| dedupe | 'closed' \| 'none' | 'closed' | Deduplication strategy |
| stopwords | string[] | [] | Words to ignore (phrases of only stopwords excluded) |
| normalization | ArabicNormalizationOptions | See below | Normalization settings |
| boundaryStrategy | 'none' \| 'segment' | 'segment' | Whether phrases can cross segment boundaries |
Normalization Options
Default normalization for Arabic-first processing:
Using Generated Hints
Once you’ve mined hints, convert them to a Hints object and use them during segmentation:
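Conceptually, matching a hint means scanning the token stream for the phrase and marking a boundary there. The sketch below illustrates that idea only; it does not call the library. The BREAK_MARKER constant and insertBreaks function are hypothetical, and placing the marker in front of the matched phrase is an assumption.

```typescript
// Local stand-in for a hard-break marker. Hypothetical: this is not the library's ALWAYS_BREAK.
const BREAK_MARKER = '<<BREAK>>';

// Insert a marker in front of every position where a hint phrase starts.
// Assumes tokens and hints were normalized the same way beforehand.
function insertBreaks(tokens: string[], hints: string[][]): string[] {
    const out: string[] = [];
    tokens.forEach((token, i) => {
        const hintStartsHere = hints.some(
            (hint) => hint.length > 0 && hint.every((word, j) => tokens[i + j] === word),
        );
        // No marker at the very start: there is nothing before it to separate from.
        if (hintStartsHere && i > 0) {
            out.push(BREAK_MARKER);
        }
        out.push(token);
    });
    return out;
}

const stream = ['السلام', 'عليكم', 'بارك', 'الله', 'فيكم', 'يا', 'شيخ'];
const marked = insertBreaks(stream, [['بارك', 'الله', 'فيكم']]);
console.log(marked.join(' '));
```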
When a hint is matched, an ALWAYS_BREAK marker is inserted, which creates a hard boundary that prevents segments from being merged.
Generating from Segments
For segment-based transcriptions, use generateHintsFromSegments:
Boundary Strategies
- 'segment' (default): Phrases cannot span segment boundaries. Each segment is mined independently, then the results are merged.
- 'none': Phrases can span segment boundaries. All tokens are treated as a continuous stream.
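The difference between the two strategies can be seen by listing the n-grams each one produces. This is a minimal sketch with a hypothetical helper, listNgrams, operating on segments that are already split into tokens; it is not how the library is implemented.

```typescript
type BoundaryStrategy = 'segment' | 'none';

// Hypothetical helper: list every n-gram of length n under the chosen strategy.
function listNgrams(segments: string[][], n: number, strategy: BoundaryStrategy): string[] {
    // 'none' joins all segments into one stream; 'segment' keeps each window inside one segment.
    const streams: string[][] =
        strategy === 'none' ? [segments.reduce<string[]>((all, s) => all.concat(s), [])] : segments;
    const ngrams: string[] = [];
    for (const stream of streams) {
        for (let i = 0; i + n <= stream.length; i++) {
            ngrams.push(stream.slice(i, i + n).join(' '));
        }
    }
    return ngrams;
}

const segments = [
    ['جزاكم', 'الله'],
    ['خيرا', 'يا', 'شيخ'],
];
const perSegment = listNgrams(segments, 2, 'segment');
const continuous = listNgrams(segments, 2, 'none');
console.log(perSegment.length, continuous.length); // 3 4
```

The extra bigram under 'none' is the one that straddles the boundary between the two segments.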
Deduplication Strategies
Closed Deduplication (dedupe: 'closed')
Removes subphrases that always appear within longer phrases:
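One way to read this rule: a shorter phrase is dropped when some longer phrase contains it and occurs exactly as often, since the shorter one then never appears on its own. The sketch below implements that reading. It is an interpretation for illustration, and the names CountedPhrase, containsRun and dedupeClosed are hypothetical.

```typescript
type CountedPhrase = { words: string[]; count: number };

// True when `inner` occurs as a contiguous run of words inside `outer`.
function containsRun(outer: string[], inner: string[]): boolean {
    for (let i = 0; i + inner.length <= outer.length; i++) {
        if (inner.every((word, j) => outer[i + j] === word)) {
            return true;
        }
    }
    return false;
}

// Drop a phrase when some longer phrase contains it and has the same count.
function dedupeClosed(phrases: CountedPhrase[]): CountedPhrase[] {
    return phrases.filter(
        (p) =>
            !phrases.some(
                (q) =>
                    q.words.length > p.words.length &&
                    q.count === p.count &&
                    containsRun(q.words, p.words),
            ),
    );
}

const candidates: CountedPhrase[] = [
    { words: ['بارك', 'الله', 'فيكم'], count: 3 },
    { words: ['بارك', 'الله'], count: 3 }, // only ever seen inside the 3-word phrase
    { words: ['الله', 'فيكم'], count: 5 }, // also seen elsewhere, so it is kept
];
const closed = dedupeClosed(candidates);
console.log(closed.map((p) => p.words.join(' ')));
```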
No Deduplication (dedupe: 'none')
Keeps all frequent phrases, including subphrases:
Limiting Results
Control the number of hints returned:
Working with Stopwords
Exclude phrases that consist only of common words:
Surface Form Variants
The mining process tracks multiple surface forms for normalized phrases:
Complete Example
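The following is a library-independent sketch that ties the ideas on this page together: normalization, n-gram counting, stopword filtering, surface-form tracking, a minimum count, and a topK limit. It does not call paragrafs. Every name in it (MiningOptions, MinedPhrase, normalizeToken, minePhrases) is hypothetical, and the normalization rules are assumptions.

```typescript
// Library-independent sketch; every name here is hypothetical.
type MiningOptions = { minN: number; maxN: number; minCount: number; topK: number; stopwords: string[] };
type MinedPhrase = { phrase: string; n: number; count: number; surfaceForms: string[] };

// Assumed normalization: strip diacritics, unify alef variants, map ى to ي.
function normalizeToken(token: string): string {
    return token
        .replace(/[\u064B-\u0652\u0670]/g, '')
        .replace(/[\u0622\u0623\u0625\u0671]/g, '\u0627')
        .replace(/\u0649/g, '\u064A');
}

function minePhrases(tokens: string[], options: MiningOptions): MinedPhrase[] {
    const normalized = tokens.map(normalizeToken);
    const stopwords = new Set(options.stopwords.map(normalizeToken));
    const found = new Map<string, { n: number; count: number; surface: Set<string> }>();

    for (let n = options.minN; n <= options.maxN; n++) {
        for (let i = 0; i + n <= normalized.length; i++) {
            const words = normalized.slice(i, i + n);
            // Skip phrases made up entirely of stopwords.
            if (words.every((word) => stopwords.has(word))) {
                continue;
            }
            const key = words.join(' ');
            const entry = found.get(key) ?? { n, count: 0, surface: new Set<string>() };
            entry.count += 1;
            // Remember the original spelling behind the normalized key.
            entry.surface.add(tokens.slice(i, i + n).join(' '));
            found.set(key, entry);
        }
    }

    return Array.from(found.entries())
        .map(([phrase, e]) => ({ phrase, n: e.n, count: e.count, surfaceForms: Array.from(e.surface) }))
        .filter((p) => p.count >= options.minCount)
        .sort((a, b) => b.count - a.count || b.n - a.n) // most frequent first, longer phrases break ties
        .slice(0, options.topK);
}

const sample = ['أحسن', 'الله', 'إليكم', 'يا', 'شيخ', 'احسن', 'الله', 'اليكم', 'يا', 'شيخ'];
const result = minePhrases(sample, { minN: 2, maxN: 3, minCount: 2, topK: 2, stopwords: ['يا', 'شيخ'] });
for (const hint of result) {
    console.log(hint.phrase, hint.count, hint.surfaceForms);
}
```

In this sample the two spellings of the opening phrase share one normalized key, so they are counted together while both original spellings are kept as surface forms.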
Best Practices
Choose appropriate minCount
Set minCount based on your corpus size. For small datasets (< 100 segments), use minCount: 2. For larger datasets, use minCount: 5 or higher to avoid noise.
Tune phrase length
Start with minN: 2, maxN: 4 for most use cases. Increase maxN to 6 or 7 for technical or religious content with longer formulaic phrases.
Use normalization consistently
Ensure the normalization options match between generateHintsFromTokens and createHints. Mismatched normalization will prevent hints from matching.
Inspect results before using
Always review the mined hints before using them in production. Some frequent phrases may not be good paragraph boundaries.
Next Steps
Arabic Support
Learn more about Arabic text normalization and hint matching