Hint generation functions mine frequent n-grams from token streams and return sorted hint candidates. This is particularly useful for Arabic transcripts where repeated phrases like “أحسن الله إليكم” should trigger segment breaks.
generateHintsFromTokens
Mine frequent n-grams from a token stream and return hint candidates sorted by frequency. This is Arabic-first: mining is performed on normalized token text.
function generateHintsFromTokens(
    tokens: Token[],
    options?: GenerateHintsOptions
): GeneratedHint[]
Parameters
tokens
Token[]
Array of tokens to mine for repeated phrases
options
GenerateHintsOptions
Configuration options for hint generation
minN
Minimum n-gram length (number of words)
maxN
Maximum n-gram length (number of words)
minCount
Minimum number of occurrences to be considered a hint
Maximum number of hints to return
dedupe
'closed' | 'none'
default: "closed"
Deduplication strategy:
'closed': Remove sub-phrases that always appear within longer phrases
'none': Keep all phrases
Words to ignore when mining (phrases consisting only of stopwords are skipped)
normalization
ArabicNormalizationOptions
Arabic normalization options (defaults: normalizeAlef: true, normalizeYa: true, removeTatweel: true)
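To make the mining options concrete, here is a minimal sketch of n-gram counting followed by a 'closed'-style deduplication pass. It is not the library's implementation: `countNgrams`, `dedupeClosed`, and the `NgramCount` type are hypothetical names, and the rule "drop a phrase when a longer phrase containing it has the same count" is one plausible reading of "always appear within longer phrases".

```typescript
// Hypothetical sketch, not the paragrafs implementation.
type NgramCount = { phrase: string; count: number; length: number };

// Count every n-gram of minN..maxN words in a list of already-normalized words.
function countNgrams(words: string[], minN: number, maxN: number): Map<string, NgramCount> {
    const counts = new Map<string, NgramCount>();
    for (let n = minN; n <= maxN; n++) {
        for (let i = 0; i + n <= words.length; i++) {
            const phrase = words.slice(i, i + n).join(' ');
            const entry = counts.get(phrase) ?? { phrase, count: 0, length: n };
            entry.count += 1;
            counts.set(phrase, entry);
        }
    }
    return counts;
}

// Assumed meaning of 'closed': drop a phrase when a longer candidate contains it
// and occurs exactly as often, i.e. the shorter phrase never appears on its own.
function dedupeClosed(candidates: NgramCount[]): NgramCount[] {
    return candidates.filter(
        (short) =>
            !candidates.some(
                (long) =>
                    long.length > short.length &&
                    long.count === short.count &&
                    ` ${long.phrase} `.includes(` ${short.phrase} `)
            )
    );
}

// Placeholder words stand in for normalized token text.
const words = 'a b c x a b c y a b z'.split(' ');
const minCount = 2;
const frequent = [...countNgrams(words, 2, 3).values()].filter((c) => c.count >= minCount);
const closed = dedupeClosed(frequent);

console.log(closed.map((c) => `${c.phrase} (${c.count})`));
// [ 'a b (3)', 'a b c (2)' ]
```

Here "b c" is dropped because both of its occurrences sit inside "a b c", while "a b" is kept because it also occurs once outside the longer phrase. With dedupe: 'none', all three candidates would be returned.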
Returns
Array of generated hints sorted by frequency, then length, then alphabetically. Each GeneratedHint has the following properties:
phrase
The most common surface form of this phrase
normalizedPhrase
The normalized version used for matching
count
Number of times this phrase appears
length
Number of words in the phrase
firstOccurrenceIndex
Token index where this phrase first appears
topSurfaceForms
Up to 3 most common variations of this phrase
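The documented sort order can be pictured as a three-level comparator. The sketch below is illustrative only: `HintLike` and `compareHints` are hypothetical names, and the directions (higher count first, longer phrase first, then ascending code-unit order) are assumptions, since the reference states only the order of the keys.

```typescript
// Hypothetical sketch of the ordering; key directions are assumptions.
type HintLike = { phrase: string; count: number; length: number };

function compareHints(a: HintLike, b: HintLike): number {
    if (a.count !== b.count) return b.count - a.count; // more frequent first (assumed)
    if (a.length !== b.length) return b.length - a.length; // longer first (assumed)
    if (a.phrase < b.phrase) return -1; // then alphabetical
    if (a.phrase > b.phrase) return 1;
    return 0;
}

const unsorted: HintLike[] = [
    { phrase: 'b c', count: 2, length: 2 },
    { phrase: 'a b c', count: 2, length: 3 },
    { phrase: 'a b', count: 3, length: 2 },
    { phrase: 'a c', count: 2, length: 2 },
];

const sortedPhrases = [...unsorted].sort(compareHints).map((h) => h.phrase);
console.log(sortedPhrases);
// [ 'a b', 'a b c', 'a c', 'b c' ]
```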
Example
import { generateHintsFromTokens, createHints } from 'paragrafs';

const tokens = [
    { start: 0, end: 1, text: 'أَحْسَنَ' },
    { start: 1, end: 2, text: 'الله' },
    { start: 2, end: 3, text: 'إليكم،' },
    { start: 3, end: 4, text: 'شيخنا' },
    { start: 5, end: 6, text: 'أَحْسَنَ' },
    { start: 6, end: 7, text: 'الله' },
    { start: 7, end: 8, text: 'إليكم' },
    // ... more tokens ...
];

const mined = generateHintsFromTokens(tokens, {
    minN: 2,
    maxN: 4,
    minCount: 2,
    dedupe: 'closed',
    normalization: { normalizeAlef: true }
});

console.log(mined);
// [
//   {
//     phrase: 'أحسن الله إليكم',
//     normalizedPhrase: 'احسن الله اليكم',
//     count: 2,
//     length: 3,
//     firstOccurrenceIndex: 0,
//     topSurfaceForms: ['أحسن الله إليكم', 'أَحْسَنَ الله إليكم،']
//   }
// ]

// Convert top hints to usable hints
const hints = createHints(
    { normalizeAlef: true },
    ...mined.slice(0, 25).map(h => h.phrase)
);
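In the example output, the diacritized and punctuated surface form 'أَحْسَنَ الله إليكم،' is grouped with the plain form under one normalizedPhrase. The sketch below shows one way such a normalization step could work. It is an assumption-laden illustration, not the library's code: `normalizeArabicWord` and `NormalizationSketchOptions` are hypothetical names, the option names are borrowed from the defaults listed above, and the exact character mappings (including always stripping diacritics and punctuation, and mapping alef maqsura to ya) are guesses.

```typescript
// Hypothetical sketch, not the paragrafs implementation.
type NormalizationSketchOptions = {
    normalizeAlef?: boolean;
    normalizeYa?: boolean;
    removeTatweel?: boolean;
};

function normalizeArabicWord(word: string, options: NormalizationSketchOptions = {}): string {
    const { normalizeAlef = true, normalizeYa = true, removeTatweel = true } = options;

    // Assumption: short-vowel marks (U+064B..U+0652, U+0670) are always stripped.
    let result = word.replace(/[\u064B-\u0652\u0670]/g, '');
    // Assumption: Arabic and Latin punctuation is always stripped.
    result = result.replace(/[\u060C\u061B\u061F.,!?:;]/g, '');

    if (removeTatweel) result = result.replace(/\u0640/g, ''); // drop the elongation character
    if (normalizeAlef) result = result.replace(/[\u0622\u0623\u0625]/g, '\u0627'); // alef variants -> bare alef
    if (normalizeYa) result = result.replace(/\u0649/g, '\u064A'); // alef maqsura -> ya (assumed direction)

    return result;
}

// Two surface forms of the same phrase, taken from the example tokens.
const surfaceForms = [
    ['أَحْسَنَ', 'الله', 'إليكم،'],
    ['أحسن', 'الله', 'إليكم'],
];
const normalizedKeys = surfaceForms.map((form) => form.map((w) => normalizeArabicWord(w)).join(' '));

// Both forms should collapse to the same key, so they are counted as one phrase.
console.log(normalizedKeys);
```

Counting on the normalized key while remembering the original spellings is what would allow a result to report both a normalizedPhrase and its topSurfaceForms.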
Use Cases
Auto-discovery: Automatically find repeated phrases in long transcripts
Quality improvement: Identify common expressions that should trigger segment breaks
Arabic lectures: Find repeated phrases like greetings, blessings, and transitions
Custom segmentation: Use discovered phrases to improve transcript formatting
generateHintsFromSegments
Mine frequent n-grams from segments. By default, phrases cannot cross segment boundaries (use boundaryStrategy: 'none' to mine across boundaries).
function generateHintsFromSegments(
    segments: Segment[],
    options?: GenerateHintsOptions
): GeneratedHint[]
Parameters
segments
Segment[]
Array of segments to mine for repeated phrases
options
GenerateHintsOptions
Configuration options (same as generateHintsFromTokens, plus boundaryStrategy)
boundaryStrategy
'segment' | 'none'
default: "segment"
'segment': Phrases cannot cross segment boundaries (mines per-segment and merges results)
'none': Phrases can cross segment boundaries (treats all tokens as one stream)
Returns
Array of generated hints sorted by frequency, then length, then alphabetically
Example
import { generateHintsFromSegments } from 'paragrafs';

const segments = [
    {
        start: 0,
        end: 10,
        text: 'أحسن الله إليكم يا شيخ',
        tokens: [
            { start: 0, end: 2, text: 'أحسن' },
            { start: 2, end: 4, text: 'الله' },
            { start: 4, end: 6, text: 'إليكم' },
            { start: 6, end: 8, text: 'يا' },
            { start: 8, end: 10, text: 'شيخ' }
        ]
    },
    {
        start: 10,
        end: 18,
        text: 'بارك الله فيكم',
        tokens: [
            { start: 10, end: 13, text: 'بارك' },
            { start: 13, end: 15, text: 'الله' },
            { start: 15, end: 18, text: 'فيكم' }
        ]
    },
    // ... more segments with repeated phrases ...
];

// Default: phrases don't cross segment boundaries
const hints = generateHintsFromSegments(segments, {
    minN: 2,
    maxN: 4,
    minCount: 2
});

// Allow phrases to cross segment boundaries
const crossBoundaryHints = generateHintsFromSegments(segments, {
    minN: 2,
    maxN: 4,
    minCount: 2,
    boundaryStrategy: 'none'
});
Boundary Strategy Comparison
boundaryStrategy: 'segment' (default)
Mines each segment independently
Merges results across segments
Phrases like “end_of_segment start_of_next” won’t be detected
Recommended for most use cases
boundaryStrategy: 'none'
Treats all segments as one continuous token stream
Can detect phrases that span segment boundaries
May find less meaningful phrases at segment edges
Useful for finding transitions between segments
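The difference between the two strategies comes down to where the n-gram window is allowed to slide. The following sketch illustrates that with bigrams only. `SegmentLike`, `bigrams`, and `mineBigrams` are hypothetical helpers written for this illustration and are not part of the library.

```typescript
// Hypothetical sketch, not the paragrafs implementation.
type SegmentLike = { words: string[] };

// All adjacent word pairs in a list of words.
function bigrams(words: string[]): string[] {
    const result: string[] = [];
    for (let i = 0; i + 2 <= words.length; i++) {
        result.push(`${words[i]} ${words[i + 1]}`);
    }
    return result;
}

function mineBigrams(segments: SegmentLike[], boundaryStrategy: 'segment' | 'none'): string[] {
    if (boundaryStrategy === 'none') {
        // One continuous stream: the window may straddle two segments.
        return bigrams(segments.flatMap((s) => s.words));
    }
    // Per-segment mining: the window never leaves a segment; results are merged.
    return segments.flatMap((s) => bigrams(s.words));
}

const sampleSegments: SegmentLike[] = [{ words: ['a', 'b', 'c'] }, { words: ['d', 'e'] }];

const perSegment = mineBigrams(sampleSegments, 'segment');
const crossBoundary = mineBigrams(sampleSegments, 'none');

console.log(perSegment); // [ 'a b', 'b c', 'd e' ]
console.log(crossBoundary); // [ 'a b', 'b c', 'c d', 'd e' ]
```

The extra "c d" pair in the second result is exactly the kind of "end_of_segment start_of_next" phrase that the default strategy excludes.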
Use Cases
Segment-aware mining: Find repeated phrases within natural segment boundaries
Lecture analysis: Identify repeated expressions in educational content
Quality metrics: Measure how often specific phrases appear across a transcript
Custom formatting: Use discovered patterns to improve segment formatting