Overview
The hints system lets you specify multi-word phrases that should always trigger a paragraph break. It is particularly useful for Arabic transcriptions because hint phrases and tokens pass through the same built-in normalization before they are compared.
Creating Hints
Hints are created from one or more phrases:
import { createHints } from 'paragrafs';
const hints = createHints(
'next topic',
'moving on',
'in conclusion'
);
How Hints Work
Hints are organized into a map indexed by the first word of each phrase:
export type HintMap = Record<string, string[][]>;
export type Hints = {
map: HintMap;
normalization: Required<ArabicNormalizationOptions>;
};
Example Structure
const hints = createHints('next topic', 'next section', 'moving on');
// Internal structure:
// {
// map: {
// "next": [["next", "topic"], ["next", "section"]],
// "moving": [["moving", "on"]]
// },
// normalization: { ... }
// }
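To make the first-word index concrete, here is a minimal sketch of how such a map could be built. `buildHintMap` is a hypothetical helper written for illustration, not the library's implementation, and it assumes phrases are split on whitespace without any normalization.

```typescript
type HintMap = Record<string, string[][]>;

// Hypothetical helper (not the library's code): groups phrases by first word.
// Assumption: phrases are split on whitespace and stored as-is, with no
// normalization applied.
const buildHintMap = (...phrases: string[]): HintMap => {
    // A null-prototype object avoids collisions with inherited keys
    // such as "constructor".
    const map: HintMap = Object.create(null);
    for (const phrase of phrases) {
        const words = phrase.trim().split(/\s+/).filter(Boolean);
        const first = words[0];
        if (first === undefined) {
            continue; // nothing to index for an empty phrase
        }
        const bucket = map[first] ?? [];
        bucket.push(words);
        map[first] = bucket;
    }
    return map;
};

const exampleMap = buildHintMap('next topic', 'next section', 'moving on');
console.log(JSON.stringify(exampleMap));
// {"next":[["next","topic"],["next","section"]],"moving":[["moving","on"]]}
```

Phrases that share a first word land in the same bucket, which is what lets a lookup skip every phrase that cannot start at the current token.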
ALWAYS_BREAK Marker
When a hint matches, the ALWAYS_BREAK marker is inserted:
if (hints && normalizedTexts && isHintMatched(normalizedTexts, hints, idx)) {
marked.push(ALWAYS_BREAK);
}
ALWAYS_BREAK creates a hard paragraph boundary that cannot be merged away, unlike SEGMENT_BREAK, which is only a soft suggestion.
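The difference between a hard and a soft boundary can be illustrated with a toy model. Everything in the sketch below is an assumption made for illustration: the markers are plain strings standing in for the library's constants, `toParagraphs` is a hypothetical function, and the merge rule (fold a short segment into the previous one) is a simplified stand-in for the library's real merging logic.

```typescript
// Hypothetical stand-ins for the library's marker constants.
const SOFT_BREAK = 'SEGMENT_BREAK';
const HARD_BREAK = 'ALWAYS_BREAK';

type Segment = { words: string[]; startsAfterHardBreak: boolean };

// Toy model: split a marked stream into segments, then merge short segments
// into the previous one unless a hard break separates them.
const toParagraphs = (marked: string[], minWords: number): string[] => {
    const segments: Segment[] = [];
    let current: Segment | null = null;
    let pendingHard = false;

    for (const item of marked) {
        if (item === SOFT_BREAK || item === HARD_BREAK) {
            current = null; // any break closes the open segment
            pendingHard = pendingHard || item === HARD_BREAK;
            continue;
        }
        if (current === null) {
            current = { words: [], startsAfterHardBreak: pendingHard };
            segments.push(current);
            pendingHard = false;
        }
        current.words.push(item);
    }

    const merged: Segment[] = [];
    for (const segment of segments) {
        const previous = merged[merged.length - 1];
        const isShort = segment.words.length < minWords;
        if (previous && isShort && !segment.startsAfterHardBreak) {
            previous.words.push(...segment.words); // soft boundary: merge
        } else {
            merged.push(segment); // hard boundary or long enough: keep
        }
    }
    return merged.map((s) => s.words.join(' '));
};

const paragraphs = toParagraphs(
    ['a', 'b', SOFT_BREAK, 'c', HARD_BREAK, 'd', SOFT_BREAK, 'e', 'f', 'g'],
    2,
);
console.log(paragraphs); // [ 'a b c', 'd', 'e f g' ]
```

In this model the one-word segment "c" is absorbed across the soft break, while the equally short "d" survives because a hard break precedes it.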
Normalization Options
Hints support Arabic-specific normalization for robust matching:
export type ArabicNormalizationOptions = {
normalizeAlef?: boolean; // أإآ → ا
normalizeHamza?: boolean; // ؤئ → ء
normalizeYa?: boolean; // ى → ي
removeTatweel?: boolean; // Remove ـ
};
Default Normalization
const DEFAULT_HINT_NORMALIZATION = {
normalizeAlef: true,
normalizeHamza: false,
normalizeYa: true,
removeTatweel: true
};
Custom Normalization
Override normalization by passing options as the first argument:
const hints = createHints(
{
normalizeAlef: true,
normalizeHamza: true,
normalizeYa: true,
removeTatweel: true
},
'الموضوع التالي', // Next topic in Arabic
'في الختام' // In conclusion in Arabic
);
All hints in a single createHints call use the same normalization settings.
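The sketch below shows one way an optional leading options object could be separated from the phrases and combined with the defaults. `parseHintArgs` is a hypothetical helper, and merging partial options over the defaults is an assumption suggested by the `Required<ArabicNormalizationOptions>` type, not confirmed library behavior.

```typescript
type ArabicNormalizationOptions = {
    normalizeAlef?: boolean;
    normalizeHamza?: boolean;
    normalizeYa?: boolean;
    removeTatweel?: boolean;
};

// Defaults as listed under "Default Normalization".
const DEFAULTS: Required<ArabicNormalizationOptions> = {
    normalizeAlef: true,
    normalizeHamza: false,
    normalizeYa: true,
    removeTatweel: true,
};

// Hypothetical helper: separates an optional leading options object from the
// phrases and fills in any settings the caller omitted.
const parseHintArgs = (
    ...args: (ArabicNormalizationOptions | string)[]
): { normalization: Required<ArabicNormalizationOptions>; phrases: string[] } => {
    const first = args[0];
    const overrides: ArabicNormalizationOptions = typeof first === 'object' ? first : {};
    const phrases = args.filter((arg): arg is string => typeof arg === 'string');
    return { normalization: { ...DEFAULTS, ...overrides }, phrases };
};

const parsed = parseHintArgs({ normalizeHamza: true }, 'الموضوع التالي', 'في الختام');
console.log(parsed.normalization.normalizeHamza); // true (overridden)
console.log(parsed.normalization.removeTatweel); // true (default kept)
console.log(parsed.phrases.length); // 2
```

Because the resolved settings are stored once per call, every phrase in that call is normalized the same way.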
Token Text Normalization
The normalizeTokenText function applies the same normalization to both hints and tokens:
export const normalizeTokenText = (
text: string,
options?: ArabicNormalizationOptions
): string => {
let input = text;
// Hamza normalization (if enabled)
if (options?.normalizeHamza) {
input = input
.normalize('NFD')
.replace(/\u064A\p{Mn}*\u0654/gu, 'ء') // ي + hamza
.replace(/\u0648\p{Mn}*\u0654/gu, 'ء') // و + hamza
.replace(/[\u0654\u0655]/g, '') // Remove hamza marks
.normalize('NFC');
}
let normalized = normalizeWord(input);
if (options?.removeTatweel) {
normalized = normalized.replace(/\u0640/g, '');
}
if (options?.normalizeAlef) {
normalized = normalized.replace(/[أإآ]/g, 'ا');
}
if (options?.normalizeYa) {
normalized = normalized.replace(/ى/g, 'ي');
}
return normalized;
};
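To see what each option does to an actual token, the following sketch runs a few Arabic words through the function. Both `normalizeTokenText` and `normalizeWord` (listed under Base Normalization) are copied from this page so the snippet runs on its own; the only change is that Arabic letters in the patterns are written as `\u` escapes. The sample words and `defaultOptions` are illustrative choices.

```typescript
type ArabicNormalizationOptions = {
    normalizeAlef?: boolean;
    normalizeHamza?: boolean;
    normalizeYa?: boolean;
    removeTatweel?: boolean;
};

// Copied from the Base Normalization listing.
const normalizeWord = (w: string): string =>
    w
        .normalize('NFD')
        .replace(/[\u200B-\u200D\uFEFF]/g, '')
        .replace(/\p{Mn}/gu, '')
        .replace(/[\u064B-\u065F]/g, '')
        .replace(/^[\p{P}\p{S}\p{Cf}]+|[\p{P}\p{S}\p{Cf}]+$/gu, '')
        .normalize('NFC');

// Copied from the listing above; letters are written as escapes.
const normalizeTokenText = (text: string, options?: ArabicNormalizationOptions): string => {
    let input = text;
    if (options?.normalizeHamza) {
        input = input
            .normalize('NFD')
            .replace(/\u064A\p{Mn}*\u0654/gu, '\u0621') // ya + hamza -> hamza
            .replace(/\u0648\p{Mn}*\u0654/gu, '\u0621') // waw + hamza -> hamza
            .replace(/[\u0654\u0655]/g, '') // remaining hamza marks
            .normalize('NFC');
    }
    let normalized = normalizeWord(input);
    if (options?.removeTatweel) normalized = normalized.replace(/\u0640/g, '');
    if (options?.normalizeAlef) normalized = normalized.replace(/[\u0623\u0625\u0622]/g, '\u0627');
    if (options?.normalizeYa) normalized = normalized.replace(/\u0649/g, '\u064A');
    return normalized;
};

const defaultOptions = { normalizeAlef: true, normalizeHamza: false, normalizeYa: true, removeTatweel: true };

// Tatweel, tanween, and the trailing Arabic comma are all removed.
console.log(normalizeTokenText('مـــرحبًا،', defaultOptions)); // مرحبا
// Alef maqsura becomes ya.
console.log(normalizeTokenText('مصطفى', defaultOptions)); // مصطفي
// Hamza on waw: the seat letter survives by default...
console.log(normalizeTokenText('مسؤول', defaultOptions)); // مسوول
// ...and collapses to a bare hamza when normalizeHamza is enabled.
console.log(normalizeTokenText('مسؤول', { ...defaultOptions, normalizeHamza: true })); // مسءول
```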
Hint Matching Algorithm
Matching happens in three steps:
1. Normalize All Tokens
const normalizedTexts = hints
? tokens.map(t => normalizeTokenText(t.text, hints.normalization))
: null;
2. Check for Matches
export const isHintMatched = (
normalizedTokens: string[],
hints: Hints,
index: number
): boolean => {
const key = normalizedTokens[index];
const candidates = hints.map[key];
if (!candidates) {
return false;
}
for (const words of candidates) {
if (isHintSequenceMatchedAtIndex(normalizedTokens, words, index)) {
return true;
}
}
return false;
};
3. Verify Sequence Match
const isHintSequenceMatchedAtIndex = (
normalizedTokens: string[],
words: string[],
index: number
): boolean => {
if (index + words.length > normalizedTokens.length) {
return false;
}
for (let k = 0; k < words.length; k++) {
if (normalizedTokens[index + k] !== words[k]) {
return false;
}
}
return true;
};
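The steps above can be exercised end to end on a small input. In this sketch the sequence check is the same as the listing above, while `isHintMatchedInMap` is a simplified variant that takes the map directly instead of a full `Hints` object. The hand-built map and the token list are illustrative, and the tokens are assumed to be normalized already.

```typescript
type HintMap = Record<string, string[][]>;

// Same check as isHintSequenceMatchedAtIndex above.
const isHintSequenceMatchedAtIndex = (
    normalizedTokens: string[],
    words: string[],
    index: number,
): boolean => {
    if (index + words.length > normalizedTokens.length) {
        return false; // phrase would run past the end of the tokens
    }
    for (let k = 0; k < words.length; k++) {
        if (normalizedTokens[index + k] !== words[k]) {
            return false;
        }
    }
    return true;
};

// Simplified variant of isHintMatched: receives the map directly.
const isHintMatchedInMap = (normalizedTokens: string[], map: HintMap, index: number): boolean => {
    const key = normalizedTokens[index];
    if (key === undefined) {
        return false;
    }
    const candidates = map[key];
    if (!candidates) {
        return false;
    }
    return candidates.some((words) => isHintSequenceMatchedAtIndex(normalizedTokens, words, index));
};

// Hand-built map, shaped like the "Example Structure" output.
const hintMap: HintMap = {
    next: [['next', 'topic'], ['next', 'section']],
    moving: [['moving', 'on']],
};

// Assumed to be the result of step 1 (already normalized).
const normalizedTokens = ['hello', 'next', 'topic', 'then', 'next', 'week', 'moving', 'on', 'moving'];

const matchedIndices = normalizedTokens
    .map((_, i) => i)
    .filter((i) => isHintMatchedInMap(normalizedTokens, hintMap, i));
console.log(matchedIndices); // [ 1, 6 ]
```

Index 4 ("next week") shares a first word with two hints but matches neither, and the final "moving" fails the length check because no token follows it.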
Complete Example
import {
createHints,
markTokensWithDividers,
groupMarkedTokensIntoSegments,
mergeShortSegmentsWithPrevious
} from 'paragrafs';
const tokens = [
{ start: 0, end: 1, text: "Hello" },
{ start: 1, end: 2, text: "everyone" },
{ start: 2, end: 3, text: "next" },
{ start: 3, end: 4, text: "topic" },
{ start: 4, end: 5, text: "will" },
{ start: 5, end: 6, text: "be" }
];
// Create hints for "next topic"
const hints = createHints('next topic');
// Mark tokens with dividers
const marked = markTokensWithDividers(tokens, {
gapThreshold: 1.0,
hints
});
// Result:
// [
// { start: 0, end: 1, text: "Hello" },
// { start: 1, end: 2, text: "everyone" },
// SEGMENT_BREAK,
// ALWAYS_BREAK, // Inserted because "next topic" matched!
// { start: 2, end: 3, text: "next" },
// { start: 3, end: 4, text: "topic" },
// { start: 4, end: 5, text: "will" },
// { start: 5, end: 6, text: "be" }
// ]
Arabic Example
const hints = createHints(
{
normalizeAlef: true,
normalizeYa: true
},
'الموضوع القادم', // "The next topic"
'وفي الختام' // "And in conclusion"
);
const tokens = [
{ start: 0, end: 1, text: "مرحبا" },
{ start: 1, end: 2, text: "بكم" },
{ start: 2, end: 3, text: "الموضوع" },
{ start: 3, end: 4, text: "القادم" },
{ start: 4, end: 5, text: "سيكون" }
];
const marked = markTokensWithDividers(tokens, {
gapThreshold: 1.0,
hints
});
// "الموضوع القادم" will be matched even with different diacritics
// or alef variants in the actual tokens!
Normalization makes matching robust against variations in diacritics, leading and trailing punctuation, and Arabic letter forms.
Using Hints in Paragraph Reconstruction
import { markAndCombineSegments, createHints } from 'paragrafs';
const hints = createHints(
'next section',
'to summarize',
'in conclusion',
'moving on'
);
const markedSegments = markAndCombineSegments(segments, {
fillers: ['uh', 'um'],
gapThreshold: 1.5,
maxSecondsPerSegment: 30,
minWordsPerSegment: 5,
hints // Pass hints to force breaks at these phrases
});
Finding Matching Tokens
Use getFirstMatchingToken to find where a phrase occurs:
import { getFirstMatchingToken } from 'paragrafs';
const tokens = [
{ start: 0, end: 1, text: 'the' },
{ start: 1, end: 2, text: 'quick' },
{ start: 2, end: 3, text: 'brown' },
{ start: 3, end: 4, text: 'fox' }
];
const match = getFirstMatchingToken(tokens, 'quick brown');
// Returns: { start: 1, end: 2, text: 'quick' }
const noMatch = getFirstMatchingToken(tokens, 'lazy dog');
// Returns: null
This function internally uses createHints with default normalization.
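For readers who want to see the lookup semantics in isolation, here is a rough sketch of a phrase finder with the same return shape. `findFirstPhraseToken` is a hypothetical re-implementation, not the library's code: it assumes whitespace splitting and exact, case-sensitive comparison in place of the library's normalization.

```typescript
type Token = { start: number; end: number; text: string };

// Hypothetical re-implementation for illustration: returns the first token of
// the first occurrence of `phrase`, or null when the phrase does not occur.
// Assumption: exact comparison replaces the library's normalization.
const findFirstPhraseToken = (tokens: Token[], phrase: string): Token | null => {
    const words = phrase.trim().split(/\s+/).filter(Boolean);
    if (words.length === 0) {
        return null;
    }
    // Stop once the phrase can no longer fit in the remaining tokens.
    for (let i = 0; i + words.length <= tokens.length; i++) {
        if (words.every((word, k) => tokens[i + k]?.text === word)) {
            return tokens[i] ?? null;
        }
    }
    return null;
};

const sampleTokens: Token[] = [
    { start: 0, end: 1, text: 'the' },
    { start: 1, end: 2, text: 'quick' },
    { start: 2, end: 3, text: 'brown' },
    { start: 3, end: 4, text: 'fox' },
];

console.log(findFirstPhraseToken(sampleTokens, 'quick brown')); // { start: 1, end: 2, text: 'quick' }
console.log(findFirstPhraseToken(sampleTokens, 'lazy dog')); // null
```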
Base Normalization
All normalization builds on normalizeWord:
export const normalizeWord = (w: string) => {
return w
.normalize('NFD') // Decompose Unicode
.replace(/[\u200B-\u200D\uFEFF]/g, '') // Zero-width chars
.replace(/\p{Mn}/gu, '') // Combining marks
.replace(/[\u064B-\u065F]/g, '') // Arabic diacritics
.replace(/^[\p{P}\p{S}\p{Cf}]+|[\p{P}\p{S}\p{Cf}]+$/gu, '') // Trim punctuation
.normalize('NFC'); // Recompose Unicode
};
This handles:
- Unicode normalization: NFD/NFC for consistent representation
- Zero-width characters: U+200B–U+200D, U+FEFF
- Combining marks: Diacritical marks (\p{Mn})
- Arabic diacritics: U+064B–U+065F (fatḥa, ḍamma, kasra, etc.)
- Punctuation: Leading/trailing symbols
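Each item in the list above can be checked against a concrete input. The sketch below copies `normalizeWord` from the listing so it runs standalone; the sample strings are illustrative.

```typescript
// Copied from the listing above so the snippet runs on its own.
const normalizeWord = (w: string): string =>
    w
        .normalize('NFD')
        .replace(/[\u200B-\u200D\uFEFF]/g, '')
        .replace(/\p{Mn}/gu, '')
        .replace(/[\u064B-\u065F]/g, '')
        .replace(/^[\p{P}\p{S}\p{Cf}]+|[\p{P}\p{S}\p{Cf}]+$/gu, '')
        .normalize('NFC');

// Leading and trailing punctuation is trimmed, including curly quotes.
console.log(normalizeWord('“Hello,”')); // Hello
// Punctuation inside a word is kept; only the trailing period goes.
console.log(normalizeWord("don't.")); // don't
// Leading symbols are trimmed as well.
console.log(normalizeWord('$100')); // 100
// Combining marks are stripped, so Latin accents disappear too.
console.log(normalizeWord('café!')); // cafe
// Zero-width characters inside a word are removed.
console.log(normalizeWord('to\u200Bpic')); // topic
// Arabic short vowels and tanween are removed.
console.log(normalizeWord('كِتَابٌ')); // كتاب
```

Note that case is left untouched by this function, so "Hello" and "hello" remain distinct after normalization.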
Hints are efficient because:
- First-word indexing: Only phrases starting with the current token are checked
- Early termination: Matching stops as soon as one word in the sequence fails to match
- Single normalization pass: Tokens are normalized once, not per hint
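For large hint sets, looking up the first word is O(1) on average. Each candidate phrase sharing that first word is then compared word by word, so the worst case per token is O(n · m), where n is the number of hints sharing the same first word and m is the length of the longest such phrase.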
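The effect of first-word indexing can be made visible by counting how many candidate phrases have to be examined. This is an illustrative sketch only: it uses a `Map` rather than the library's `HintMap` type, and the phrases and tokens are made up for the example.

```typescript
const phrases: string[][] = [
    ['next', 'topic'],
    ['next', 'section'],
    ['moving', 'on'],
    ['in', 'conclusion'],
    ['to', 'summarize'],
];

// Index phrases by first word (a Map is used here for simplicity).
const index = new Map<string, string[][]>();
for (const words of phrases) {
    const first = words[0];
    if (first === undefined) {
        continue;
    }
    const bucket = index.get(first) ?? [];
    bucket.push(words);
    index.set(first, bucket);
}

const tokens = ['hello', 'everyone', 'next', 'topic', 'will', 'be'];

// Naive scan: every phrase is a candidate at every token position.
const naiveCandidates = tokens.length * phrases.length;

// Indexed scan: only phrases in the current token's bucket are candidates.
const indexedCandidates = tokens.reduce(
    (total, token) => total + (index.get(token)?.length ?? 0),
    0,
);

console.log(naiveCandidates, indexedCandidates); // 30 2
```

Only the token "next" has a bucket, so two phrases are examined instead of thirty.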
Best Practices
Group related phrases with the same normalization settings in a single createHints call.
Very short hints (1-2 words) may cause false positives. Use longer, more specific phrases when possible.
Next Steps
Paragraph Reconstruction
Learn how hints integrate with the full reconstruction pipeline
Ground Truth Alignment
Understand normalization in the context of LCS alignment