Hints System

Overview

The hints system lets you specify multi-word phrases that always trigger a paragraph break when they appear in the token stream. Its built-in normalization support makes it particularly useful for Arabic transcriptions.

Creating Hints

Hints are created from one or more phrases:

import { createHints } from 'paragrafs';

const hints = createHints(
    'next topic',
    'moving on',
    'in conclusion'
);

How Hints Work

Hints are organized into a map indexed by the first word of each phrase:

export type HintMap = Record<string, string[][]>;

export type Hints = {
    map: HintMap;
    normalization: Required<ArabicNormalizationOptions>;
};

Example Structure

const hints = createHints('next topic', 'next section', 'moving on');

// Internal structure:
// {
//   map: {
//     "next": [["next", "topic"], ["next", "section"]],
//     "moving": [["moving", "on"]]
//   },
//   normalization: { ... }
// }
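
A structure like the one above could be produced by a small builder such as the following. This is a minimal sketch rather than the library's actual implementation: buildHintMap is a hypothetical name, phrases are assumed to split on whitespace, and normalization is left out for brevity.

```typescript
type HintMap = Record<string, string[][]>;

// Hypothetical helper: groups phrases by their first word.
const buildHintMap = (phrases: string[]): HintMap => {
    const map: HintMap = {};

    for (const phrase of phrases) {
        // Assumption: words are separated by whitespace.
        const words = phrase.trim().split(/\s+/).filter(Boolean);
        const first = words[0];
        if (first === undefined) {
            continue; // Skip empty phrases
        }
        const bucket = map[first] ?? [];
        bucket.push(words);
        map[first] = bucket;
    }

    return map;
};

const hintMap = buildHintMap(['next topic', 'next section', 'moving on']);
console.log(JSON.stringify(hintMap));
// → {"next":[["next","topic"],["next","section"]],"moving":[["moving","on"]]}
```

Keying on the first word means a token only has to be compared against phrases that could possibly start at it.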

ALWAYS_BREAK Marker

When a hint matches, the ALWAYS_BREAK marker is inserted:

if (hints && normalizedTexts && isHintMatched(normalizedTexts, hints, idx)) {
    marked.push(ALWAYS_BREAK);
}

ALWAYS_BREAK creates hard paragraph boundaries that cannot be merged, unlike SEGMENT_BREAK which is a soft suggestion.
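
As a rough illustration of where the marker lands, the sketch below inserts a hard break in front of every token index at which a hint is reported to match. HARD_BREAK is a local stand-in for the library's ALWAYS_BREAK (its real value is not shown in this page), and insertHardBreaks and the matchesAt callback are hypothetical names used only for this example.

```typescript
type Token = { start: number; end: number; text: string };

// Local stand-in for the library's marker; its real value is an assumption.
const HARD_BREAK = 'ALWAYS_BREAK' as const;
type Marked = Token | typeof HARD_BREAK;

// Hypothetical helper: pushes a hard break before each token index
// for which `matchesAt` returns true, then pushes the token itself.
const insertHardBreaks = (
    tokens: Token[],
    matchesAt: (index: number) => boolean
): Marked[] => {
    const marked: Marked[] = [];
    tokens.forEach((token, idx) => {
        if (matchesAt(idx)) {
            marked.push(HARD_BREAK);
        }
        marked.push(token);
    });
    return marked;
};

const sampleTokens: Token[] = [
    { start: 0, end: 1, text: 'hello' },
    { start: 1, end: 2, text: 'next' },
    { start: 2, end: 3, text: 'topic' },
];

// Pretend a hint ("next topic") matched at index 1.
const markedSample = insertHardBreaks(sampleTokens, (idx) => idx === 1);
const markedLabels = markedSample.map((m) => (typeof m === 'string' ? m : m.text));
console.log(markedLabels);
// → ['hello', 'ALWAYS_BREAK', 'next', 'topic']
```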

Normalization Options

Hints support Arabic-specific normalization for robust matching:

export type ArabicNormalizationOptions = {
    normalizeAlef?: boolean;    // أإآ → ا
    normalizeHamza?: boolean;   // ؤئ → ء
    normalizeYa?: boolean;      // ى → ي
    removeTatweel?: boolean;    // Remove ـ
};

Default Normalization

const DEFAULT_HINT_NORMALIZATION = {
    normalizeAlef: true,
    normalizeHamza: false,
    normalizeYa: true,
    removeTatweel: true
};

Custom Normalization

Override normalization by passing options as the first argument:

const hints = createHints(
    {
        normalizeAlef: true,
        normalizeHamza: true,
        normalizeYa: true,
        removeTatweel: true
    },
    'الموضوع التالي',  // Next topic in Arabic
    'في الختام'        // In conclusion in Arabic
);

All hints in a single createHints call use the same normalization settings.
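
Because Hints.normalization is typed as Required<ArabicNormalizationOptions>, partial options presumably get filled in before they are stored. The sketch below shows one way that could work, assuming omitted fields fall back to the defaults through an object spread. resolveNormalization is a hypothetical name, not a library export, and the real merge behavior may differ.

```typescript
type ArabicNormalizationOptions = {
    normalizeAlef?: boolean;
    normalizeHamza?: boolean;
    normalizeYa?: boolean;
    removeTatweel?: boolean;
};

const DEFAULT_HINT_NORMALIZATION: Required<ArabicNormalizationOptions> = {
    normalizeAlef: true,
    normalizeHamza: false,
    normalizeYa: true,
    removeTatweel: true,
};

// Hypothetical helper: fields the caller omits fall back to the defaults.
const resolveNormalization = (
    options: ArabicNormalizationOptions = {}
): Required<ArabicNormalizationOptions> => ({
    ...DEFAULT_HINT_NORMALIZATION,
    ...options,
});

const resolved = resolveNormalization({ normalizeHamza: true });
console.log(resolved);
// → { normalizeAlef: true, normalizeHamza: true, normalizeYa: true, removeTatweel: true }
```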

Token Text Normalization

The normalizeTokenText function applies the same normalization to both hints and tokens:

export const normalizeTokenText = (
    text: string,
    options?: ArabicNormalizationOptions
): string => {
    let input = text;

    // Hamza normalization (if enabled)
    if (options?.normalizeHamza) {
        input = input
            .normalize('NFD')
            .replace(/\u064A\p{Mn}*\u0654/gu, 'ء')  // ي + hamza
            .replace(/\u0648\p{Mn}*\u0654/gu, 'ء')  // و + hamza
            .replace(/[\u0654\u0655]/g, '')          // Remove hamza marks
            .normalize('NFC');
    }

    let normalized = normalizeWord(input);

    if (options?.removeTatweel) {
        normalized = normalized.replace(/\u0640/g, '');
    }

    if (options?.normalizeAlef) {
        normalized = normalized.replace(/[أإآ]/g, 'ا');
    }

    if (options?.normalizeYa) {
        normalized = normalized.replace(/ى/g, 'ي');
    }

    return normalized;
};
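
The hamza branch relies on canonical decomposition: under NFD, ئ becomes ي followed by a combining hamza (U+0654), and ؤ becomes و followed by the same mark, which is what the two regular expressions look for. The sketch below isolates that branch so its effect can be seen on its own. foldHamza is a hypothetical name, and the remaining steps of normalizeTokenText are deliberately left out.

```typescript
// Hamza branch of normalizeTokenText, isolated for illustration only.
const foldHamza = (text: string): string =>
    text
        .normalize('NFD')
        .replace(/\u064A\p{Mn}*\u0654/gu, '\u0621') // ي + hamza → ء
        .replace(/\u0648\p{Mn}*\u0654/gu, '\u0621') // و + hamza → ء
        .replace(/[\u0654\u0655]/g, '')             // drop remaining hamza marks
        .normalize('NFC');

const foldedYeh = foldHamza('\u0633\u0626\u0644');         // سئل → سءل
const foldedWaw = foldHamza('\u0633\u0624\u0627\u0644');   // سؤال → سءال
const foldedAlef = foldHamza('\u0623\u0645\u0631');        // أمر → امر

console.log(foldedYeh, foldedWaw, foldedAlef);
```

Note that the last line of the chain also strips the hamza from أ and إ, since those decompose to a bare alef plus U+0654 or U+0655.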

Hint Matching Algorithm

Matching happens in three steps:

1. Normalize All Tokens

const normalizedTexts = hints 
    ? tokens.map(t => normalizeTokenText(t.text, hints.normalization))
    : null;

2. Check for Matches

export const isHintMatched = (
    normalizedTokens: string[],
    hints: Hints,
    index: number
): boolean => {
    const key = normalizedTokens[index];
    const candidates = hints.map[key];

    if (!candidates) {
        return false;
    }

    for (const words of candidates) {
        if (isHintSequenceMatchedAtIndex(normalizedTokens, words, index)) {
            return true;
        }
    }

    return false;
};

3. Verify Sequence Match

const isHintSequenceMatchedAtIndex = (
    normalizedTokens: string[],
    words: string[],
    index: number
): boolean => {
    if (index + words.length > normalizedTokens.length) {
        return false;
    }

    for (let k = 0; k < words.length; k++) {
        if (normalizedTokens[index + k] !== words[k]) {
            return false;
        }
    }

    return true;
};
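
To see the three steps working together, the following sketch scans a short, already-normalized token list and collects every index at which a hint starts. It is a condensed restatement of the two functions above under stated assumptions: matchesAt and the hand-written exampleMap are hypothetical, and the tokens are lowercase so that no assumption about case handling is needed.

```typescript
type HintMap = Record<string, string[][]>;

// Condensed restatement of isHintMatched + isHintSequenceMatchedAtIndex.
const matchesAt = (tokens: string[], map: HintMap, index: number): boolean => {
    const key = tokens[index];
    const candidates = key === undefined ? undefined : map[key];
    if (!candidates) {
        return false;
    }
    return candidates.some(
        (words) =>
            index + words.length <= tokens.length && // phrase must fit
            words.every((word, k) => tokens[index + k] === word)
    );
};

// Tokens are assumed to be normalized already (step 1).
const normalizedTokens = ['hello', 'everyone', 'next', 'topic', 'will', 'be', 'next'];

const exampleMap: HintMap = {
    next: [['next', 'topic'], ['next', 'section']],
    moving: [['moving', 'on']],
};

const matchIndices = normalizedTokens
    .map((_, idx) => idx)
    .filter((idx) => matchesAt(normalizedTokens, exampleMap, idx));

console.log(matchIndices);
// → [2]
```

The trailing "next" at index 6 does not match because neither candidate phrase fits in the remaining tokens, which is the bounds check at the top of step 3.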

Complete Example

import { 
    createHints,
    markTokensWithDividers,
    groupMarkedTokensIntoSegments,
    mergeShortSegmentsWithPrevious
} from 'paragrafs';

const tokens = [
    { start: 0, end: 1, text: "Hello" },
    { start: 1, end: 2, text: "everyone" },
    { start: 2, end: 3, text: "Next" },
    { start: 3, end: 4, text: "topic" },
    { start: 4, end: 5, text: "will" },
    { start: 5, end: 6, text: "be" }
];

// Create hints for "next topic"
const hints = createHints('next topic');

// Mark tokens with dividers
const marked = markTokensWithDividers(tokens, {
    gapThreshold: 1.0,
    hints
});

// Result:
// [
//   { start: 0, end: 1, text: "Hello" },
//   { start: 1, end: 2, text: "everyone" },
//   SEGMENT_BREAK,
//   ALWAYS_BREAK,              // Inserted because "next topic" matched!
//   { start: 2, end: 3, text: "Next" },
//   { start: 3, end: 4, text: "topic" },
//   { start: 4, end: 5, text: "will" },
//   { start: 5, end: 6, text: "be" }
// ]

Arabic Example

import { createHints, markTokensWithDividers } from 'paragrafs';

const hints = createHints(
    {
        normalizeAlef: true,
        normalizeYa: true
    },
    'الموضوع القادم',  // "The next topic"
    'وفي الختام'      // "And in conclusion"
);

const tokens = [
    { start: 0, end: 1, text: "مرحبا" },
    { start: 1, end: 2, text: "بكم" },
    { start: 2, end: 3, text: "الموضوع" },
    { start: 3, end: 4, text: "القادم" },
    { start: 4, end: 5, text: "سيكون" }
];

const marked = markTokensWithDividers(tokens, {
    gapThreshold: 1.0,
    hints
});

// "الموضوع القادم" will be matched even with different diacritics
// or alef variants in the actual tokens!

Normalization makes matching robust against variations in diacritics, punctuation, and Arabic letter forms.

Using Hints in Paragraph Reconstruction

import { markAndCombineSegments, createHints } from 'paragrafs';

const hints = createHints(
    'next section',
    'to summarize',
    'in conclusion',
    'moving on'
);

// segments: your existing transcript segments (not defined in this snippet)
const markedSegments = markAndCombineSegments(segments, {
    fillers: ['uh', 'um'],
    gapThreshold: 1.5,
    maxSecondsPerSegment: 30,
    minWordsPerSegment: 5,
    hints  // Pass hints to force breaks at these phrases
});

Finding Matching Tokens

Use getFirstMatchingToken to find where a phrase occurs:

import { getFirstMatchingToken } from 'paragrafs';

const tokens = [
    { start: 0, end: 1, text: 'the' },
    { start: 1, end: 2, text: 'quick' },
    { start: 2, end: 3, text: 'brown' },
    { start: 3, end: 4, text: 'fox' }
];

const match = getFirstMatchingToken(tokens, 'quick brown');
// Returns: { start: 1, end: 2, text: 'quick' }

const noMatch = getFirstMatchingToken(tokens, 'lazy dog');
// Returns: null

This function internally uses createHints with default normalization.

Base Normalization

All normalization builds on normalizeWord:

export const normalizeWord = (w: string) => {
    return w
        .normalize('NFD')                    // Decompose Unicode
        .replace(/[\u200B-\u200D\uFEFF]/g, '')  // Zero-width chars
        .replace(/\p{Mn}/gu, '')             // Combining marks
        .replace(/[\u064B-\u065F]/g, '')     // Arabic diacritics
        .replace(/^[\p{P}\p{S}\p{Cf}]+|[\p{P}\p{S}\p{Cf}]+$/gu, '')  // Trim punctuation
        .normalize('NFC');                   // Recompose Unicode
};

This handles:

Unicode normalization: NFD/NFC for consistent representation
Zero-width characters: U+200B–U+200D, U+FEFF
Combining marks: Diacritical marks (\p{Mn})
Arabic diacritics: U+064B–U+065F (fatḥa, ḍamma, kasra, etc.)
Punctuation: Leading/trailing symbols
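
A few sample inputs make the effect of each step concrete. The snippet below repeats normalizeWord from above only so that it runs on its own; the inputs are illustrative and were chosen for this example.

```typescript
// Copy of normalizeWord from above so the snippet is self-contained.
const normalizeWord = (w: string): string =>
    w
        .normalize('NFD')
        .replace(/[\u200B-\u200D\uFEFF]/g, '')
        .replace(/\p{Mn}/gu, '')
        .replace(/[\u064B-\u065F]/g, '')
        .replace(/^[\p{P}\p{S}\p{Cf}]+|[\p{P}\p{S}\p{Cf}]+$/gu, '')
        .normalize('NFC');

const samples = {
    quoted: normalizeWord('"topic,"'),   // leading/trailing punctuation trimmed → topic
    accented: normalizeWord('caf\u00E9'), // café → cafe (combining accent stripped)
    zeroWidth: normalizeWord('to\u200Bpic'), // zero-width space removed → topic
    // مَرْحَبًا → مرحبا (Arabic diacritics stripped)
    arabic: normalizeWord('\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627'),
    apostrophe: normalizeWord("don't"),  // interior punctuation is kept → don't
};

console.log(samples);
```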

Performance Considerations

Hints are efficient because:

First-word indexing: Only phrases starting with the current token are checked
Early termination: Matching stops as soon as a word in the sequence fails to match
Single normalization pass: Tokens are normalized once, not per hint

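For large hint sets, finding the candidate phrases for a token is an O(1) map lookup on its first word. Checking those candidates then takes O(n) sequence comparisons, where n is the number of hints sharing that first word, and each comparison stops at the first mismatching word.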
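
The benefit of first-word indexing can be illustrated by counting how many phrases are examined per strategy. This is a toy comparison under stated assumptions, not a benchmark of the library: the phrase list, the hand-written index, and the counter names are all made up for the example.

```typescript
type HintMap = Record<string, string[][]>;

// Five phrases, indexed by first word as the hint map does.
const indexedPhrases: HintMap = {
    next: [['next', 'topic'], ['next', 'section']],
    moving: [['moving', 'on']],
    in: [['in', 'conclusion']],
    to: [['to', 'summarize']],
};
const totalPhrases = Object.values(indexedPhrases).reduce(
    (sum, bucket) => sum + bucket.length,
    0
);

const scanTokens = ['hello', 'everyone', 'next', 'topic', 'will', 'be'];

let naiveChecks = 0;   // every phrase examined at every token
let indexedChecks = 0; // only phrases sharing the token's first word

for (const token of scanTokens) {
    naiveChecks += totalPhrases;
    indexedChecks += (indexedPhrases[token] ?? []).length;
}

console.log({ naiveChecks, indexedChecks });
// → { naiveChecks: 30, indexedChecks: 2 }
```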

Best Practices

Group related phrases with the same normalization settings in a single createHints call.

Very short hints (1-2 words) may cause false positives. Use longer, more specific phrases when possible.
