

Overview

AI transcriptions are rarely perfect. Ground truth alignment solves the problem of syncing imperfect AI-generated tokens with human-edited text while preserving precise word-level timing information.

The Problem

Given:
  • AI tokens: [{text: "helo", start: 0, end: 1}, {text: "wrld", start: 1, end: 2}]
  • Human edit: "hello world"
How do we update the tokens to reflect the corrected text while keeping timing intact?
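Since the token and word counts match here, the answer is to keep each token's timing and swap in the corrected text:

[{text: "hello", start: 0, end: 1}, {text: "world", start: 1, end: 2}]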

The Solution: LCS-Based Alignment

Paragrafs uses the Longest Common Subsequence (LCS) algorithm to find reliable anchor points between AI tokens and ground truth words, then intelligently fills gaps.
import { updateSegmentWithGroundTruth } from 'paragrafs';

const corrected = updateSegmentWithGroundTruth(segment, "hello world");

How It Works

Step 1: Find Anchor Points

The algorithm normalizes both sequences and builds an LCS table to find matching words:
const normalizedTokens = tokens.map(t => normalizeWord(t.text));
const normalizedGTWords = groundTruthWords.map(normalizeWord);

const lcsTable = buildLcsTable(normalizedTokens, normalizedGTWords);
const lcsMatches = extractLcsMatches(lcsTable, normalizedTokens, normalizedGTWords);
Normalization removes diacritics and punctuation and applies NFD/NFC Unicode normalization for robust matching.
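For example (an illustrative input; the full implementation appears under Text Normalization below):

normalizeWord('Wörld!');  // → 'World' — trailing punctuation and the umlaut's combining mark are stripped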

Step 2: Enforce Hard Constraints

First and last tokens are always anchored:
lcsMatches.set(0, 0);  // First token always matches first word
if (tokens.length > 1 && groundTruthWords.length > 1) {
    lcsMatches.set(tokens.length - 1, groundTruthWords.length - 1);
}
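These forced anchors matter when normalization alone finds no common words. In the "helo wrld" example above, the LCS of ["helo", "wrld"] and ["hello", "world"] is empty, so the hard constraints supply the only matches:

lcsMatches;  // Map { 0 => 0, 1 => 1 } — each token pairs with its corrected word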

Step 3: Process Gaps Between Anchors

For each gap between anchor points, one of three cases applies; a combined sketch follows the three snippets below.

If ground truth has more words, new tokens are inserted with estimated timing:
const createInsertionToken = (text, { gtGap, gtGapIndex, prevToken, nextToken, tokenGap }) => {
    const gapStartTime = prevToken?.end ?? 0;
    const gapEndTime = nextToken.start;
    const timeAvailable = Math.max(0, gapEndTime - gapStartTime);
    
    const itemsToInsert = gtGap.length - tokenGap.length;
    const timePerItem = itemsToInsert > 0 ? timeAvailable / itemsToInsert : 0;
    
    const insertionIndex = gtGapIndex - tokenGap.length;
    const start = gapStartTime + insertionIndex * timePerItem;
    const end = start + timePerItem;
    
    return { end, start, text };
};
If the AI produced extra tokens, they are marked with isUnknown: true:
if (gtGapIndex >= gtGap.length) {
    result.push({ ...tokenGap[tokenGapIndex], isUnknown: true });
    tokenGapIndex++;
}
If the counts match, the token's text is replaced while its timing is kept:
result.push({ ...tokenGap[tokenGapIndex], text: gtGap[gtGapIndex] });
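
Putting the three cases together, gap filling can be sketched as a single pass. This is an illustrative helper (fillGap is hypothetical, assuming the library's Token and GroundedToken types); the library's actual internals may differ:

const fillGap = (
    tokenGap: Token[],
    gtGap: string[],
    prevToken: Token | undefined,
    nextToken: Token,
): GroundedToken[] => {
    const result: GroundedToken[] = [];
    const shared = Math.min(tokenGap.length, gtGap.length);

    // Counts overlap: replace the token text, keep its timing.
    for (let i = 0; i < shared; i++) {
        result.push({ ...tokenGap[i], text: gtGap[i] });
    }

    // Extra AI tokens: keep them, but mark them as unknown.
    for (let i = shared; i < tokenGap.length; i++) {
        result.push({ ...tokenGap[i], isUnknown: true });
    }

    // Extra ground truth words: insert with evenly distributed timing.
    const extra = gtGap.length - tokenGap.length;
    if (extra > 0) {
        const gapStart = tokenGap[shared - 1]?.end ?? prevToken?.end ?? 0;
        const timePerItem = Math.max(0, nextToken.start - gapStart) / extra;
        for (let i = 0; i < extra; i++) {
            result.push({
                end: gapStart + (i + 1) * timePerItem,
                start: gapStart + i * timePerItem,
                text: gtGap[shared + i],
            });
        }
    }

    return result;
};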

Complete Example

import { updateSegmentWithGroundTruth } from 'paragrafs';

const segment = {
    start: 0,
    end: 5,
    text: "the quik brown fox",
    tokens: [
        { start: 0, end: 1, text: "the" },
        { start: 1, end: 2, text: "quik" },
        { start: 2, end: 4, text: "brown" },
        { start: 4, end: 5, text: "fox" }
    ]
};

const corrected = updateSegmentWithGroundTruth(
    segment,
    "the quick brown fox"
);

// Result:
// {
//   start: 0,
//   end: 5,
//   text: "the quick brown fox",
//   tokens: [
//     { start: 0, end: 1, text: "the" },
//     { start: 1, end: 2, text: "quick" },      // Corrected!
//     { start: 2, end: 4, text: "brown" },
//     { start: 4, end: 5, text: "fox" }
//   ]
// }

Handling Insertions and Deletions

Insertions

When ground truth has extra words, the available time is distributed evenly across the inserted tokens; as the result below shows, a neighboring token's boundary may also be adjusted to make room:
const segment = {
    start: 0,
    end: 2,
    text: "hello world",
    tokens: [
        { start: 0, end: 1, text: "hello" },
        { start: 1, end: 2, text: "world" }
    ]
};

const corrected = updateSegmentWithGroundTruth(
    segment,
    "hello beautiful world"  // Added "beautiful"
);

// Result includes estimated timing for "beautiful":
// tokens: [
//   { start: 0, end: 1, text: "hello" },
//   { start: 1, end: 1.5, text: "beautiful" },  // Inserted with estimated time
//   { start: 1.5, end: 2, text: "world" }
// ]

Deletions

Extra AI tokens are marked with isUnknown: true:
const segment = {
    start: 0,
    end: 3,
    text: "hello uh world",
    tokens: [
        { start: 0, end: 1, text: "hello" },
        { start: 1, end: 2, text: "uh" },
        { start: 2, end: 3, text: "world" }
    ]
};

const corrected = updateSegmentWithGroundTruth(
    segment,
    "hello world"  // Removed "uh"
);

// Result:
// tokens: [
//   { start: 0, end: 1, text: "hello" },
//   { start: 1, end: 2, text: "uh", isUnknown: true },  // Marked as unknown
//   { start: 2, end: 3, text: "world" }
// ]
Use applyGroundTruthToSegment instead of updateSegmentWithGroundTruth to automatically filter out unknown tokens.

GroundedToken Type

Alignment produces GroundedToken objects:
export type GroundedToken = Token & {
    /** If true, this token was not matched during ground truth syncing */
    isUnknown?: boolean;
};

export type GroundedSegment = Omit<Segment, 'tokens'> & {
    tokens: GroundedToken[];
};

Filtering Unknown Tokens

To get a clean segment without unmatched tokens:
import { applyGroundTruthToSegment } from 'paragrafs';

const clean = applyGroundTruthToSegment(segment, groundTruth);
// Automatically filters out tokens with isUnknown: true
Or manually:
const grounded = updateSegmentWithGroundTruth(segment, groundTruth);
const clean = {
    ...grounded,
    tokens: grounded.tokens.filter(t => !t.isUnknown)
};

LCS Algorithm Details

The implementation uses classic dynamic programming:
export const buildLcsTable = (a: string[], b: string[]) => {
    const m = a.length;
    const n = b.length;
    const table: number[][] = Array.from({ length: m + 1 }, () => 
        Array(n + 1).fill(0)
    );

    for (let i = 0; i < m; i++) {
        for (let j = 0; j < n; j++) {
            if (a[i] === b[j]) {
                table[i + 1][j + 1] = table[i][j] + 1;
            } else {
                table[i + 1][j + 1] = Math.max(
                    table[i][j + 1],
                    table[i + 1][j]
                );
            }
        }
    }
    return table;
};
Complexity: O(m × n) where m and n are the lengths of the token and ground truth sequences.
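
extractLcsMatches (called in Step 1) then backtracks through the table to recover the matched index pairs. A minimal sketch, assuming it returns a Map from token index to ground truth word index, which matches how lcsMatches is used in Step 2:

const extractLcsMatches = (
    table: number[][],
    a: string[],
    b: string[],
): Map<number, number> => {
    const matches = new Map<number, number>();
    let i = a.length;
    let j = b.length;
    while (i > 0 && j > 0) {
        if (a[i - 1] === b[j - 1]) {
            // Equal words belong to the LCS: record the anchor pair.
            matches.set(i - 1, j - 1);
            i--;
            j--;
        } else if (table[i - 1][j] >= table[i][j - 1]) {
            i--;  // Dropping token i-1 preserves the LCS length
        } else {
            j--;  // Dropping word j-1 preserves the LCS length
        }
    }
    return matches;
};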

Text Normalization

The normalizeWord function ensures robust matching:
export const normalizeWord = (w: string) => {
    return w
        .normalize('NFD')                           // Decompose Unicode
        .replace(/[\u200B-\u200D\uFEFF]/g, '')      // Remove zero-width chars
        .replace(/\p{Mn}/gu, '')                    // Remove combining marks
        .replace(/[\u064B-\u065F]/g, '')            // Remove Arabic diacritics
        .replace(/^[\p{P}\p{S}\p{Cf}]+|[\p{P}\p{S}\p{Cf}]+$/gu, '')  // Trim punctuation
        .normalize('NFC');                          // Recompose Unicode
};
This handles:
  • Arabic diacritics (ً ٌ ٍ َ ُ ِ ّ etc.)
  • Zero-width characters
  • Unicode normalization (NFD/NFC)
  • Leading/trailing punctuation
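
A few concrete inputs and outputs:

normalizeWord('“Hello!”');  // → 'Hello' (surrounding punctuation trimmed)
normalizeWord('café');      // → 'cafe'  (NFD decomposes é; the combining accent is stripped)
normalizeWord('كِتَابٌ');     // → 'كتاب'  (Arabic diacritics removed)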

Best Practices

Always normalize ground truth consistently with how tokens were normalized during alignment.
Insertion timing is estimated by distributing available time evenly. For critical applications, consider manual timing adjustment.
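
For example, a quick post-processing check (a hypothetical snippet, not a library API) can flag inserted tokens whose estimated duration collapsed to zero so their timing can be reviewed by hand:

const suspect = corrected.tokens.filter(
    (t) => !t.isUnknown && t.end - t.start === 0,
);
// Inspect and adjust these tokens' timing manually.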

Next Steps

Segments and Tokens

Learn about the core data structures

Hints System

Explore normalized hint matching