Paragraph Reconstruction

Overview

AI transcription services typically output a flat sequence of word-level tokens. Paragrafs reconstructs natural paragraph boundaries by analyzing timing gaps, punctuation, filler words, and custom hints.

The Reconstruction Pipeline

Paragraph reconstruction happens in three stages:

1. Mark tokens with dividers - Identify natural break points
2. Group into segments - Combine tokens respecting duration limits
3. Merge short segments - Avoid very short paragraphs

import { markAndCombineSegments } from 'paragrafs';

const markedSegments = markAndCombineSegments(segments, {
    fillers: ['uh', 'um', 'hmm'],
    gapThreshold: 1.5,              // seconds
    maxSecondsPerSegment: 30,
    minWordsPerSegment: 5
});

Stage 1: Marking Tokens with Dividers

The markTokensWithDividers function identifies break points based on:

Filler Words

Words like “uh”, “um”, “hmm” often indicate hesitation or thought breaks:

import { markTokensWithDividers } from 'paragrafs';

const marked = markTokensWithDividers(tokens, {
    fillers: ['uh', 'um', 'hmm'],
    gapThreshold: 1.0
});

Each filler word is removed from the output and replaced with a SEGMENT_BREAK marker.
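The replacement step can be pictured with a small standalone sketch. It does not import the library: WordToken, MarkedToken, and replaceFillers are hypothetical names, and the case-insensitive comparison with trailing punctuation stripped is an assumption about matching, not documented behavior.

```typescript
// Hypothetical shapes for illustration only; not the library's own types.
type WordToken = { start: number; end: number; text: string };
const SEGMENT_BREAK = 'SEGMENT_BREAK' as const;
type MarkedToken = WordToken | typeof SEGMENT_BREAK;

function replaceFillers(tokens: WordToken[], fillers: string[]): MarkedToken[] {
    const lowered = fillers.map((f) => f.toLowerCase());
    const marked: MarkedToken[] = [];
    for (const token of tokens) {
        // Assumption: compare case-insensitively and ignore trailing punctuation ("um," -> "um")
        const normalized = token.text.toLowerCase().replace(/[.,!?]+$/, '');
        if (lowered.indexOf(normalized) !== -1) {
            marked.push(SEGMENT_BREAK); // the filler is dropped; a soft break takes its place
        } else {
            marked.push(token);
        }
    }
    return marked;
}

const fillerDemo = replaceFillers(
    [
        { start: 0, end: 0.4, text: 'So' },
        { start: 0.5, end: 0.8, text: 'um' },
        { start: 0.9, end: 1.3, text: 'welcome' },
    ],
    ['uh', 'um', 'hmm'],
);
```

In this sketch the three input tokens become two word tokens with a SEGMENT_BREAK between them.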

Time Gaps

Significant pauses between words suggest natural breaks:

// If the gap between tokens exceeds gapThreshold, insert SEGMENT_BREAK
if (prevEnd !== null && token.start - prevEnd > gapThreshold) {
    marked.push(SEGMENT_BREAK);
}
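For context, the check above can be placed in a complete loop. The following is a minimal standalone sketch: markGaps and the token shape are hypothetical, and only the comparison itself is taken from the excerpt.

```typescript
// Hypothetical shapes for illustration only; not the library's own types.
type WordToken = { start: number; end: number; text: string };
const SEGMENT_BREAK = 'SEGMENT_BREAK' as const;
type MarkedToken = WordToken | typeof SEGMENT_BREAK;

function markGaps(tokens: WordToken[], gapThreshold: number): MarkedToken[] {
    const marked: MarkedToken[] = [];
    let prevEnd: number | null = null;
    for (const token of tokens) {
        // Same comparison as the excerpt: silence longer than the threshold inserts a break
        if (prevEnd !== null && token.start - prevEnd > gapThreshold) {
            marked.push(SEGMENT_BREAK);
        }
        marked.push(token);
        prevEnd = token.end;
    }
    return marked;
}

const gapDemo = markGaps(
    [
        { start: 0, end: 0.5, text: 'Hello' },
        { start: 0.6, end: 1.0, text: 'everyone' },
        { start: 3.0, end: 3.4, text: 'Today' },
    ],
    1.5,
);
```

Here the 2.0 second pause before "Today" exceeds the 1.5 second threshold, so a SEGMENT_BREAK lands in front of it, while the 0.1 second gap between the first two words does not trigger one.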

Punctuation

Tokens ending with sentence-ending punctuation (., ?, !, etc.) trigger breaks:

if (isEndingWithPunctuation(token.text)) {
    marked.push(SEGMENT_BREAK);
}

Supported punctuation: ., ?, !, ؟ (Arabic question mark), ؛ (Arabic semicolon), …
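A check of this kind is often a one-line regular expression. The sketch below is a hypothetical stand-in for isEndingWithPunctuation, not its actual source; it covers only the five marks named explicitly above and leaves out whatever the trailing "…" refers to.

```typescript
// Hypothetical stand-in; \u061F is the Arabic question mark, \u061B the Arabic semicolon.
const SENTENCE_END = /[.?!\u061F\u061B]$/;

function endsWithSentencePunctuation(text: string): boolean {
    // Trim first so trailing whitespace does not hide the final mark
    return SENTENCE_END.test(text.trim());
}

const punctuationDemo = ['done.', 'really?', 'ok\u061F', 'hello', 'wait,'].map(endsWithSentencePunctuation);
```

Commas and unpunctuated words do not match, so only tokens that close a sentence produce a break.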

Custom Hints

Multi-word phrases can be marked with ALWAYS_BREAK to force paragraph boundaries:

import { createHints, markTokensWithDividers } from 'paragrafs';

const hints = createHints('next topic', 'moving on', 'in conclusion');

const marked = markTokensWithDividers(tokens, {
    gapThreshold: 1.0,
    hints
});

See the Hints System documentation for details on normalization and matching.
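To make the idea of multi-word matching concrete, here is a rough standalone sketch of a sliding-window phrase match. It is not how the library implements hints: markHints and normalizeWord are hypothetical, normalization is simplified to lowercase ASCII letters and digits, and placing ALWAYS_BREAK before the first word of the phrase is an assumption.

```typescript
// Hypothetical shapes for illustration only; not the library's own types.
type WordToken = { start: number; end: number; text: string };
const ALWAYS_BREAK = 'ALWAYS_BREAK' as const;
type MarkedToken = WordToken | typeof ALWAYS_BREAK;

// Simplified normalization: lowercase, then drop everything except a-z and 0-9
const normalizeWord = (word: string): string => word.toLowerCase().replace(/[^a-z0-9]/g, '');

function markHints(tokens: WordToken[], phrases: string[]): MarkedToken[] {
    const hintWords = phrases.map((phrase) => phrase.split(/\s+/).map(normalizeWord));
    const marked: MarkedToken[] = [];
    tokens.forEach((token, i) => {
        // A phrase matches at position i when each of its words equals the token at the same offset
        const startsPhrase = hintWords.some((words) =>
            words.every((word, offset) => {
                const candidate = tokens[i + offset];
                return candidate !== undefined && normalizeWord(candidate.text) === word;
            }),
        );
        if (startsPhrase) marked.push(ALWAYS_BREAK); // assumption: the break goes before the phrase
        marked.push(token);
    });
    return marked;
}

const hintDemo = markHints(
    [
        { start: 0, end: 0.5, text: 'Thanks.' },
        { start: 0.6, end: 1.0, text: 'Moving' },
        { start: 1.0, end: 1.2, text: 'on,' },
        { start: 1.3, end: 1.6, text: 'slides' },
    ],
    ['moving on'],
);
```

Because both sides are normalized, "Moving on," in the transcript still matches the hint 'moving on'.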

Stage 2: Grouping into Segments

The groupMarkedTokensIntoSegments function combines tokens into segments while respecting:

Maximum duration - Segments won’t exceed maxSecondsPerSegment
Break markers - ALWAYS_BREAK forces immediate segment boundaries
Soft breaks - SEGMENT_BREAK marks a boundary that is taken only once the duration limit is exceeded

const segments = groupMarkedTokensIntoSegments(markedTokens, 30);

Breaking Behavior

ALWAYS_BREAK creates hard boundaries:

if (token === ALWAYS_BREAK) {
    flush();  // End current segment
    reset();  // Start new segment
    currentSegment = [ALWAYS_BREAK];
    continue;
}

SEGMENT_BREAK respects duration limits:

if (nextIsDivider && durationExceeded()) {
    flush();
    reset();
}
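The difference between hard and soft breaks can be seen in a simplified standalone grouping loop. This is a sketch under assumptions, not the library's groupMarkedTokensIntoSegments: groupTokens is a hypothetical name, it returns plain arrays of tokens, and unlike the excerpt above it drops the markers instead of carrying ALWAYS_BREAK into the new segment.

```typescript
// Hypothetical shapes for illustration only; not the library's own types.
type WordToken = { start: number; end: number; text: string };
const SEGMENT_BREAK = 'SEGMENT_BREAK' as const;
const ALWAYS_BREAK = 'ALWAYS_BREAK' as const;
type MarkedToken = WordToken | typeof SEGMENT_BREAK | typeof ALWAYS_BREAK;

function groupTokens(marked: MarkedToken[], maxSeconds: number): WordToken[][] {
    const groups: WordToken[][] = [];
    let current: WordToken[] = [];
    const flush = (): void => {
        if (current.length > 0) groups.push(current);
        current = [];
    };
    for (const item of marked) {
        if (item === ALWAYS_BREAK) {
            flush(); // hard boundary, regardless of duration
        } else if (item === SEGMENT_BREAK) {
            const first = current[0];
            const last = current[current.length - 1];
            // Soft boundary: honored only once the running duration passes the limit
            if (first !== undefined && last !== undefined && last.end - first.start > maxSeconds) {
                flush();
            }
        } else {
            current.push(item);
        }
    }
    flush(); // emit whatever is left at the end
    return groups;
}

const grouped = groupTokens(
    [
        { start: 0, end: 10, text: 'a' },
        { start: 10, end: 20, text: 'b' },
        SEGMENT_BREAK, // ignored: only 20 seconds accumulated
        { start: 20, end: 35, text: 'c' },
        SEGMENT_BREAK, // taken: 35 seconds exceeds the 30 second limit
        { start: 35, end: 40, text: 'd' },
        ALWAYS_BREAK, // always taken
        { start: 40, end: 41, text: 'e' },
    ],
    30,
);
```

The first SEGMENT_BREAK is skipped because the segment is still under the limit, which is why soft breaks alone do not fragment short passages.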

Stage 3: Merging Short Segments

The mergeShortSegmentsWithPrevious function combines segments with fewer than minWordsPerSegment words:

const merged = mergeShortSegmentsWithPrevious(segments, 5);

Segments containing ALWAYS_BREAK markers are never merged, preserving intentional boundaries.
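The merge step can be approximated as follows. This is a standalone sketch, not the library's mergeShortSegmentsWithPrevious: TimedSegment and mergeShort are hypothetical, words are counted by splitting on whitespace, and the ALWAYS_BREAK exemption described above is left out for brevity.

```typescript
// Hypothetical shape for illustration only; not the library's own segment type.
type TimedSegment = { start: number; end: number; text: string };

function mergeShort(segments: TimedSegment[], minWords: number): TimedSegment[] {
    const merged: TimedSegment[] = [];
    for (const segment of segments) {
        const wordCount = segment.text.trim().split(/\s+/).filter((w) => w.length > 0).length;
        const previous = merged[merged.length - 1];
        if (previous !== undefined && wordCount < minWords) {
            // Too short: fold into the previous segment and extend its end time
            merged[merged.length - 1] = {
                start: previous.start,
                end: segment.end,
                text: `${previous.text} ${segment.text}`,
            };
        } else {
            merged.push({ start: segment.start, end: segment.end, text: segment.text });
        }
    }
    return merged;
}

const mergedDemo = mergeShort(
    [
        { start: 0, end: 5, text: 'Welcome to the show today everyone' },
        { start: 5, end: 6, text: 'Right.' },
        { start: 6, end: 12, text: 'Let us begin with the agenda' },
    ],
    5,
);
```

A short first segment has nothing to merge into, so in this sketch it is kept as is.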

Complete Example

Here’s a full reconstruction pipeline:

import { 
    markAndCombineSegments,
    mapSegmentsIntoFormattedSegments,
    createHints
} from 'paragrafs';

// 1. Create hints for domain-specific phrases
const hints = createHints(
    'next section',
    'to summarize',
    'in conclusion'
);

// 2. Mark and combine segments
const markedSegments = markAndCombineSegments(inputSegments, {
    fillers: ['uh', 'um'],
    gapThreshold: 1.5,
    maxSecondsPerSegment: 30,
    minWordsPerSegment: 5,
    hints
});

// 3. Format into clean segments
const formattedSegments = mapSegmentsIntoFormattedSegments(
    markedSegments,
    10  // max seconds per line
);

Formatting Output

Timestamped Transcript

Generate a timestamped transcript with one line per segment:

import { formatSegmentsToTimestampedTranscript } from 'paragrafs';

const transcript = formatSegmentsToTimestampedTranscript(
    markedSegments,
    10  // max seconds per line
);

// Output:
// 0:00: Hello and welcome to this presentation
// 0:05: Today we'll discuss paragraph reconstruction
// 0:12: The algorithm works in three main stages
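The minutes:seconds prefix in the sample output can be produced by a small helper. The sketch below is an assumption about the layout, not the library's built-in formatter; formatTimestamp and formatLine are hypothetical names, and an hour component is not handled.

```typescript
// Hypothetical helper: renders a time in seconds as m:ss, discarding fractions.
function formatTimestamp(totalSeconds: number): string {
    const whole = Math.floor(totalSeconds);
    const minutes = Math.floor(whole / 60);
    const seconds = whole % 60;
    return `${minutes}:${seconds < 10 ? '0' : ''}${seconds}`;
}

// Hypothetical line formatter in the style of the sample output above.
function formatLine(line: { start: number; text: string }): string {
    return `${formatTimestamp(line.start)}: ${line.text}`;
}

const sampleLine = formatLine({ start: 72.4, text: 'The algorithm works in three main stages' });
```

A function with the same shape as formatLine could presumably be passed as a custom formatter, provided the buffer it receives exposes start and text.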

Custom Formatting

Provide your own formatter:

const transcript = formatSegmentsToTimestampedTranscript(
    markedSegments,
    10,
    (buffer) => `[${buffer.start.toFixed(1)}s] ${buffer.text}`
);

// Output:
// [0.0s] Hello and welcome to this presentation
// [5.2s] Today we'll discuss paragraph reconstruction

Advanced: Cleanup Isolated Tokens

The cleanupIsolatedTokens function removes unnecessary breaks that would create single-word lines:

import { cleanupIsolatedTokens } from 'paragrafs';

const cleaned = cleanupIsolatedTokens(markedTokens);

This removes breaks in patterns like:

SEGMENT_BREAK followed by ALWAYS_BREAK
SEGMENT_BREAK followed by another SEGMENT_BREAK
Breaks that would isolate a single word
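The first two patterns amount to dropping a soft break whenever another break follows it directly. The standalone sketch below shows only that part; dropRedundantBreaks is a hypothetical name, and the single-word isolation rule is not covered because its exact criteria are not described here.

```typescript
// Hypothetical shapes for illustration only; not the library's own types.
type WordToken = { start: number; end: number; text: string };
const SEGMENT_BREAK = 'SEGMENT_BREAK' as const;
const ALWAYS_BREAK = 'ALWAYS_BREAK' as const;
type MarkedToken = WordToken | typeof SEGMENT_BREAK | typeof ALWAYS_BREAK;

function dropRedundantBreaks(marked: MarkedToken[]): MarkedToken[] {
    const cleaned: MarkedToken[] = [];
    marked.forEach((item, i) => {
        const next = marked[i + 1];
        // A soft break immediately followed by any other break adds nothing
        if (item === SEGMENT_BREAK && (next === SEGMENT_BREAK || next === ALWAYS_BREAK)) {
            return;
        }
        cleaned.push(item);
    });
    return cleaned;
}

const cleanedDemo = dropRedundantBreaks([
    { start: 0, end: 1, text: 'a' },
    SEGMENT_BREAK,
    SEGMENT_BREAK,
    { start: 2, end: 3, text: 'b' },
    SEGMENT_BREAK,
    ALWAYS_BREAK,
    { start: 4, end: 5, text: 'c' },
]);
```

In the demo the doubled SEGMENT_BREAK collapses to one, and the soft break in front of ALWAYS_BREAK disappears while the hard break stays.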

Next Steps

Hints System - Learn about hint normalization and matching
Ground Truth Alignment - Sync reconstructed paragraphs with human edits
