API Overview

Paragrafs provides a comprehensive TypeScript API for processing transcripts, aligning ground truth, and working with timestamped text.

API Sections

Transcript Builders

Functions for processing tokens into formatted segments with natural breaks

Ground Truth Alignment

Align AI-generated tokens with human-edited text using LCS matching

Editor Helpers

Utilities for finding tokens based on queries or text selections

Utility Functions

Helper functions for timestamps, punctuation, normalization, and more

Hint Generation

Auto-generate hints from repeated phrases in transcripts (Arabic-first)

Types

TypeScript types and interfaces used throughout the library

Quick Start

import {
  estimateSegmentFromToken,
  markAndCombineSegments,
  formatSegmentsToTimestampedTranscript,
} from 'paragrafs';

// Create a segment from a multi-word token
const token = {
  start: 0,
  end: 5,
  text: 'Hello world from paragrafs'
};
const segment = estimateSegmentFromToken(token);

// Process and format segments
const segments = [segment];
const processed = markAndCombineSegments(segments, {
  fillers: ['uh', 'umm'],
  gapThreshold: 3,
  maxSecondsPerSegment: 12,
  minWordsPerSegment: 3,
});

// Generate timestamped transcript
const transcript = formatSegmentsToTimestampedTranscript(processed, 10);
console.log(transcript);
// Output: "0:00: Hello world from paragrafs"

Core Concepts

Tokens

A Token represents a single word or phrase with timing information:

type Token = {
  start: number;  // Start time in seconds
  end: number;    // End time in seconds
  text: string;   // The transcribed text
};
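
Because every token carries its own start and end time, the silence between consecutive tokens can be computed directly. The sketch below is only an illustration: `gapsBefore` is a hypothetical helper, not part of the library, and the `Token` type is redeclared locally so the snippet is self-contained. Options such as `gapThreshold` in the Quick Start suggest that pauses of this kind inform segment breaks, but how the library actually uses them is not specified here.

```typescript
type Token = { start: number; end: number; text: string };

const tokens: Token[] = [
  { start: 0, end: 0.5, text: 'Hello' },
  { start: 0.5, end: 1, text: 'world' },
  { start: 4, end: 4.5, text: 'Next' },
];

// Hypothetical helper: seconds of silence before each token (0 for the first).
function gapsBefore(items: Token[]): number[] {
  let prevEnd: number | undefined;
  return items.map((t) => {
    const gap = prevEnd === undefined ? 0 : t.start - prevEnd;
    prevEnd = t.end;
    return gap;
  });
}

console.log(gapsBefore(tokens)); // [0, 0, 3]
```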

Segments

A Segment is a higher-level structure: it has the same start, end, and text fields as a Token, plus the list of tokens it contains:

type Segment = Token & {
  tokens: Token[];  // Word-by-word breakdown
};
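
To make the relationship between the two types concrete, here is a minimal sketch that turns a multi-word token into a segment by giving each word an equal share of the duration. `splitEvenly` is a hypothetical helper written for this illustration; it is not the library's `estimateSegmentFromToken`, which may distribute time differently.

```typescript
type Token = { start: number; end: number; text: string };
type Segment = Token & { tokens: Token[] };

// Hypothetical helper: split a multi-word token into per-word tokens,
// assuming every word takes the same amount of time.
function splitEvenly(token: Token): Segment {
  const words = token.text.split(/\s+/).filter(Boolean);
  const step = words.length > 0 ? (token.end - token.start) / words.length : 0;
  const tokens = words.map((text, i) => ({
    start: token.start + i * step,
    end: token.start + (i + 1) * step,
    text,
  }));
  // The segment keeps the original timing and text, plus the word breakdown.
  return { ...token, tokens };
}

const example = splitEvenly({ start: 0, end: 4, text: 'Hello world from paragrafs' });
console.log(example.tokens.map((t) => `${t.start}-${t.end} ${t.text}`));
// ["0-1 Hello", "1-2 world", "2-3 from", "3-4 paragrafs"]
```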

Markers

The library uses special markers to indicate segment boundaries:

SEGMENT_BREAK - Soft break (can be ignored if duration constraints allow)
ALWAYS_BREAK - Hard break (must create a new segment/line)
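
The difference between the two markers can be sketched as follows. Everything here is an assumption made for illustration: the marker constants are local stand-ins (the library's actual values and types may differ), and `groupByMarkers` with its `minSeconds` parameter is a hypothetical function that models "duration constraints" as a minimum group length.

```typescript
type Token = { start: number; end: number; text: string };

// Local stand-ins for the library's markers; the real values may differ.
const SEGMENT_BREAK = 'SEGMENT_BREAK';
const ALWAYS_BREAK = 'ALWAYS_BREAK';
type Marked = Token | typeof SEGMENT_BREAK | typeof ALWAYS_BREAK;

// Hypothetical grouping: a hard break always closes the current group,
// a soft break closes it only once the group spans at least minSeconds.
function groupByMarkers(stream: Marked[], minSeconds: number): Token[][] {
  const groups: Token[][] = [];
  let current: Token[] = [];
  const flush = () => {
    if (current.length > 0) groups.push(current);
    current = [];
  };
  for (const item of stream) {
    if (item === ALWAYS_BREAK) {
      flush();
    } else if (item === SEGMENT_BREAK) {
      const first = current[0];
      const last = current[current.length - 1];
      const duration = first && last ? last.end - first.start : 0;
      if (duration >= minSeconds) flush(); // otherwise the soft break is ignored
    } else {
      current.push(item);
    }
  }
  flush();
  return groups;
}

const stream: Marked[] = [
  { start: 0, end: 1, text: 'a' },
  SEGMENT_BREAK, // ignored: the group is only 1 second long
  { start: 1, end: 2, text: 'b' },
  { start: 2, end: 6, text: 'c' },
  SEGMENT_BREAK, // honored: the group now spans 6 seconds
  { start: 6, end: 7, text: 'd' },
  ALWAYS_BREAK, // always honored
  { start: 8, end: 9, text: 'e' },
];
console.log(groupByMarkers(stream, 5).map((g) => g.map((t) => t.text).join(' ')));
// ["a b c", "d", "e"]
```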

Arabic-First Design

Many functions include Arabic-specific features:

Diacritics removal
Alef normalization (أإآ → ا)
Ya normalization (ى → ي)
Tatweel removal (ـ)
Arabic punctuation support (؟ ؛)

These normalizations ensure robust matching and hint generation for Arabic transcripts.
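
As a rough illustration of the normalizations listed above, the sketch below applies them with plain regular expressions. `normalizeArabic` is a hypothetical function, not the library's implementation, and the exact set of code points treated as diacritics (U+064B to U+0652 plus U+0670) is an assumption.

```typescript
// Hypothetical normalizer covering the listed steps.
function normalizeArabic(text: string): string {
  return text
    .replace(/[\u064B-\u0652\u0670]/g, '') // diacritics (assumed range)
    .replace(/\u0640/g, '') // tatweel
    .replace(/[\u0622\u0623\u0625]/g, '\u0627') // alef variants to bare alef
    .replace(/\u0649/g, '\u064A'); // alef maqsura to ya
}

console.log(normalizeArabic('أَحْمَد')); // "احمد"
console.log(normalizeArabic('مـحـمـد')); // "محمد"
console.log(normalizeArabic('على')); // "علي"
```

Normalizing both sides of a comparison this way lets two spellings of the same word match even when one carries diacritics or a different alef form.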
