Overview

Paragrafs provides first-class support for Arabic text processing, including comprehensive normalization options and diacritic-tolerant matching. This is essential for working with Arabic transcriptions from speech recognition systems.

Why Arabic Normalization?

Arabic text presents unique challenges:
  • Diacritics (تشكيل): The same word can appear with or without vowel marks
  • Alef variants: أ, إ, آ, ا are often used interchangeably
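  • Ya variants: ى (alef maqsura) and ي (ya) are frequently interchanged in writing
  • Hamza seats: hamza can appear standalone (ء) or on a carrier letter (ؤ, ئ, أ, إ), so the same word may be spelled several ways
  • Tatweel: ـ (kashida) stretches a word visually and carries no meaning
Paragrafs can normalize each of these variations so that matching does not depend on how a word happens to be spelled.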
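
To see why plain string comparison fails on these variants, it helps to look at the underlying code points. The sketch below is standalone TypeScript and does not use Paragrafs; `codePoints` is a hypothetical helper introduced only for illustration.

```typescript
// Hypothetical helper (not part of Paragrafs): list the code points of a string.
const codePoints = (text: string): string[] =>
  Array.from(text, (ch) =>
    'U+' + (ch.codePointAt(0) ?? 0).toString(16).toUpperCase().padStart(4, '0'),
  );

// The four alef variants are four distinct characters.
console.log(codePoints('\u0623\u0625\u0622\u0627')); // أ إ آ ا → U+0623, U+0625, U+0622, U+0627

// Alef maqsura and ya are adjacent but different code points.
console.log(codePoints('\u0649\u064A')); // ى ي → U+0649, U+064A

// "أحسن" and "احسن" look alike but compare as unequal without normalization.
const withHamza: string = '\u0623\u062D\u0633\u0646';
const bareAlef: string = '\u0627\u062D\u0633\u0646';
console.log(withHamza === bareAlef); // false
```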

Normalization Options

The ArabicNormalizationOptions type controls how text is normalized:
type ArabicNormalizationOptions = {
  normalizeAlef?: boolean;   // Normalize all alef variants to ا
  normalizeHamza?: boolean;  // Normalize hamza seats to standalone ء
  normalizeYa?: boolean;     // Normalize ى to ي
  removeTatweel?: boolean;   // Remove tatweel (ـ)
};

Default Settings

Paragrafs uses Arabic-first defaults tuned for automatic speech recognition (ASR) output:
// Default normalization for createHints and generateHints
{
  normalizeAlef: true,   // ا ← أ, إ, آ
  normalizeHamza: false, // Preserve hamza distinctions
  normalizeYa: true,     // ي ← ى  
  removeTatweel: true,   // Remove ـ
}
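
This page does not say how keys omitted from a custom options object are treated. If you prefer to be explicit, one approach is to build a fully specified object in your own code and pass that wherever normalization options are accepted. In the sketch below, `DOCUMENTED_DEFAULTS` and `withOverrides` are hypothetical names, not Paragrafs exports; the type simply mirrors the one shown above.

```typescript
// Mirrors the option type shown above.
type ArabicNormalizationOptions = {
  normalizeAlef?: boolean;
  normalizeHamza?: boolean;
  normalizeYa?: boolean;
  removeTatweel?: boolean;
};

// The defaults documented on this page, spelled out in full.
const DOCUMENTED_DEFAULTS: Required<ArabicNormalizationOptions> = {
  normalizeAlef: true,
  normalizeHamza: false,
  normalizeYa: true,
  removeTatweel: true,
};

// Hypothetical helper (not part of Paragrafs): start from the documented
// defaults and override only the flags that are explicitly set.
function withOverrides(
  overrides: ArabicNormalizationOptions = {},
): Required<ArabicNormalizationOptions> {
  const result = { ...DOCUMENTED_DEFAULTS };
  for (const key of Object.keys(result) as (keyof ArabicNormalizationOptions)[]) {
    const value = overrides[key];
    if (value !== undefined) result[key] = value;
  }
  return result;
}

// All four flags are present; only normalizeHamza differs from the defaults.
const hamzaInsensitive = withOverrides({ normalizeHamza: true });
console.log(hamzaInsensitive);
```

Keeping one such object in a shared constant also makes it easier to pass identical options to every call that accepts them.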

Creating Normalized Hints

Use createHints to create hints with normalization:
import { createHints } from 'paragrafs';

// Using default normalization
const hints = createHints(
  'أحسن الله إليكم',
  'بارك الله فيكم',
  'جزاكم الله خيرا'
);

// Custom normalization
const customHints = createHints(
  { normalizeAlef: true, normalizeYa: true, removeTatweel: true },
  'أحسن الله إليكم',
  'بارك الله فيكم'
);
Hints are normalized at creation time. During matching, tokens are normalized using the same options to ensure consistent comparison.
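
The reason both sides must share the same options can be seen in a small standalone sketch. `toKey` below is a hypothetical stand-in for a normalizer that only folds alef variants; it is not the Paragrafs implementation.

```typescript
type KeyOptions = { normalizeAlef: boolean };

// Hypothetical stand-in for a normalizer: only folds أ, إ, آ into ا.
function toKey(word: string, options: KeyOptions): string {
  const composed = word.normalize('NFC');
  return options.normalizeAlef
    ? composed.replace(/[\u0622\u0623\u0625]/g, '\u0627')
    : composed;
}

const hintWord = '\u0623\u062D\u0633\u0646';  // أحسن, as written in the hint
const tokenWord = '\u0623\u062D\u0633\u0646'; // أحسن, as emitted in the transcript

// The hint is stored with alef normalization enabled.
const hintKey = toKey(hintWord, { normalizeAlef: true });

// Same options on the token side: the keys agree.
console.log(hintKey === toKey(tokenWord, { normalizeAlef: true }));  // true

// Different options on the token side: the keys no longer agree.
console.log(hintKey === toKey(tokenWord, { normalizeAlef: false })); // false
```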

How Normalization Works

  1. Unicode decomposition (NFD): Text is decomposed to separate base characters from combining marks.
  2. Remove diacritics: Arabic diacritics (harakat) and combining marks are removed: ً ٌ ٍ َ ُ ِ ّ ْ ٓ ٰ
  3. Remove zero-width characters: Invisible formatting characters are removed: \u200B, \u200C, \u200D, \uFEFF
  4. Strip punctuation: Leading and trailing punctuation is removed from each word.
  5. Apply normalization options: Optional normalizations (alef, ya, hamza, tatweel) are applied.
  6. Unicode composition (NFC): Text is recomposed to standard Unicode form.
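
The steps above can be sketched in standalone TypeScript. This is an illustration under stated assumptions, not the library's code: `sketchNormalize` and `SketchOptions` are hypothetical names, every option defaults to off, and the sketch strips only the harakat U+064B to U+0652 plus U+0670. It keeps the combining hamza and madda marks (U+0653 to U+0655) until the options step, so that أ, إ, آ, ؤ, ئ can be recomposed when their option is off. Paragrafs may draw that line differently.

```typescript
type SketchOptions = {
  normalizeAlef?: boolean;
  normalizeHamza?: boolean;
  normalizeYa?: boolean;
  removeTatweel?: boolean;
};

// Built with the RegExp constructor so the \p{P} escape does not depend on the
// TypeScript compile target.
const EDGE_PUNCTUATION = new RegExp('^\\p{P}+|\\p{P}+$', 'gu');

// Hypothetical sketch of the six documented steps (not the Paragrafs source).
function sketchNormalize(text: string, options: SketchOptions = {}): string {
  // 1. Decompose: أ becomes ا + combining hamza above, and so on.
  let s = text.normalize('NFD');
  // 2. Remove harakat (assumption: U+064B-U+0652 and U+0670 only).
  s = s.replace(/[\u064B-\u0652\u0670]/g, '');
  // 3. Remove zero-width characters.
  s = s.replace(/[\u200B-\u200D\uFEFF]/g, '');
  // 4. Strip leading and trailing punctuation.
  s = s.replace(EDGE_PUNCTUATION, '');
  // 5. Apply the optional normalizations on the decomposed text.
  if (options.normalizeAlef) s = s.replace(/\u0627[\u0653-\u0655]/g, '\u0627');
  if (options.normalizeHamza) s = s.replace(/[\u0648\u064A]\u0654/g, '\u0621');
  if (options.normalizeYa) s = s.replace(/\u0649/g, '\u064A');
  if (options.removeTatweel) s = s.replace(/\u0640/g, '');
  // 6. Recompose whatever base + mark pairs remain.
  return s.normalize('NFC');
}

// أَحْسَنَ → احسن
console.log(sketchNormalize('\u0623\u064E\u062D\u0652\u0633\u064E\u0646\u064E', { normalizeAlef: true }));
// سؤال → سءال
console.log(sketchNormalize('\u0633\u0624\u0627\u0644', { normalizeHamza: true }));
// اللـــه → الله
console.log(sketchNormalize('\u0627\u0644\u0644\u0640\u0640\u0640\u0647', { removeTatweel: true }));
```

Applying the options while the text is still decomposed keeps the documented step order: each precomposed letter is already split into a base letter and a mark, so folding an alef variant amounts to dropping the mark after ا.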

Manual Text Normalization

You can normalize text directly using normalizeTokenText:
import { normalizeTokenText } from 'paragrafs';

// Default normalization
const normalized = normalizeTokenText('أَحْسَنَ');
console.log(normalized); // "احسن"

// Custom normalization
const custom = normalizeTokenText('إليكم،', {
  normalizeAlef: true,
  normalizeYa: true,
  removeTatweel: true,
});
console.log(custom); // "اليكم"

Simple Word Normalization

For basic diacritic removal, use normalizeWord:
import { normalizeWord } from 'paragrafs';

const word = normalizeWord('أَحْسَنَ');
console.log(word); // "أحسن" (diacritics removed, but alef variant preserved)

Normalization Examples

Alef Normalization

import { normalizeTokenText } from 'paragrafs';

const variants = ['أحمد', 'إحسان', 'آمن', 'الله'];

const normalized = variants.map(v => 
  normalizeTokenText(v, { normalizeAlef: true })
);

console.log(normalized);
// ["احمد", "احسان", "امن", "الله"]

Ya Normalization

import { normalizeTokenText } from 'paragrafs';

const variants = ['إلى', 'علي'];

const normalized = variants.map(v => 
  normalizeTokenText(v, { normalizeYa: true })
);

console.log(normalized);
// ["إلي", "علي"] (ى → ي)

Hamza Normalization

import { normalizeTokenText } from 'paragrafs';

const variants = ['سؤال', 'مئة', 'شيء'];

const normalized = variants.map(v => 
  normalizeTokenText(v, { normalizeHamza: true })
);

console.log(normalized);
// ["سءال", "مءة", "شيء"]

Tatweel Removal

import { normalizeTokenText } from 'paragrafs';

const stretched = 'اللـــــه';

const normalized = normalizeTokenText(stretched, { 
  removeTatweel: true 
});

console.log(normalized); // "الله"

Hint Matching with Normalization

When hints are used, tokens are automatically normalized for matching:
import { createHints, markTokensWithDividers } from 'paragrafs';

// Create hints with normalization
const hints = createHints(
  { normalizeAlef: true, normalizeYa: true },
  'أحسن الله إليكم'
);

// These tokens match the hint even though they carry diacritics the hint lacks
const tokens = [
  { start: 0, end: 1, text: 'أَحْسَنَ' },   // Fully vowelled; أ is normalized to ا
  { start: 1, end: 2, text: 'اللهُ' },      // Trailing damma is stripped
  { start: 2, end: 3, text: 'إلَيْكُمْ' },  // Diacritics stripped; إ is normalized to ا
];

const marked = markTokensWithDividers(tokens, {
  fillers: [],
  gapThreshold: 999,
  hints,
});

// An ALWAYS_BREAK marker will be inserted before the matched hint

Multi-Word Hint Matching

Hints support multi-word phrases with robust normalization:
import { createHints, markTokensWithDividers } from 'paragrafs';

const hints = createHints(
  { normalizeAlef: true, normalizeYa: true },
  'أحسن الله إليكم',      // 3-word phrase
  'بارك الله فيكم',        // 3-word phrase  
  'جزاكم الله خيرا'       // 3-word phrase
);

const tokens = [
  { start: 0, end: 1, text: 'أَحْسَنَ' },
  { start: 1, end: 2, text: 'اللهُ' },
  { start: 2, end: 3, text: 'إلَيْكُمْ،' },
  { start: 4, end: 5, text: 'الحمد' },
  { start: 5, end: 6, text: 'لله' },
];

const marked = markTokensWithDividers(tokens, {
  fillers: [],
  gapThreshold: 2,
  hints,
});

// The 3-word phrase "أحسن الله إليكم" is matched despite variations
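
Conceptually, matching a multi-word hint means comparing a window of consecutive normalized tokens against the normalized words of the phrase. The standalone sketch below illustrates that idea only; `simplify` and `findPhraseStarts` are hypothetical helpers, and `simplify` is deliberately crude (after NFD it strips U+064B to U+065F and U+0670, which folds every hamza carrier to its base letter). It is not how Paragrafs implements matching.

```typescript
type Token = { start: number; end: number; text: string };

// RegExp constructor keeps the \p{P} escape independent of the compile target.
const EDGE_PUNCT = new RegExp('^\\p{P}+|\\p{P}+$', 'gu');

// Hypothetical, deliberately crude normalizer used only for this sketch.
const simplify = (word: string): string =>
  word
    .normalize('NFD')
    .replace(/[\u064B-\u065F\u0670]/g, '') // harakat plus combining hamza/madda
    .replace(/\u0649/g, '\u064A')          // ى → ي
    .replace(EDGE_PUNCT, '')               // leading/trailing punctuation
    .normalize('NFC');

// Returns the index of every token at which the whole phrase begins.
function findPhraseStarts(tokens: Token[], phrase: string): number[] {
  const target = phrase.trim().split(/\s+/).map(simplify).filter(Boolean);
  if (target.length === 0) return [];
  const words = tokens.map((token) => simplify(token.text));
  const starts: number[] = [];
  for (let i = 0; i + target.length <= words.length; i++) {
    if (target.every((word, j) => words[i + j] === word)) starts.push(i);
  }
  return starts;
}

const sampleTokens: Token[] = [
  { start: 0, end: 1, text: '\u0623\u064E\u062D\u0652\u0633\u064E\u0646\u064E' }, // أَحْسَنَ
  { start: 1, end: 2, text: '\u0627\u0644\u0644\u0647\u064F' },                   // اللهُ
  { start: 2, end: 3, text: '\u0625\u0644\u064E\u064A\u0652\u0643\u064F\u0645\u0652\u060C' }, // إلَيْكُمْ،
  { start: 4, end: 5, text: '\u0627\u0644\u062D\u0645\u062F' },                   // الحمد
];

// Phrase: أحسن الله إليكم
const phrase = '\u0623\u062D\u0633\u0646 \u0627\u0644\u0644\u0647 \u0625\u0644\u064A\u0643\u0645';
console.log(findPhraseStarts(sampleTokens, phrase)); // [0]
```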

Punctuation Handling

Paragrafs recognizes Arabic punctuation for segment breaks:
import { isEndingWithPunctuation } from 'paragrafs';

// Arabic and English punctuation
console.log(isEndingWithPunctuation('السلام.')); // true (.)
console.log(isEndingWithPunctuation('كيف حالك؟')); // true (؟)
console.log(isEndingWithPunctuation('مرحبا!')); // true (!)
console.log(isEndingWithPunctuation('والله؛')); // true (؛)
console.log(isEndingWithPunctuation('نعم…')); // true (…)
Supported punctuation marks:
  • Period: .
  • Question mark: ? and ؟ (Arabic)
  • Exclamation: !
  • Semicolon: ؛ (Arabic)
  • Ellipsis: …
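
A check of this kind can be approximated with a single regular expression. The sketch below is a standalone illustration that covers only the marks listed above; `endsWithListedPunctuation` is a hypothetical helper, not the implementation behind `isEndingWithPunctuation`.

```typescript
// \u061F = ؟ (Arabic question mark), \u061B = ؛ (Arabic semicolon), \u2026 = … (ellipsis)
const SENTENCE_END = /[.?!\u061F\u061B\u2026]$/;

// Hypothetical helper: true when the text ends with one of the listed marks.
function endsWithListedPunctuation(text: string): boolean {
  // Ignoring surrounding whitespace is a choice made for this sketch.
  return SENTENCE_END.test(text.trim());
}

console.log(endsWithListedPunctuation('\u0646\u0639\u0645\u2026')); // نعم… → true
console.log(endsWithListedPunctuation('\u0646\u0639\u0645'));       // نعم → false
console.log(endsWithListedPunctuation('Done?  '));                  // true
```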

Complete Example

Here’s a full workflow for processing Arabic transcriptions:
import {
  generateHintsFromSegments,
  createHints,
  markAndCombineSegments,
  formatSegmentsToTimestampedTranscript,
} from 'paragrafs';

// Arabic transcription segments
const segments = [
  {
    start: 0,
    end: 5,
    text: 'بسم الله الرحمن الرحيم',
    tokens: [
      { start: 0, end: 1, text: 'بسم' },
      { start: 1, end: 2, text: 'الله' },
      { start: 2, end: 3, text: 'الرحمن' },
      { start: 3, end: 4, text: 'الرحيم' },
    ],
  },
  // ... more segments ...
];

// Shared normalization options, so that mining and matching agree
const normalization = {
  normalizeAlef: true,
  normalizeYa: true,
  removeTatweel: true,
};

// Step 1: Mine common Arabic phrases
const mined = generateHintsFromSegments(segments, {
  minN: 2,
  maxN: 5,
  minCount: 2,
  normalization,
});

// Step 2: Create hints from discovered phrases
const hints = createHints(
  normalization,
  ...mined.slice(0, 20).map(h => h.phrase)
);

// Step 3: Process with hints
const options = {
  fillers: [],
  gapThreshold: 2,
  maxSecondsPerSegment: 15,
  minWordsPerSegment: 4,
  hints,
};

const marked = markAndCombineSegments(segments, options);
const transcript = formatSegmentsToTimestampedTranscript(marked, 10);

console.log(transcript);

Best Practices

  • Ensure createHints, generateHintsFromTokens, and markTokensWithDividers all use the same normalization options. Mismatched settings will prevent hints from matching.
  • For Arabic ASR, always enable normalizeAlef and normalizeYa, as these variants are commonly confused by speech recognition systems.
  • Hamza normalization (normalizeHamza) can be aggressive and may collapse semantically different words. Only enable it if you’re seeing hamza-related matching issues.
  • If your transcriptions mix Arabic and English, test normalization on sample data to ensure English words aren’t affected unexpectedly.
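
One concrete thing to look for in mixed-language data is accent stripping. The Unicode Ranges section lists \p{Mn} among the removed characters; if that removal applies to every token regardless of script (an assumption, since this page does not say), accented Latin letters would lose their accents. The standalone sketch below shows the effect with a hypothetical `stripCombiningMarks` helper and does not call Paragrafs.

```typescript
// RegExp constructor keeps the \p{Mn} escape independent of the compile target.
const COMBINING_MARKS = new RegExp('\\p{Mn}', 'gu');

// Hypothetical helper: decompose, drop every combining mark, recompose.
const stripCombiningMarks = (text: string): string =>
  text.normalize('NFD').replace(COMBINING_MARKS, '').normalize('NFC');

console.log(stripCombiningMarks('caf\u00E9'));  // café → cafe
console.log(stripCombiningMarks('na\u00EFve')); // naïve → naive
console.log(stripCombiningMarks('hello'));      // hello (unchanged)
```

Running a handful of real tokens through the library and comparing the results against the originals is a quick way to confirm whether this matters for your data.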

Normalization Reference

Characters Affected

| Option | Input Characters | Output |
| --- | --- | --- |
| normalizeAlef | أ إ آ | ا |
| normalizeYa | ى | ي |
| normalizeHamza | ؤ (waw+hamza), ئ (ya+hamza) | ء |
| removeTatweel | ـ | (removed) |
| Always removed | َ ُ ِ ً ٌ ٍ ّ ْ ٓ ٰ | (removed) |

Unicode Ranges

  • Arabic diacritics: \u064B-\u065F
  • Combining marks: \p{Mn} (Unicode category)
  • Zero-width chars: \u200B-\u200D, \uFEFF
  • Punctuation: \p{P} (Unicode category)

Next Steps

  • Basic Usage: Review the fundamentals of Paragrafs
  • Auto-Hint Generation: Learn how to mine frequent Arabic phrases