A Wide List of Phoneme-Labeled (Human & Force-Aligned) Speech Datasets Across High- and Low-Resource Languages

Introduction

Phoneme recognition is a fundamental problem in speech processing, and it is particularly hard in low-resource speech recognition: collecting human-labeled, time-stamped phoneme annotations is expensive and labor-intensive, especially for low-resource languages. Below is a compiled list of phoneme-labeled datasets, spanning hand-labeled corpora, force-aligned corpora, and multilingual and endangered-language collections.

1. Hand-Labeled Time-Aligned Phoneme Datasets — English

Dataset | Language | Size | Annotation Type | Speech Type | Notes
TIMIT (MIT / SRI / TI, 1993) | English | 5.4 hrs, 630 speakers | Expert hand-labeled, time-aligned | Read speech | 61-phone ARPAbet inventory, foldable to 48 or 39. Gold standard for phoneme recognition. Available via LDC (LDC93S1).
Buckeye Corpus (Pitt et al., 2007) | English | 40 hrs, 40 speakers | Hand-corrected, time-aligned | Spontaneous / conversational | Phoneme- and word-level annotations; hand-verified alignments. Ohio American English speakers. Freely available.
TORGO (Rudzicz et al., 2010) | English | 23 hrs, 15 speakers | Expert labeled, time-aligned | Read / elicited (dysarthric + control) | Dysarthric speech from speakers with CP or ALS, plus controls. Phoneme + articulatory (EMA) annotations. Clinically assessed. Publicly available.
MOCHA-TIMIT (Univ. of Edinburgh, 1999) | English | 0.5 hrs, 2 speakers | Expert labeled, time-aligned | Read speech | 460 TIMIT sentences with acoustic + EMA articulatory measurements. Phoneme boundaries hand-annotated. Free download.
OGI Multi-Language Telephone Speech (OGI, 1994) | English (+9 others) | 3 hrs | Expert labeled | Telephone / spontaneous | Human-labeled phoneme segments for 10 languages including English; used in language identification (LID) and phoneme recognition research. LDC94S17.

2. Human and Force-Aligned Time-Stamped Phoneme Datasets — Japanese

Dataset | Language | Size | Annotation Type | Speech Type | Notes
CSJ, Corpus of Spontaneous Japanese (Maekawa, 2003) | Japanese | 660 hrs, 1,400+ speakers | Expert (core subset), auto (rest) | Spontaneous (academic/public speech) | Core subset (45 hrs, 500K words) has manual phonetic labels + intonation annotation by experts. The remainder has orthographic transcriptions only. NII Japan.
JSUT (Sonobe et al., 2017) | Japanese | 10 hrs, 1 speaker | Manual TTS labels | Read speech (single female) | 5,000 sentences with manually annotated phoneme + prosody labels for TTS. Freely available on GitHub.

3. Phoneme-Labeled Dataset — Arabic

Dataset | Language | Size | Annotation Type | Speech Type | Notes
Arabic Speech Corpus, MSA (Halabi, 2016) | Arabic | 3.7 hrs, 1 speaker | Expert phoneme-level, time-aligned | Read speech (MSA) | Phonetic + orthographic transcriptions with word stress marks. Built for Arabic TTS/ASR research. Publicly available.

4. Phoneme-Labeled Dataset — Chinese

Dataset | Language | Size | Annotation Type | Speech Type | Notes
AISHELL-3 (AISHELL, 2020) | Mandarin | 85 hrs, 218 speakers | Professional annotation | Read speech (TTS) | Word- and tone-level transcriptions professionally annotated. Pinyin/phoneme-level labels included. Tone/prosody accuracy >98%. Free for research.

5. Forced-Alignment Time-Aligned Phoneme Datasets — English

Dataset | Language | Size | Annotation Type | Speech Type | Notes
LibriSpeech | English | 960 hrs, 2,484 speakers | Force-aligned | Read audiobook speech | Phoneme labels obtained through forced alignment. Source: openslr.org/12
Switchboard (Godfrey et al., 1992) | English | 260 hrs, 543 speakers | Partly expert-labeled; rest force-aligned | Telephone / conversational | 5,000 utterances have human phoneme transcriptions (Greenberg et al., 1996); the rest are G2P/force-aligned. LDC2002T43.

6. Multilingual Force- / Human-Aligned Phoneme Datasets

Dataset | Language(s) | Size | Annotation Type | Speech Type | Notes
GlobalPhone (Schultz, 2002) | 22 languages | 400+ hrs, 1,900+ speakers | Native speaker verified | Read speech (news/text) | Covers Arabic, Chinese, German, French, Japanese, Russian, Thai, and more. Phonetic transcriptions via native experts. Commercial license (LDC/ELRA).
NIST BABEL (IARPA / LDC, 2013–2016) | 25 languages | 40 hrs/language | Human transcribed, force-aligned | Conversational telephone | Human-annotated word-level transcriptions; phoneme labels derived via language-specific lexicons + forced alignment. Restricted/LDC license.
Mboshi Corpus (BULB Project, 2017) | Mboshi (Bantu) | 4.4 hrs, 1 speaker | Linguist transcribed | Elicited / field recordings | 5,130 utterances. Phoneme-level transcriptions by linguists; low-resource language documentation corpus. Freely available on GitHub.
DanPASS (Grønnum, 2009) | Danish | 2 hrs | Expert phonetic annotation | Spontaneous (monologue / dialogue) | Phonetically annotated spontaneous speech corpus for Danish. Segmented with Praat by trained phoneticians.
QuranMB.v2 (IQRA Challenge, 2025) | Arabic | 1,643 utterances | 3 expert linguists | Quranic recitation | Phoneme sequences validated by three Arabic linguistic experts. Used for Arabic pronunciation assessment benchmarking.

7. Human Phoneme-Labeled Dataset — Urdu

Dataset | Language | Size | Annotation Type | Speech Type | Notes
Urdu Phonetically Rich Speech Corpus (PRUS / CSaLT · Raza et al., 2009) | Urdu | 70 min / 708 sentences | Phoneme labels by expert linguists; force-aligned | Read speech | Greedy sentence selection covering all Urdu phonemic and triphonemic combinations.

8. Endangered / Low-Resource Force-Aligned Phoneme Datasets (MAUS-Aligner) — DoReCo, 53 Languages

The DoReCo project provides phoneme annotations for 53 typologically diverse, largely endangered or under-documented languages. Phoneme boundaries are aligned automatically with MAUS, and the alignments are manually verified at the word level. Most of the material consists of narrative or conversational field recordings.

# | Language (Glottocode) | Family | Region | Size (approx.) | Speakers | Annotation Type | Speech Type
1 | Anal (anal1239) | Sino-Tibetan | Papua/SE Asia | 2 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative / monologue
2 | Arapaho (arap1274) | Algic | N. America | 1.4 hrs | 10 | MAUS auto-align, word-lvl manual | Narrative
3 | Asimjeeg Datooga (dato1239) | Nilotic | Africa (Tanzania) | 2 hrs | 10 | MAUS auto-align, word-lvl manual | Narrative
4 | Baïnounk Gubëeher (bain1259) | Atlantic-Congo | Africa (Senegal) | 2 hrs | 8 | MAUS auto-align, word-lvl manual | Narrative
5 | Beja (beja1238) | Afro-Asiatic | Africa (Sudan/Eritrea) | 1.8 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative
6 | Bora (bora1263) | Boran | S. America (Peru) | 1.7 hrs | 8 | MAUS auto-align, word-lvl manual | Narrative
7 | Cabécar (cabe1245) | Chibchan | C. America (Costa Rica) | 2 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative
8 | Cashinahua (cash1254) | Panoan | S. America (Peru/Brazil) | 2 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative
9 | Daakie (port1286) | Austronesian | Pacific (Vanuatu) | 0.9 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative
10 | Dalabon (dala1245) | Gunwinyguan | Australia | 2 hrs | 4 | MAUS auto-align, word-lvl manual | Narrative
11 | Dolgan (dolg1241) | Turkic | Eurasia (Siberia) | 2 hrs | 8 | MAUS auto-align, word-lvl manual | Narrative
12 | English (DoReCo) (stan1293) | Indo-European | Eurasia | 2 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative
13 | Evenki (even1259) | Tungusic | Eurasia (Siberia) | 3 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative
14 | Fanbyak (orko1234) | Austronesian | Pacific (Vanuatu) | 2 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative
15 | French (Swiss) (swis1247) | Indo-European | Eurasia (Switzerland) | 2 hrs | 10 | MAUS auto-align, word-lvl manual | Conversation
16 | Goemai (goem1240) | Afro-Asiatic | Africa (Nigeria) | 2 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative
17 | Gorwaa (goro1270) | Afro-Asiatic | Africa (Tanzania) | 1 hr | 8 | MAUS auto-align, word-lvl manual | Narrative
18 | Hoocąk (hoch1243) | Siouan | N. America (Wisconsin) | 2 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative
19 | Jahai (jaha1242) | Austroasiatic | SE Asia (Malaysia) | 2 hrs | 8 | MAUS auto-align, word-lvl manual | Narrative / conversation
20 | Jejuan (jeju1234) | Koreanic | E. Asia (Korea) | 1 hr | 8 | MAUS auto-align, word-lvl manual | Narrative
21 | Kakabe (kaka1277) | Mande | Africa (Guinea) | 2 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative
22 | Kamas (kama1371) | Uralic | Eurasia (Siberia) | 1 hr | 1 (extinct) | MAUS auto-align, word-lvl manual | Elicited (archival)
23 | Komnzo (komn1238) | Yam | Papua New Guinea | 1.2 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative
24 | Light Warlpiri (ligh1234) | Mixed / Pama-Nyungan | Australia | 2 hrs | 8 | MAUS auto-align, word-lvl manual | Conversation
25 | Lower Sorbian (lowe1385) | Indo-European | Eurasia (Germany) | 2 hrs | 8 | MAUS auto-align, word-lvl manual | Narrative
26 | Mojeño Trinitario (trin1278) | Arawakan | S. America (Bolivia) | 1.6 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative
27 | Movima (movi1243) | Language isolate | S. America (Bolivia) | 1.3 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative
28 | Nafsan (S. Efate) (sout2856) | Austronesian | Pacific (Vanuatu) | 2 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative
29 | Nisvai (nisv1234) | Austronesian | Pacific (Vanuatu) | 2 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative
30 | Northern Alta (nort2875) | Austronesian | SE Asia (Philippines) | 2 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative
31 | N. Kurdish (Kurmanji) (nort2641) | Indo-European | Eurasia (Middle East) | 2 hrs | 10 | MAUS auto-align, word-lvl manual | Narrative
32 | Nǁng (nngg1234) | Tuu (Khoisan) | Africa (S. Africa) | 0.85 hrs | 4 | MAUS auto-align, word-lvl manual | Narrative
33 | Pnar (pnar1238) | Austroasiatic | SE Asia (India) | 2 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative
34 | Resígaro (resi1247) | Arawakan | S. America (Peru) | 1 hr | 3 | MAUS auto-align, word-lvl manual | Narrative
35 | Ruuli (ruul1235) | Atlantic-Congo (Bantu) | Africa (Uganda) | 1.2 hrs | 8 | MAUS auto-align, word-lvl manual | Narrative
36 | Sadu (sadu1234) | Sino-Tibetan | E. Asia (China) | 2 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative
37 | Sanzhi Dargwa (sanz1248) | Nakh-Daghestanian | Eurasia (Caucasus) | 2 hrs | 8 | MAUS auto-align, word-lvl manual | Narrative
38 | Savosavo (savo1255) | Language isolate | Solomon Islands | 1.3 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative
39 | Sümi (sumi1235) | Sino-Tibetan | SE Asia (India/Nagaland) | 0.8 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative
40 | Svan (svan1243) | Kartvelian | Eurasia (Georgia) | 2 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative
41 | Tabaq (Karko) (kark1256) | Nubian | Africa (Sudan) | 2 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative
42 | Tabasaran (taba1259) | Nakh-Daghestanian | Eurasia (Caucasus) | 2 hrs | 8 | MAUS auto-align, word-lvl manual | Narrative
43 | Teop (teop1238) | Austronesian | Papua New Guinea | 1 hr | 6 | MAUS auto-align, word-lvl manual | Narrative
44 | Texistepec Popoluca (texi1237) | Zoque | C. America (Mexico) | 1.1 hrs | 1 (archival) | MAUS auto-align, word-lvl manual | Narrative (archival)
45 | Urum (urum1249) | Turkic | Eurasia (Georgia) | 2 hrs | 30 (largest) | MAUS auto-align, word-lvl manual | Narrative
46 | Vera'a (vera1241) | Austronesian | Pacific (Vanuatu) | 2 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative
47 | Warlpiri (warl1254) | Pama-Nyungan | Australia | 2 hrs | 8 | MAUS auto-align, word-lvl manual | Narrative / conversation
48 | Yali (Apahapsili) (apah1238) | Trans-New-Guinea | Papua (Indonesia) | 0.45 hrs | 4 | MAUS auto-align, word-lvl manual | Narrative
49 | Yongning Na (naxi1246) | Sino-Tibetan | E. Asia (China/SW) | 2 hrs | 1 (single) | MAUS auto-align, word-lvl manual | Narrative
50 | Yucatec Maya (yuca1254) | Mayan | C. America (Mexico) | 2 hrs | 8 | MAUS auto-align, word-lvl manual | Narrative
51 | Yurakaré (yura1261) | Language isolate | S. America (Bolivia) | 2 hrs | 5 | MAUS auto-align, word-lvl manual | Narrative
52 | Gurindji (guri1247) | Pama-Nyungan | Australia | 2 hrs | 8 | MAUS auto-align, word-lvl manual | Narrative
53 | Totoli (toto1305) | Austronesian | SE Asia (Indonesia) | 2 hrs | 6 | MAUS auto-align, word-lvl manual | Narrative

Takeaway:

High-resource languages such as English, Japanese, Mandarin, and Arabic benefit from large, professionally or expert-annotated phoneme corpora, while most of the world's languages rely on smaller, forced-aligned, or field-linguist-annotated resources. The DoReCo collection is currently the widest single source of phoneme-labeled data for endangered and under-documented languages, covering 53 languages with MAUS-based forced alignment verified at the word level.

Fine-tuning Pre-trained Models with Multi-tasks

Introduction

When a pre-trained model is fine-tuned for a single task (e.g., speech recognition), it adapts to that task but can lose much of what it learned during pre-training, a phenomenon commonly referred to as catastrophic forgetting.

Approach

In this work, we fine-tune a self-supervised model — specifically Wav2Vec2.0-base — for multiple downstream tasks simultaneously: speech recognition, emotion recognition, and speaker identification.

We prepared a speech dataset in which each audio sample carries three labels: text transcription, emotion, and speaker identity. The dataset is sourced from the Combined Dataset for Speech Emotion Recognition.

The pre-trained model is fine-tuned on approximately 22 hours of labeled speech data and evaluated on 4 hours of held-out test data. For speech recognition, the CTC objective is used, while for emotion recognition and speaker identification, cross-entropy loss is applied.
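To make the CTC objective mentioned above more concrete, here is a minimal pure-Python sketch of the CTC negative log-likelihood for a single utterance, computed with the forward algorithm. It is an illustration only, not the training code used in this work (in practice a framework implementation would be used); the function names and the toy two-frame input are our own assumptions.

```python
import math

NEG_INF = -math.inf


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    peak = max(values)
    if peak == NEG_INF:
        return NEG_INF
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def ctc_negative_log_likelihood(log_probs, target, blank=0):
    """CTC loss for one utterance via the forward algorithm.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    target:    list of symbol ids without blanks.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank].
    extended = [blank]
    for symbol in target:
        extended.extend([symbol, blank])
    size = len(extended)

    # alpha[s] = log-probability of all partial alignments ending in state s.
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][blank]
    if size > 1:
        alpha[1] = log_probs[0][extended[1]]

    for frame in log_probs[1:]:
        updated = [NEG_INF] * size
        for s in range(size):
            candidates = [alpha[s]]  # stay in the same state
            if s >= 1:
                candidates.append(alpha[s - 1])  # advance by one state
            # Skipping a blank is allowed only between two different symbols.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                candidates.append(alpha[s - 2])
            updated[s] = log_sum_exp(candidates) + frame[extended[s]]
        alpha = updated

    # Valid alignments end in the last symbol or the trailing blank.
    final = [alpha[-1]] + ([alpha[-2]] if size > 1 else [])
    return -log_sum_exp(final)


# Toy example (hypothetical): 2 frames, vocabulary {0: blank, 1: "a"}, uniform outputs.
uniform = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
loss = ctc_negative_log_likelihood(uniform, target=[1])
print(round(loss, 4))  # 0.2877, i.e. -log(0.75)
```

In the toy example, three of the four possible two-frame alignments ("a a", "a blank", "blank a") collapse to the target "a", so the target probability is 0.75.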

Results

The table below summarizes our experimental setup and results using a 22/4/4 hour train/validation/test split.

Fine-tune Data (hrs) | Validation Data (hrs) | Test Data (hrs) | Speech Recognition CER (%) | Emotion Recognition Accuracy (%) | Speaker Identification Accuracy (%)
22 | 4 | 4 | 24 | 60 | 78

Key Results:

We achieved a 24% Character Error Rate (CER) for speech recognition (without a language model), 60% accuracy in emotion recognition, and 78% accuracy in speaker identification. Emotion recognition proved to be the most challenging task in this multi-task setup.
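For readers unfamiliar with the metric, CER is the character-level edit distance between the reference and the recognized text, divided by the reference length. The sketch below is a minimal illustration of that definition; the function names and the example strings are hypothetical and are not taken from our evaluation scripts.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = character edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: one deleted character and one substituted character.
cer = character_error_rate("hello world", "helo wurld")
print(f"{cer:.2%}")  # 18.18% (2 errors over 11 reference characters)
```

Because the function works on any sequence, passing lists of phoneme symbols instead of strings gives a Phoneme Error Rate in the same way.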

Remarks

Multi-task learning is more challenging than standalone single-task training, because individual tasks tend to converge at different rates. A practical remedy is to assign different loss weights to each task, allowing the training process to give more attention to harder tasks and improve overall performance.
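The loss-weighting idea can be sketched as a weighted sum of the per-task losses. The snippet below is only an illustration: the task names, loss values, and weights are hypothetical placeholders, not the values used in our experiments.

```python
def weighted_multitask_loss(task_losses, task_weights):
    """Weighted sum of per-task losses; both arguments are dicts keyed by task name."""
    missing = set(task_losses) - set(task_weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(task_weights[name] * loss for name, loss in task_losses.items())


# Hypothetical per-batch losses for the three tasks.
losses = {"asr_ctc": 1.8, "emotion_ce": 1.2, "speaker_ce": 0.4}

# Hypothetical weights: the harder task (emotion) gets a larger weight.
weights = {"asr_ctc": 1.0, "emotion_ce": 2.0, "speaker_ce": 0.5}

total = weighted_multitask_loss(losses, weights)
print(round(total, 2))  # 4.4
```

Setting all weights to 1.0 recovers the plain sum of the three losses, so the weights can be tuned on the validation set starting from that baseline.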

Is Phoneme Recognition a Solved Problem?

Introduction

A phoneme is the basic unit of sound in a language that distinguishes one word from another. Phoneme recognition refers to the task of converting speech signals into sequences of phonemic units. This task is particularly important for applications such as pronunciation training, language learning, and speech recognition.

The figure below shows the phoneme- and word-level transcription of a sample recording, displayed in Praat.

Data Preparation

Phoneme Recognition (PR) tasks require recorded utterances and corresponding phoneme sequences prepared by linguistic experts. Sometimes timestamps are also provided for frame-level recognition.
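As an illustration of how timestamps become frame-level targets, the sketch below converts time-stamped phoneme segments into one label per frame. It assumes TIMIT-style .phn lines ("start_sample end_sample phone") and a hop of 320 samples (20 ms at 16 kHz); the hop size, the fill label, and the short excerpt are our own assumptions, and the excerpt is made up rather than copied from the corpus.

```python
def parse_phn(text):
    """Parse TIMIT-style lines of the form 'start_sample end_sample phone'."""
    segments = []
    for line in text.strip().splitlines():
        start, end, phone = line.split()
        segments.append((int(start), int(end), phone))
    return segments


def frame_labels(segments, hop=320, fill="sil"):
    """Return one label per hop-sized frame, chosen by the frame's centre sample."""
    if not segments:
        return []
    total_samples = segments[-1][1]
    labels = []
    index = 0
    for frame_start in range(0, total_samples, hop):
        centre = frame_start + hop // 2
        # Move forward to the segment that could contain the frame centre.
        while index < len(segments) - 1 and centre >= segments[index][1]:
            index += 1
        start, end, phone = segments[index]
        # Frames whose centre falls outside every segment get the fill label.
        labels.append(phone if start <= centre < end else fill)
    return labels


# Hypothetical excerpt in .phn format (sample indices at 16 kHz).
excerpt = """
0 640 h#
640 1280 sh
1280 2240 iy
"""

labels = frame_labels(parse_phn(excerpt))
print(labels)  # ['h#', 'h#', 'sh', 'sh', 'iy', 'iy', 'iy']
```

Labeling by the frame centre is one simple convention; majority overlap within the frame is a common alternative when segments are short.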

TIMIT, available from the Linguistic Data Consortium (LDC), is one of the most widely used datasets for phoneme recognition. It provides phoneme- and word-level timestamps and is a standard benchmark for state-of-the-art systems.

How It Works

PR is a classification task, and various neural network architectures have been used for it, such as Multi-Layer Perceptrons (MLP), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN). Reported error rates have fallen steadily, as the results table later in this post shows.

Recently, Self-Supervised Learning (SSL) models have demonstrated state-of-the-art performance for PR tasks. SSL models are first trained using unlabeled speech data with a self-supervised objective. This allows the model to learn abstract speech representations from raw audio.

A linear layer is then attached to the pretrained SSL model and fine-tuned for downstream tasks such as phoneme recognition using limited labeled data.
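Once the linear layer produces a score for every phoneme at every frame, the frame-level outputs still have to be turned into a phoneme sequence. The sketch below shows greedy decoding under the assumption that the head is trained with CTC, which is how wav2vec 2.0 is usually fine-tuned; this post does not prescribe the objective, and the label set and scores here are hypothetical.

```python
def greedy_ctc_decode(frame_scores, labels, blank=0):
    """Pick the best label per frame, merge repeats, then drop blanks."""
    best_per_frame = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded = []
    previous = None
    for index in best_per_frame:
        # Emit a label only when it changes and is not the blank symbol.
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return decoded


# Hypothetical label set and per-frame scores from the linear head.
labels = ["<blank>", "k", "ae", "t"]
frame_scores = [
    [0.1, 0.7, 0.1, 0.1],  # k
    [0.2, 0.6, 0.1, 0.1],  # k (repeat, merged)
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.7, 0.1],  # ae
    [0.1, 0.1, 0.6, 0.2],  # ae (repeat, merged)
    [0.9, 0.0, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.1, 0.7],  # t
]

phonemes = greedy_ctc_decode(frame_scores, labels)
print(phonemes)  # ['k', 'ae', 't']
```

Greedy decoding needs no language model, which matches the usual setup for reporting phoneme error rates; beam search is a possible refinement.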

Existing SSL models include wav2vec, VQ-wav2vec, and wav2vec 2.0, all of which appear in the results table below.

In the next post, we will compare the performance of these pretrained models on the TIMIT dataset.

Results

The following table summarizes selected phoneme recognition research papers from 2008–2020 evaluated on the TIMIT dataset using Phoneme Error Rate (PER).

# | Paper | Year | PER (%)
1 | Phoneme recognition in TIMIT with BLSTM-CTC | 2008 | 24.4
2 | Speech recognition with deep recurrent neural networks | 2013 | 17.70
3 | Attention-based recurrent neural networks | 2014 | 18.57
4 | Convolutional deep maxout networks | 2014 | 17.76
5 | Segmental recurrent neural networks | 2016 | 17.30
6 | Recurrent DNN ensembles on TIMIT | 2018 | 14.69
7 | wav2vec | 2019 | 14.70
8 | VQ-wav2vec | 2019 | 11.64
9 | wav2vec 2.0 | 2020 | 8.30

Observation:

The table shows that phoneme recognition performance on TIMIT has improved substantially: PER fell from 24.4% (BLSTM-CTC, 2008) to 8.30% (wav2vec 2.0, 2020), a relative reduction of about 66%, with the lowest error rates coming from self-supervised learning approaches.

Discussion

Although current PR systems achieve impressive performance on English datasets such as TIMIT, these models are often pretrained and fine-tuned on the same high-resource language.

An important open question is whether similar performance can be achieved for low-resource and unseen languages.

Some important research questions include:

Collecting high-quality training data for low-resource languages remains challenging because recordings often contain environmental and background noise.