# Training Guide
Check out a [video training guide by Thorsten Müller](https://www.youtube.com/watch?v=b_we_jma220).

For Windows, see [ssamjh's guide using WSL](https://ssamjh.nz/create-custom-piper-tts-voice/).

---
Training a voice for Piper involves 3 main steps:

1. Preparing the dataset
2. Training the voice model
3. Exporting the voice model

Choices must be made at each step, including:

* The model "quality"
    * low = 16,000 Hz sample rate, [smaller voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L30)
    * medium = 22,050 Hz sample rate, [smaller voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L30)
    * high = 22,050 Hz sample rate, [larger voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L45)
* Single or multiple speakers
* Fine-tuning an [existing model](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main) or training from scratch
* Exporting to [onnx](https://github.com/microsoft/onnxruntime/) or PyTorch
## Getting Started
Start by installing system dependencies:
``` sh
sudo apt-get install python3-dev
```
Then create a Python virtual environment:
``` sh
cd piper/src/python
python3 -m venv .venv
source .venv/bin/activate
pip3 install --upgrade pip
pip3 install --upgrade wheel setuptools
pip3 install -e .
```
Run the `build_monotonic_align.sh` script in the `src/python` directory to build the extension.
Ensure you have [espeak-ng](https://github.com/espeak-ng/espeak-ng/) installed (`sudo apt-get install espeak-ng`).
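
For example, from the repository root these two steps might look like this (assuming the Debian/Ubuntu setup used throughout this guide):

```sh
# Build the monotonic alignment extension (run inside the virtual environment):
cd piper/src/python
bash build_monotonic_align.sh

# Install espeak-ng, which is used for phonemization:
sudo apt-get install espeak-ng
```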
## Preparing a Dataset
The Piper training scripts expect two files that can be generated by `python3 -m piper_train.preprocess` (a quick way to inspect them is shown after the field list below):
* A `config.json` file with the voice settings
    * `audio` (required)
        * `sample_rate` - audio rate in hertz
    * `espeak` (required)
        * `language` - espeak-ng voice or [alphabet](https://github.com/rhasspy/piper-phonemize/blob/master/src/phoneme_ids.cpp)
    * `num_symbols` (required)
        * Number of phonemes in the model (typically 256)
    * `num_speakers` (required)
        * Number of speakers in the dataset
    * `phoneme_id_map` (required)
        * Map from a phoneme (UTF-8 codepoint) to a list of ids
        * Id 0 ("_") is padding (pad)
        * Id 1 ("^") is the beginning of an utterance (bos)
        * Id 2 ("$") is the end of an utterance (eos)
        * Id 3 (" ") is a word separator (whitespace)
    * `phoneme_type`
        * "espeak" or "text"
        * "espeak" phonemes use [espeak-ng](https://github.com/rhasspy/espeak-ng)
        * "text" phonemes use a pre-defined [alphabet](https://github.com/rhasspy/piper-phonemize/blob/master/src/phoneme_ids.cpp)
    * `speaker_id_map`
        * Map from a speaker name to id
    * `phoneme_map`
        * Map from a phoneme (UTF-8 codepoint) to a list of phonemes
    * `inference`
        * `noise_scale` - noise added to the generator (default: 0.667)
        * `length_scale` - speaking speed (default: 1.0)
        * `noise_w` - phoneme width variation (default: 0.8)
* A `dataset.jsonl` file with one line per utterance (JSON objects)
    * `phoneme_ids` (required)
        * List of ids for each utterance phoneme (0 <= id < `num_symbols`)
    * `audio_norm_path` (required)
        * Absolute path to [normalized audio](https://github.com/rhasspy/piper/tree/master/src/python/piper_train/norm_audio) file (`.pt`)
    * `audio_spec_path` (required)
        * Absolute path to [audio spectrogram](https://github.com/rhasspy/piper/blob/fda64e7a5104810a24eb102b880fc5c2ac596a38/src/python/piper_train/vits/mel_processing.py#L40) file (`.pt`)
    * `speaker_id` (required for multi-speaker)
        * Id of the utterance's speaker (0 <= id < `num_speakers`)
    * `audio_path`
        * Absolute path to original audio file
    * `text`
        * Original text of utterance before phonemization
    * `phonemes`
        * Phonemes from utterance text before converting to ids
    * `speaker`
        * Name of utterance speaker (from `speaker_id_map`)
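
After running the pre-processing step described below, a quick way to sanity-check these two files is to pretty-print them (the training directory path is a placeholder):

```sh
# Pretty-print the generated voice settings:
python3 -m json.tool /path/to/training_dir/config.json

# Look at the first utterance entry (one JSON object per line):
head -n 1 /path/to/training_dir/dataset.jsonl | python3 -m json.tool
```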
### Dataset Format
The pre-processing script expects data to be a directory with:
* `metadata.csv` - CSV file with text, audio filenames, and speaker names
* `wav/` - directory with audio files

The `metadata.csv` file uses `|` as a delimiter and has 2 or 3 columns, depending on whether the dataset has a single speaker or multiple speakers.
There is no header row.
For single speaker datasets:
```csv
id|text
```
where `id` is the name of the WAV file in the `wav` directory. For example, an `id` of `1234` means that `wav/1234.wav` should exist.
For multi-speaker datasets:
```csv
id|speaker|text
```
where `speaker` is the name of the utterance's speaker. Speaker ids will automatically be assigned based on the number of utterances per speaker (speaker id 0 has the most utterances).
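
Before pre-processing, it can help to confirm that the directory layout matches what the script expects; a minimal check with placeholder paths:

```sh
# Expected layout:
#   dataset_dir/
#   ├── metadata.csv    # '|'-delimited, no header: id|text or id|speaker|text
#   └── wav/            # one WAV file per id, e.g. wav/1234.wav
ls /path/to/dataset_dir/
ls /path/to/dataset_dir/wav/ | head
```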
### Pre-processing
An example of pre-processing a single speaker dataset:
``` sh
python3 -m piper_train.preprocess \
    --language en-us \
    --input-dir /path/to/dataset_dir/ \
    --output-dir /path/to/training_dir/ \
    --dataset-format ljspeech \
    --single-speaker \
    --sample-rate 22050
```
The `--language` argument refers to an [espeak-ng voice](https://github.com/espeak-ng/espeak-ng/) by default, such as `de` for German.
To pre-process a multi-speaker dataset, remove the `--single-speaker` flag and ensure that your dataset has the 3 columns: `id|speaker|text`.

Verify the number of speakers in the generated `config.json` file before proceeding.
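
For reference, a multi-speaker run is the same command without `--single-speaker` (paths are placeholders), and the speaker count can then be read back out of `config.json`:

```sh
python3 -m piper_train.preprocess \
    --language en-us \
    --input-dir /path/to/dataset_dir/ \
    --output-dir /path/to/training_dir/ \
    --dataset-format ljspeech \
    --sample-rate 22050

# num_speakers should match the number of speakers in metadata.csv:
python3 -c "import json; print(json.load(open('/path/to/training_dir/config.json'))['num_speakers'])"
```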
## Training a Model
Once you have a `config.json`, `dataset.jsonl`, and audio files (`.pt`) from pre-processing, you can begin the training process with `python3 -m piper_train`.

In most cases, you should fine-tune from [an existing model](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main). The checkpoint must have the same audio quality and sample rate as your target voice, but does not necessarily need to be in the same language.

It is **highly recommended** to train with the following `Dockerfile`:
``` dockerfile
FROM nvcr.io/nvidia/pytorch:22.03-py3
RUN pip3 install \
    'pytorch-lightning'
ENV NUMBA_CACHE_DIR=.numba_cache
```
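
A minimal sketch of building and entering a container from that `Dockerfile` (the image tag, mount paths, and availability of the NVIDIA Container Toolkit are assumptions, not part of this guide):

```sh
# Build the training image from the Dockerfile above:
docker build -t piper-train .

# Start an interactive shell with GPU access and code/data mounted:
docker run --rm -it --gpus all \
    -v /path/to/piper:/piper \
    -v /path/to/training_dir:/training_dir \
    piper-train bash
```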
As an example, we will fine-tune the [medium quality lessac voice](https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main/en/en_US/lessac/medium). Download the `.ckpt` file and run the following command in your training environment:
``` sh
python3 -m piper_train \
    --dataset-dir /path/to/training_dir/ \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --max_epochs 10000 \
    --resume_from_checkpoint /path/to/lessac/epoch=2164-step=1355540.ckpt \
    --checkpoint-epochs 1 \
    --precision 32
```
Use `--quality high` to train a [larger voice model](https://github.com/rhasspy/piper/blob/master/src/python/piper_train/vits/config.py#L45) (sounds better, but is much slower).

You can adjust the validation split (5% = 0.05) and number of test examples for your specific dataset. For fine-tuning, they are often set to 0 because the target dataset is very small.

Batch size can be tricky to get right. It depends on the size of your GPU's vRAM, the model's quality/size, and the length of the longest sentence in your dataset. The `--max-phoneme-ids <N>` argument to `piper_train` will drop sentences that have more than `N` phoneme ids. In practice, using `--batch-size 32` and `--max-phoneme-ids 400` will work for 24 GB of vRAM (RTX 3090/4090).
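
To help pick a value for `--max-phoneme-ids`, you can look at the `phoneme_ids` lengths in your `dataset.jsonl`; a small sketch with a placeholder path:

```sh
python3 -c "
import json
lengths = [len(json.loads(line)['phoneme_ids'])
           for line in open('/path/to/training_dir/dataset.jsonl')]
print('utterances:', len(lengths))
print('longest (phoneme ids):', max(lengths))
print('average (phoneme ids):', sum(lengths) // len(lengths))
"
```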
### Multi-Speaker Fine-Tuning
If you're training a multi-speaker model, use `--resume_from_single_speaker_checkpoint` instead of `--resume_from_checkpoint`. This will be *much* faster than training your multi-speaker model from scratch.
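
As a sketch, the fine-tuning command from above changes only in its checkpoint flag (the single-speaker checkpoint path is a placeholder):

```sh
python3 -m piper_train \
    --dataset-dir /path/to/training_dir/ \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --max_epochs 10000 \
    --resume_from_single_speaker_checkpoint /path/to/single_speaker.ckpt \
    --checkpoint-epochs 1 \
    --precision 32
```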
### Testing
To test your voice during training, you can use [these test sentences](https://github.com/rhasspy/piper/tree/master/etc/test_sentences) or generate your own with [piper-phonemize](https://github.com/rhasspy/piper-phonemize/). Run the following command to generate audio files:
```sh
cat test_en-us.jsonl | \
  python3 -m piper_train.infer \
    --sample-rate 22050 \
    --checkpoint /path/to/training_dir/lightning_logs/version_0/checkpoints/*.ckpt \
    --output-dir /path/to/training_dir/output
```
The input format to `piper_train.infer` is the same as `dataset.jsonl`: one line of JSON per utterance with `phoneme_ids` and `speaker_id` (multi-speaker only). Generate your own test file with [piper-phonemize](https://github.com/rhasspy/piper-phonemize/):
```sh
lib/piper_phonemize -l en-us --espeak-data lib/espeak-ng-data/ < my_test_sentences.txt > my_test_phonemes.jsonl
```
### Tensorboard
Check on your model's progress with tensorboard:
```sh
tensorboard --logdir /path/to/training_dir/lightning_logs
```
Click on the scalars tab and look at both `loss_disc_all` and `loss_gen_all`. In general, the model is "done" when `loss_disc_all` levels off. We've found that 2000 epochs is usually good for models trained from scratch, and an additional 1000 epochs when fine-tuning.
## Exporting a Model
When your model is finished training, export it to onnx with:
```sh
python3 -m piper_train.export_onnx \
    /path/to/model.ckpt \
    /path/to/model.onnx

cp /path/to/training_dir/config.json \
   /path/to/model.onnx.json
```
The [export script](https://github.com/rhasspy/piper-samples/blob/master/_script/export.sh) does additional optimization of the model with [onnx-simplifier](https://github.com/daquexian/onnx-simplifier).
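
If you want to apply a similar optimization yourself, onnx-simplifier can be run on the exported model; a sketch, assuming the `onnxsim` package is installed and using placeholder paths:

```sh
pip3 install onnxsim
onnxsim /path/to/model.onnx /path/to/model.simplified.onnx
```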
If the export is successful, you can now use your voice with Piper:
```sh
echo 'This is a test.' | \
piper -m /path/to/model.onnx --output_file test.wav
```