misaki 0.9.4
G2P engine for TTS
pip install misaki
Requires Python
<3.13,>=3.8
Dependencies
- addict
- regex
- espeakng-loader; extra == "en"
- num2words; extra == "en"
- phonemizer-fork; extra == "en"
- spacy; extra == "en"
- spacy-curated-transformers; extra == "en"
- mishkal-hebrew>=0.3.2; extra == "he"
- fugashi; extra == "ja"
- jaconv; extra == "ja"
- mojimoji; extra == "ja"
- pyopenjtalk; extra == "ja"
- unidic; extra == "ja"
- jamo; extra == "ko"
- nltk; extra == "ko"
- num2words; extra == "vi"
- spacy; extra == "vi"
- spacy-curated-transformers; extra == "vi"
- underthesea; extra == "vi"
- cn2an; extra == "zh"
- jieba; extra == "zh"
- ordered-set; extra == "zh"
- pypinyin; extra == "zh"
- pypinyin-dict; extra == "zh"
misaki
Misaki is a G2P engine designed for Kokoro models.
Hosted demo: https://hf.co/spaces/hexgrad/Misaki-G2P
English Usage
You can run this in one cell on Google Colab:
!pip install -q "misaki[en]"
from misaki import en
g2p = en.G2P(trf=False, british=False, fallback=None) # no transformer, American English
text = '[Misaki](/misˈɑki/) is a G2P engine designed for [Kokoro](/kˈOkəɹO/) models.'
phonemes, tokens = g2p(text)
print(phonemes) # misˈɑki ɪz ə ʤˈitəpˈi ˈɛnʤən dəzˈInd fɔɹ kˈOkəɹO mˈɑdᵊlz.
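The input text above uses a Markdown-link-like convention, `[word](/phonemes/)`, to supply the phonemes for a word directly. As a rough illustration of that convention only (this is not Misaki's actual parser), the standard-library sketch below pulls such overrides out of a string. The regex and the function name `split_overrides` are assumptions made for this example.

```python
import re

# Hypothetical pattern for the [text](/phonemes/) convention shown above.
OVERRIDE = re.compile(r"\[([^\]]+)\]\(/([^/]+)/\)")

def split_overrides(text):
    """Return (plain_text, overrides), where overrides maps each bracketed
    word to the phoneme string supplied for it."""
    overrides = {m.group(1): m.group(2) for m in OVERRIDE.finditer(text)}
    # Replace each [word](/phonemes/) span with just the word.
    plain = OVERRIDE.sub(lambda m: m.group(1), text)
    return plain, overrides

text = "[Misaki](/misˈɑki/) is a G2P engine designed for [Kokoro](/kˈOkəɹO/) models."
plain, overrides = split_overrides(text)
print(plain)      # Misaki is a G2P engine designed for Kokoro models.
print(overrides)  # {'Misaki': 'misˈɑki', 'Kokoro': 'kˈOkəɹO'}
```

In practice, pass the annotated string straight to `g2p(text)` as shown above; the sketch only makes the markup explicit.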
To fall back to espeak:
# Installing espeak varies across platforms; this silent install works on Colab:
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!pip install -q "misaki[en]" phonemizer-fork
from misaki import en, espeak
fallback = espeak.EspeakFallback(british=False) # en-us
g2p = en.G2P(trf=False, british=False, fallback=fallback) # no transformer, American English
text = 'Now outofdictionary words are handled by espeak.'
phonemes, tokens = g2p(text)
print(phonemes) # nˈW Wɾɑfdˈɪkʃənˌɛɹi wˈɜɹdz ɑɹ hˈændəld bI ˈispik.
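Conceptually, a fallback is consulted only for words that the dictionary cannot resolve. The sketch below illustrates that lookup-then-fallback pattern using only the standard library. `LEXICON`, `toy_fallback`, `phonemize`, and the `unk` placeholder are hypothetical names for this example, not Misaki's implementation. The tiny lexicon reuses phoneme strings from the example output earlier in this section.

```python
# Hypothetical mini-lexicon; phoneme strings copied from the example output above.
LEXICON = {"is": "ɪz", "a": "ə", "for": "fɔɹ", "models": "mˈɑdᵊlz"}

def toy_fallback(word):
    """Stand-in for a real fallback such as espeak: it only flags the word."""
    return f"<{word}>"

def phonemize(words, fallback=None, unk="?"):
    """Look each word up in the lexicon; on a miss, use the fallback if one
    was given, otherwise emit the unknown-word placeholder."""
    out = []
    for word in words:
        ps = LEXICON.get(word.lower())
        if ps is None:
            ps = fallback(word) if fallback is not None else unk
        out.append(ps)
    return " ".join(out)

words = ["outofdictionary", "models"]
no_fallback = phonemize(words)
with_fallback = phonemize(words, fallback=toy_fallback)
print(no_fallback)    # ? mˈɑdᵊlz
print(with_fallback)  # <outofdictionary> mˈɑdᵊlz
```

Passing the fallback as a callable keeps the dictionary path unchanged, which mirrors how `fallback=None` and `fallback=fallback` are swapped in the `en.G2P(...)` calls above.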
English
- https://github.com/explosion/spaCy
- https://github.com/savoirfairelinux/num2words
- https://github.com/hexgrad/misaki/blob/main/EN_PHONES.md
Japanese
The second gen Japanese tokenizer now uses pyopenjtalk with full unidic, enabling pitch accent marks and improved phrase merging. Deep gratitude to @sophiefy for invaluable recommendations and nuanced help with pitch accent.
The first gen Japanese tokenizer mainly relies on cutlet => fugashi => mecab => unidic-lite, with each being a wrapper around the next. Deep gratitude to @Respaired for helping me learn the ropes of Japanese tokenization before any Kokoro model had started training.
- https://github.com/polm/cutlet
- https://github.com/polm/fugashi
- https://github.com/ikegami-yukino/jaconv
- https://github.com/studio-ousia/mojimoji
Korean
The Korean tokenizer is copied from 5Hyeons's g2pkc fork of Kyubyong's widely used g2pK library. Deep gratitude to @5Hyeons for kindly helping with Korean and extending the original code by @Kyubyong.
Chinese
The second gen Chinese tokenizer adapts better logic from paddlespeech's frontend. Jieba now cuts and tags, and pinyin-to-ipa is no longer used.
The first gen Chinese tokenizer uses jieba to cut, pypinyin, and pinyin-to-ipa.
- https://github.com/fxsjy/jieba
- https://github.com/mozillazg/python-pinyin
- https://github.com/stefantaubert/pinyin-to-ipa
Vietnamese
TODO
- Data: Compress data (no need for indented json) and eliminate redundancy between gold and silver dictionaries.
- Fallbacks: Train seq2seq fallback models on dictionaries using this notebook.
- Homographs: Escalate hard words like axes bass bow lead tear wind using BERT contextual word embeddings (CWEs) and logistic regression (LR) models (nn.Linear followed by sigmoid) as described in this paper. Assuming trf=True, BERT CWEs can be accessed via doc._.trf_data, see en.py#L479. Per-word LR models can be trained on WikipediaHomographData, llama-hd-dataset, and LLM-generated data.
- More languages: Add ko.py, ja.py, zh.py.
- Per-language pip installs
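The homograph item above describes a per-word logistic regression model: a linear layer followed by a sigmoid, applied to a contextual embedding. As a minimal sketch of just that computation, here it is in plain Python rather than PyTorch. The 3-dimensional embedding, the weights, and the labels are made up for illustration; real BERT embeddings are far larger, and the weights would be learned from data.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def homograph_probability(embedding, weights, bias):
    """Linear layer followed by sigmoid: probability of 'pronunciation_b'
    for one homograph, given the word's contextual embedding."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return sigmoid(z)

# Hypothetical values for illustration only.
weights = [1.0, -2.0, 0.5]
bias = 0.0
embedding = [2.0, 0.5, 0.0]  # linear score z = 2.0 - 1.0 + 0.0 = 1.0

p = homograph_probability(embedding, weights, bias)
choice = "pronunciation_b" if p >= 0.5 else "pronunciation_a"
print(round(p, 4), choice)  # 0.7311 pronunciation_b
```

One such model per homograph keeps each classifier tiny, since only a weight vector and a bias need to be stored for each hard word.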
Acknowledgements
- 🛠️ Misaki builds on top of many excellent G2P projects linked above.
- 🌐 Thank you to all native speakers who advised and contributed G2P in many languages.
- 👾 Kokoro Discord server: https://discord.gg/QuGxSWBfQy
- 🌸 Misaki is a Japanese name and a character in the Terminator franchise along with Kokoro.
