Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet

arXiv:2606.20993v1. Abstract: Multilingual language models often exhibit performance disparities across languages that can arise as early as the tokenization stage. Widely-used subword tokenization approaches favor high-resource languages, and tokenizer-free methods still yield longer sequences for scripts with a higher bytes-per-character ratio. To address these shortcomings, we propose to use the International Phonetic Alphabet (IPA) as a language-agnostic input representatio...
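
The bytes-per-character point in the abstract can be illustrated with a small sketch. The helper name `utf8_bytes_per_char` and the sample words are illustrative assumptions, not taken from the paper; the snippet only counts UTF-8 bytes per Unicode code point and does not reproduce the authors' IPA-based method.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample words (each means "language"), chosen only to
# contrast scripts whose code points need 1, 2, and 3 bytes in UTF-8.
samples = {
    "Latin (English)": "language",
    "Cyrillic (Russian)": "язык",
    "Devanagari (Hindi)": "भाषा",
}

for script, word in samples.items():
    ratio = utf8_bytes_per_char(word)
    print(f"{script}: {len(word)} code points, "
          f"{len(word.encode('utf-8'))} bytes, {ratio:.1f} bytes/char")
```

A byte-level model sees one input position per byte, so under this encoding a Devanagari word occupies about three times as many positions per code point as a Latin-script word.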

arXiv cs.CL · Milan Miletić, Julie Kallini, Ekaterina Shutova