Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet
arXiv:2606.20993v1 Announce Type: new Abstract: Multilingual language models often exhibit performance disparities across languages that can arise as early as the tokenization stage. Widely-used subword tokenization approaches favor high-resource languages, and tokenizer-free methods still yield longer sequences for scripts with a higher bytes-per-character ratio. To address these shortcomings, we propose to use the International Phonetic Alphabet (IPA) as a language-agnostic input representatio...
arXiv cs.CL
·Milan Mileti\'c, Julie Kallini, Ekaterina Shutova
·
// relacionados
Leia também
Blog
How Businesses Are Building Specialized AI They Can Trust
Blog
Fika Jobs raises $4M to build a video-first hiring platform where AI agents interview candidates
Blog
Build real agentic apps using CUGA: two dozen working examples on a lightweight harness
Blog