AdaptiVocab: A Personalized LLM Vocabulary for Your Data
Thursday, 05.12, 13:30
Abstract: In natural language processing (NLP), the model's vocabulary is frequently overlooked or left unchanged. This happens despite the considerable differences in vocabulary across domains and the vocabulary's significant impact on a model's performance and efficiency. Our work introduces AdaptiVocab, a method for tailoring a language model's vocabulary to a specific domain. Using AdaptiVocab, we refine the original model's vocabulary to align with the domain's lexicon: we dynamically identify tokens that are infrequent in the domain and remove them from the vocabulary, and we add carefully chosen n-grams, composed of multiple tokens or multiple words, that better represent the domain. This vocabulary change personalizes the LLM's vocabulary to the domain with a small amount of training and yields a significant reduction in inference (and fine-tuning) time. As part of our contribution, we created a task-based dataset of domain-specific tasks, drawn from domains whose terms are difficult to tokenize efficiently with standard tokenization techniques. This dataset highlights the inefficiencies of conventional tokenizers on such terms and will be used to compare the performance of various models and methods in this field.
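The vocabulary-adaptation idea described above can be illustrated with a minimal sketch. This is not the AdaptiVocab implementation; it is a toy Python example, with hypothetical function and parameter names (`adapt_vocab`, `min_token_freq`, `top_ngrams`), that shows the two ingredients the abstract mentions: dropping tokens that are rare in the domain corpus and adding frequent n-grams as new single vocabulary entries.

```python
from collections import Counter

def adapt_vocab(corpus_tokens, vocab, min_token_freq=2, top_ngrams=3, n=2):
    """Toy sketch of domain vocabulary adaptation (hypothetical API):
    remove tokens rare in the domain corpus, add frequent n-grams."""
    freq = Counter(corpus_tokens)
    # Keep only vocabulary tokens that appear often enough in the domain corpus
    kept = {t for t in vocab if freq[t] >= min_token_freq}
    # Count contiguous n-grams over the domain corpus
    ngrams = Counter(
        tuple(corpus_tokens[i:i + n]) for i in range(len(corpus_tokens) - n + 1)
    )
    # Add the most frequent n-grams as new single-token vocabulary entries
    added = {"_".join(g) for g, _ in ngrams.most_common(top_ngrams)}
    return kept | added

# Toy domain corpus: "ion thruster" recurs, so it becomes one vocabulary entry,
# while "rocket" never occurs in the domain and is removed.
tokens = "ion thruster ion thruster plasma ion thruster exhaust".split()
vocab = {"ion", "thruster", "plasma", "exhaust", "rocket"}
new_vocab = adapt_vocab(tokens, vocab, min_token_freq=1, top_ngrams=1)
# → new_vocab contains "ion_thruster" but not "rocket"
```

A real system would also need to initialize embeddings for the newly added n-gram tokens and briefly fine-tune the model, which this sketch does not cover.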