Pretraining techniques
"We employ a novel date-attribution pipeline in order to obtain a temporally-segmented dataset of five 10-million-word slices.
"We train two corresponding five-model batteries over these corpus segments, efficient pretraining and Llama3-8B parameter efficiently finetuned.
"We find that the pretrained models are faster to train than the fine-tuned baselines and that they better respect the historical divisions of our corpus.
"Emphasizing speed and precision over ahistorical comprehensiveness enables a number of novel approaches to hypothesis discovery and testing in our target fields.
"Taking up diachronic linguistics as a testbed, we show that our method enables the detection of a diverse set of phenomena, including en masse lexical change, non-lexical (grammatical and morphological) change, and word sense introduction/obsolescence."