Toolset for multilingual data quality filtering using FastText language identification and Transformer-MLP quality classifiers. Supports embedding generation, model training, and automated dataset curation for the FineWeb2-HQ corpus covering 20 languages. Accompanies an arXiv preprint on model-based data selection.
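As a rough illustration of the scoring-and-selection step described above, the following sketch scores precomputed document embeddings with a small MLP and keeps the highest-scoring fraction. It is not the toolset's actual code: the function names (`mlp_score`, `select_top_fraction`), the single hidden layer, the random placeholder weights, and the `keep_fraction` value are all assumptions made for the example. A real pipeline would use embeddings from a Transformer encoder and weights learned during training, and would apply language identification beforehand.

```python
import math
import random


def mlp_score(embedding, w1, b1, w2, b2):
    """Score one embedding with a one-hidden-layer MLP (ReLU, sigmoid output)."""
    hidden = [
        max(0.0, sum(x * w for x, w in zip(embedding, row)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(h * w for h, w in zip(hidden, w2)) + b2
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def select_top_fraction(docs, scores, keep_fraction):
    """Return the docs with the highest scores, keeping at least one."""
    n_keep = max(1, int(len(docs) * keep_fraction))
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:n_keep]]


# Hypothetical demo data: 10 documents with 8-dimensional placeholder embeddings.
rng = random.Random(0)
dim, hidden_dim = 8, 4
docs = [f"doc_{i}" for i in range(10)]
embeddings = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in docs]

# Placeholder weights; a trained classifier would supply these.
w1 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [rng.uniform(-1, 1) for _ in range(hidden_dim)]
b2 = 0.0

scores = [mlp_score(e, w1, b1, w2, b2) for e in embeddings]
kept = select_top_fraction(docs, scores, keep_fraction=0.5)
```

Ranking by score and keeping a fixed fraction, rather than applying an absolute threshold, keeps the size of the selected subset predictable even when score distributions differ between languages; whether the toolset itself takes this approach is not stated here.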
This page was last edited on 2026-03-03.