FineWeb2-HQ

Multilingual data filtering tools for LLM pretraining using FastText and Transformer-MLP classifiers.

A toolset for multilingual data quality filtering that combines FastText language identification with Transformer-MLP quality classifiers. It supports embedding generation, model training, and automated dataset curation for the FineWeb2-HQ corpus, which covers 20 languages. The toolset accompanies an arXiv preprint on model-based data selection.
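
As a rough illustration of the classifier-based curation step described above, the sketch below scores document embeddings with a small MLP and keeps the highest-scoring fraction. It is not the project's code: the function names, the random weights (standing in for a trained classifier), the toy embeddings (standing in for Transformer encoder outputs), and the `keep_fraction` value are all assumptions made for the example.

```python
import math
import random


def mlp_score(embedding, w1, b1, w2, b2):
    """One-hidden-layer MLP: ReLU hidden units, sigmoid output in [0, 1]."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, embedding)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def select_top_fraction(doc_ids, scores, keep_fraction):
    """Keep the highest-scoring documents, preserving their original order."""
    n_keep = max(1, int(len(doc_ids) * keep_fraction))
    ranked = sorted(range(len(doc_ids)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [doc_ids[i] for i in range(len(doc_ids)) if i in kept]


rng = random.Random(0)
dim, hidden_dim, n_docs = 8, 4, 10

# Hypothetical stand-ins: random weights instead of a trained classifier,
# random vectors instead of real document embeddings.
w1 = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [rng.gauss(0, 1) for _ in range(hidden_dim)]
b2 = 0.0
embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_docs)]
doc_ids = [f"doc-{i}" for i in range(n_docs)]

scores = [mlp_score(e, w1, b1, w2, b2) for e in embeddings]
kept = select_top_fraction(doc_ids, scores, keep_fraction=0.1)  # assumed threshold
print(kept)
```

In a real pipeline the embeddings would come from a pretrained multilingual encoder and the MLP weights from supervised training; only the score-then-rank selection logic is meant to carry over.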

Large Language Model, Natural Language
Key facts
Maturity
Support
C4DT: Inactive
Lab: Unknown
  • Technical

Machine Learning and Optimization Laboratory
Prof. Martin Jaggi

The Machine Learning and Optimization Laboratory is interested in machine learning, optimization algorithms and text understanding, as well as several application domains.

This page was last edited on 2026-03-03.