distilled counterfactual data

Automated generation of high-quality counterfactual data.

Antoine Bosselut — Natural Language Processing Lab

DISCO is a system that automatically creates counterfactual examples: slightly altered versions of existing data that help machine learning models learn more reliably. It uses a large language model to propose the alterations and keeps only the ones that pass a quality check. In tests, models trained on these examples performed better, especially on natural language inference, the task of deciding whether one sentence follows from another.

DISCO (DIStilled COunterfactual Data) is a method for automatically generating high-quality counterfactual data at scale. A large general-purpose language model first generates phrasal perturbations of existing examples; a task-specific teacher model then filters these candidates, so that only high-quality counterfactual data is retained. The method has been applied to natural language inference, where models trained on the generated data showed improved robustness and better generalization across distributions.
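
The generate-then-filter idea described above can be illustrated with a minimal sketch. This is not the project's code: the names `Candidate`, `filter_candidates`, and `toy_teacher`, the probability threshold, and the acceptance rule (keep a candidate when the teacher assigns its intended label enough probability) are all assumptions made for illustration. The generation step is replaced by a hand-written list of candidates, and the teacher is a keyword rule standing in for a trained NLI classifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    """A perturbed NLI example proposed by the generator (hypothetical type)."""
    premise: str
    hypothesis: str
    target_label: str  # label the perturbation is intended to produce


# Assumed teacher interface: (premise, hypothesis) -> label probabilities.
Teacher = Callable[[str, str], Dict[str, float]]


def filter_candidates(
    candidates: List[Candidate],
    teacher: Teacher,
    original_label: str,
    threshold: float = 0.5,
) -> List[Candidate]:
    """Keep candidates whose intended label differs from the original label
    and receives at least `threshold` probability from the teacher.
    The rule and the threshold value are illustrative assumptions."""
    kept = []
    for cand in candidates:
        if cand.target_label == original_label:
            continue  # not a counterfactual: the label did not change
        probs = teacher(cand.premise, cand.hypothesis)
        if probs.get(cand.target_label, 0.0) >= threshold:
            kept.append(cand)
    return kept


def toy_teacher(premise: str, hypothesis: str) -> Dict[str, float]:
    """Toy stand-in for a trained NLI model: a simple keyword rule."""
    if "not" in hypothesis.split():
        return {"entailment": 0.05, "neutral": 0.15, "contradiction": 0.80}
    return {"entailment": 0.70, "neutral": 0.25, "contradiction": 0.05}


# Hand-written candidates standing in for language-model perturbations.
premise = "A man is playing a guitar."
candidates = [
    Candidate(premise, "A man is not playing music.", "contradiction"),
    Candidate(premise, "A man is playing music loudly.", "neutral"),
    Candidate(premise, "A man is playing music.", "entailment"),
]

kept = filter_candidates(candidates, toy_teacher, original_label="entailment")
for cand in kept:
    print(cand.target_label, "|", cand.hypothesis)
```

In this toy run only the first candidate survives: the second is rejected because the stand-in teacher does not support its intended label, and the third is skipped because its label matches the original. In a real setting the teacher would be a trained task model and the candidates would come from a language model.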

inactive — entered showcase: 2024-02-20 — entry updated: 2024-02-20

Personal GitHub - last commit: 2023-07-27

This project has not yet been evaluated by the C4DT Factory team. We will be happy to evaluate it upon request.

Framework

Python

MIT