CRoW
Benchmarking Commonsense Reasoning in Real-World Tasks
CRoW is a benchmark that tests how well AI language models can apply common sense when performing six different language tasks. It is built by taking examples from existing datasets and modifying them in ways that violate common sense. The results show that these models are still far from matching human performance at applying common sense in real-world tasks.
CRoW is a manually curated, multi-task benchmark that evaluates models' ability to apply commonsense reasoning in the context of six real-world NLP tasks. It is constructed using a multi-stage data collection pipeline that rewrites examples from existing datasets with commonsense-violating perturbations. The study reveals a significant performance gap between NLP systems and humans on CRoW, indicating that commonsense reasoning is far from solved in real-world task settings.
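The description above outlines the evaluation setup: models see task examples, some of which have been perturbed to violate common sense, and must distinguish valid from invalid targets. The Python sketch below illustrates that structure under stated assumptions; the example data, field names, and the model_judges_valid stub are hypothetical illustrations, not the actual CRoW data format or pipeline.

# Minimal sketch of a CRoW-style evaluation loop. All names and data
# here are hypothetical illustrations of the benchmark's structure.

def model_judges_valid(context: str, target: str) -> bool:
    """Stand-in for a real NLP system; returns whether the target
    is judged plausible in the given context."""
    return "stone" not in target  # toy heuristic, for illustration only

# Each item pairs a task context with a target that either respects
# or violates common sense (the perturbation), plus a gold label.
examples = [
    {"context": "A: I'm starving after that hike.",
     "target": "B: Let's grab some lunch.", "valid": True},
    {"context": "A: I'm starving after that hike.",
     "target": "B: Let's grab some stones to eat.", "valid": False},  # perturbed
]

correct = sum(model_judges_valid(ex["context"], ex["target"]) == ex["valid"]
              for ex in examples)
print(f"Accuracy: {correct / len(examples):.2f}")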
Status: active
Entered showcase: 2024-02-20
Entry updated: 2024-02-20
This project has not yet been evaluated by the C4DT Factory team.
We will be happy to evaluate it upon request.
Toolset
Python