A multi-turn hallucination benchmark that evaluates LLMs across diverse domains. It is installed via Pixi and provides scripts for response generation, judgment (claim-based with web scraping, or the coding_direct mode), and report creation. It supports multiple models and is configured from the CLI. The benchmark is designed to be as challenging as possible in order to elicit hallucinations.
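To illustrate how claim-level verdicts might be rolled up for the report step, here is a minimal sketch. It is not the benchmark's actual code: the `Claim` type, the `supported` flag, and the functions `hallucination_rate` and `summarize` are hypothetical names, and the per-turn rate (unsupported claims divided by total claims) is an assumed metric rather than one stated by the project.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """Hypothetical record for one extracted claim (not the project's schema)."""
    text: str
    supported: bool  # assumed verdict a judge assigns after checking evidence


def hallucination_rate(claims: list[Claim]) -> float:
    """Return the fraction of claims judged unsupported (0.0 if there are none)."""
    if not claims:
        return 0.0
    unsupported = sum(1 for c in claims if not c.supported)
    return unsupported / len(claims)


def summarize(turns: dict[int, list[Claim]]) -> dict[int, float]:
    """Map each conversation turn number to its assumed hallucination rate."""
    return {turn: hallucination_rate(cs) for turn, cs in sorted(turns.items())}
```

In a multi-turn setting, keeping the rate per turn rather than per conversation would make it possible to see whether unsupported claims accumulate in later turns.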
This page was last edited on 2026-03-03.