Single-pass jailbreak detection classifier for LLMs, extending JailbreakBench. Trained on 4,000+ samples from the GCG, AutoDAN, and PAIR attack methods. Accompanies a TMLR 2025 paper. Provides training and evaluation scripts and a unified detection interface.
This page was last edited on 2026-03-03.