LLM inference acceleration via selective replacement of self-attention layers with linear minimum mean-squared error (LMMSE) approximations. An error bound based on canonical correlation analysis (CCA) identifies the layers that can be linearized with low error, without any fine-tuning. Achieves ~32% speed-up on Llama-8B with <1% accuracy loss. Compatible with quantization, LoRA fine-tuning, and speculative decoding.
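Since the summary above is terse, here is a minimal sketch of how the two core ideas might look in PyTorch. Everything in it is illustrative, not the project's actual API: the names `fit_lmmse`, `cca_error_score`, and `LMMSEAttention` are hypothetical, and the sketch assumes one plausible reading of the summary, namely that each replaced attention block's output is approximated as a per-token linear function of the block's input, fit on calibration activations.

```python
import torch

def fit_lmmse(X, Y, ridge=1e-4):
    """Fit a linear MMSE map Y ~= (X - mu_x) @ W + mu_y from calibration
    activations X (attention inputs) and Y (attention outputs), each of
    shape (num_tokens, hidden_dim). Ridge term added for stability."""
    mu_x, mu_y = X.mean(0), Y.mean(0)
    Xc, Yc = X - mu_x, Y - mu_y
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + ridge * torch.eye(X.shape[1], dtype=X.dtype)
    Cxy = Xc.T @ Yc / n
    W = torch.linalg.solve(Cxx, Cxy)  # W = Cxx^{-1} Cxy (LMMSE solution)
    return W, mu_x, mu_y

def cca_error_score(X, Y, ridge=1e-4):
    """Proxy for the relative LMMSE error of predicting Y from X.
    Under a Gaussian assumption the residual in each canonical direction
    is 1 - rho_i^2, where rho_i are the canonical correlations, so no
    linear map needs to be fit just to score a layer."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + ridge * torch.eye(X.shape[1], dtype=X.dtype)
    Cyy = Yc.T @ Yc / n + ridge * torch.eye(Y.shape[1], dtype=Y.dtype)
    Cxy = Xc.T @ Yc / n
    Lx = torch.linalg.cholesky(Cxx)
    Ly = torch.linalg.cholesky(Cyy)
    # Canonical correlations = singular values of Lx^{-1} Cxy Ly^{-T}.
    M = torch.linalg.solve_triangular(Lx, Cxy, upper=False)
    M = torch.linalg.solve_triangular(Ly, M.T, upper=False).T
    rho = torch.linalg.svdvals(M).clamp(max=1.0)
    return float(1.0 - (rho ** 2).mean())  # 0 = perfectly linear layer

class LMMSEAttention(torch.nn.Module):
    """Drop-in linear replacement for a self-attention block."""
    def __init__(self, W, mu_x, mu_y):
        super().__init__()
        self.register_buffer("W", W)
        self.register_buffer("mu_x", mu_x)
        self.register_buffer("mu_y", mu_y)

    def forward(self, x):  # x: (..., hidden_dim)
        return (x - self.mu_x) @ self.W + self.mu_y
```

The "selective" part would then amount to collecting `(X, Y)` per layer on calibration data, scoring each layer with `cca_error_score`, and swapping in `LMMSEAttention` only for layers below some error threshold; the project's actual selection criterion and threshold are not specified in this summary.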