NBL

Neural Block Linearization (NBL) replaces self-attention layers with linear LMMSE approximations for faster LLM inference.

NBL accelerates LLM inference by selectively replacing self-attention layers with linear minimum mean-squared error (LMMSE) approximations. An approximation-error bound derived from canonical correlation analysis (CCA) identifies the layers that can be linearized with low error, so no fine-tuning is required. The method achieves a speed-up of roughly 32% on Llama-8B with under 1% accuracy loss, and it composes with quantization, LoRA fine-tuning, and speculative decoding.
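A minimal NumPy sketch of the two ingredients, assuming paired calibration activations X (attention-block inputs) and Y (block outputs) gathered by hooking the block on a small calibration set; the function names and the ridge term are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fit_lmmse(X, Y, ridge=1e-4):
    """Affine LMMSE map Y ~ X @ W + b from paired activations.

    X: (n, d_in) block inputs, Y: (n, d_out) block outputs.
    The ridge term is a hypothetical regularizer for stability.
    """
    mx, my = X.mean(0), Y.mean(0)
    Xc, Yc = X - mx, Y - my
    cov_xx = Xc.T @ Xc / len(X) + ridge * np.eye(X.shape[1])
    cov_xy = Xc.T @ Yc / len(X)
    W = np.linalg.solve(cov_xx, cov_xy)  # optimal linear map, (d_in, d_out)
    b = my - mx @ W                      # bias matches the output mean
    return W, b

def cca_error_score(X, Y, ridge=1e-4):
    """Average 1 - rho_i^2 over the canonical correlations of (X, Y).

    A small score means Y is close to a linear function of X, marking
    the block as a low-error candidate for linearization.
    """
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    cov_xx = Xc.T @ Xc / len(X) + ridge * np.eye(X.shape[1])
    cov_yy = Yc.T @ Yc / len(Y) + ridge * np.eye(Y.shape[1])
    cov_xy = Xc.T @ Yc / len(X)
    # Whiten both sides; the singular values of the whitened
    # cross-covariance are the canonical correlations.
    wx = np.linalg.inv(np.linalg.cholesky(cov_xx))
    wy = np.linalg.inv(np.linalg.cholesky(cov_yy))
    rho = np.linalg.svd(wx @ cov_xy @ wy.T, compute_uv=False)
    rho = np.clip(rho, 0.0, 1.0)
    return float(np.mean(1.0 - rho**2))

# Toy usage: a nearly linear "block" gets a near-zero score.
rng = np.random.default_rng(0)
X = rng.normal(size=(2048, 64))
Y = X @ rng.normal(size=(64, 64)) + 0.05 * rng.normal(size=(2048, 64))
W, b = fit_lmmse(X, Y)
print("CCA error score:", cca_error_score(X, Y))  # close to 0
print("fit MSE:", np.mean((X @ W + b - Y) ** 2))  # small residual
```

Under this reading, blocks whose CCA score falls below a threshold are selected for replacement, and the fitted affine map then stands in for the attention block at inference time.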

Deep Neural Networks • Large Language Model
Key facts
  • Maturity: Technical
  • Support: C4DT inactive, lab active

Laboratory for Information and Inference Systems

Prof. Volkan Cevher

At LIONS, we are concerned with optimized information extraction from signals or data volumes. We therefore develop mathematical theory and computational methods for information recovery from highly incomplete data.
