TR2026-044

TTQ: ACTIVATION-AWARE TEST-TIME QUANTIZATION TO ACCELERATE LLM INFERENCE ON THE FLY


Abstract:

To tackle the huge computational demand of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, because these methods rely heavily on calibration data, domain shift issues may arise for unseen downstream tasks. To resolve this issue, we propose a test-time quantization (TTQ) framework that compresses large models on the fly at inference time. With efficient online calibration, instant activation-aware quantization can adapt to every prompt regardless of the downstream task while still achieving inference speedup. Several experiments demonstrate that TTQ can improve quantization performance over state-of-the-art baselines.
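The abstract does not spell out the TTQ algorithm, so the following is only a rough illustration of the general idea of activation-aware quantization calibrated on the current prompt, not the authors' method. It assumes a generic per-channel scaling scheme (input channels with larger activations are scaled up before round-to-nearest quantization, then scaled back). The function names, the exponent `alpha`, and the 4-bit setting are all hypothetical choices made for this sketch.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization per output row.

    Returns the dequantized (float) weights so the error can be inspected.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero for all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def activation_aware_quantize(w, x, bits=4, alpha=0.5):
    """Quantize w (out_features, in_features) using statistics of the prompt x.

    x has shape (tokens, in_features). The per-channel scale s is computed
    from x itself at inference time (an assumed stand-in for online
    calibration), applied before quantization, and folded back afterwards.
    """
    s = np.abs(x).mean(axis=0) ** alpha  # larger activations -> finer effective resolution
    s = np.maximum(s, 1e-8)              # guard against dead channels
    return quantize_rtn(w * s, bits) / s

# Hypothetical usage: compare output error of plain vs. activation-aware quantization.
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64))
x = rng.standard_normal((32, 64)) * np.linspace(0.1, 10.0, 64)  # uneven channel magnitudes
y_ref = x @ w.T
err_plain = np.linalg.norm(x @ quantize_rtn(w).T - y_ref) / np.linalg.norm(y_ref)
err_aware = np.linalg.norm(x @ activation_aware_quantize(w, x).T - y_ref) / np.linalg.norm(y_ref)
```

Because the scale `s` is derived from the prompt being processed rather than from a fixed calibration set, a sketch like this has no offline calibration data that could mismatch the downstream task. Whether it actually lowers the output error depends on the data, so `err_plain` and `err_aware` should be compared empirically.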

 

  • Related Publication

  •  Koike-Akino, T., Liu, J., Wang, Y., "TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly", arXiv, March 2026.
    @article{Koike-Akino2026mar,
      author = {Koike-Akino, Toshiaki and Liu, Jing and Wang, Ye},
      title = {{TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly}},
      journal = {arXiv},
      year = 2026,
      month = mar,
      url = {https://arxiv.org/abs/2603.19296}
    }