Towards Uncertainty-Aware Low-Bit Quantized LLMs for On-Device Inference

dc.contributor.authorSparrenberg, Lorenz
dc.contributor.authorSchneider, Tobias
dc.contributor.authorDeußer, Tobias
dc.contributor.authorBerger, Armin
dc.contributor.authorSifa, Rafet
dc.date.accessioned2026-03-20T15:19:56Z
dc.date.available2026-03-20T15:19:56Z
dc.date.issued06.03.2026
dc.identifier.urihttps://hdl.handle.net/20.500.11811/13993
dc.description.abstractQuantizing large language models (LLMs) significantly reduces memory usage and computational requirements, enabling efficient on-device inference. However, aggressive quantization can degrade model performance and exacerbate prediction uncertainty. To address this issue, we propose a logits-based calibration strategy in which the model is restricted to generating a single token from a limited set of predefined decision tokens. By applying a temperature-scaled softmax directly to the logits of these tokens, we obtain calibrated and interpretable probability distributions. Because this approach leverages deterministic logit values rather than stochastic methods such as top-k sampling, it reveals subtle behavioral shifts caused by quantization. Using Qwen-2.5 models ranging from 7B to 72B parameters at various quantization levels (2-, 4-, 6-, and 8-bit), we evaluate our method on four recently released benchmarks covering regression (README++, CompLex-ZH, GIRAI) and classification (DarkBench) tasks; their recency minimizes the risk that benchmark data leaked into pre-training corpora. Results indicate that moderate quantization (4-bit) is optimal, particularly when combined with minimal few-shot prompting, enabling quantized LLMs to closely match or surpass proprietary models such as GPT-4o and GPT-4.1 on certain tasks. Our open-source toolkit facilitates straightforward deployment of reliable, uncertainty-aware quantized LLMs for privacy-preserving, on-device inference, making them suitable for sensitive settings such as human-subject economic experiments and survey analysis.
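The calibration strategy described in the abstract can be illustrated with a minimal sketch: restrict attention to the logits of a predefined set of decision tokens and apply a temperature-scaled softmax over only those values, yielding a deterministic, interpretable distribution without any sampling. The function name, the toy logit vector, and the token indices below are hypothetical illustrations, not the authors' toolkit.

```python
import math

def calibrated_decision_probs(logits, decision_token_ids, temperature=1.0):
    """Temperature-scaled softmax over a restricted set of decision-token logits.

    `logits` is the full next-token logit vector (a list of floats) and
    `decision_token_ids` are the vocabulary indices of the predefined
    decision tokens (e.g. the tokens for "1".."5" on a rating scale).
    Only these logits enter the softmax, so the result is a probability
    distribution over the allowed answers, computed deterministically.
    """
    scaled = [logits[i] / temperature for i in decision_token_ids]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: a 100-entry logit vector where ids 10..14 are decision tokens.
vocab_logits = [0.0] * 100
vocab_logits[10:15] = [2.0, 1.0, 0.5, -1.0, 0.0]
probs = calibrated_decision_probs(vocab_logits, [10, 11, 12, 13, 14])
```

Raising the temperature flattens the distribution, which is how such calibration typically trades confidence against coverage; the logits themselves are read directly from the model head, so the result is reproducible across runs.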
dc.format.extent10
dc.language.isoeng
dc.rightsIn Copyright
dc.rights.urihttp://rightsstatements.org/vocab/InC/1.0/
dc.subjectlarge language models
dc.subjectLLM
dc.subjectquantization
dc.subjectregression
dc.subjectclassification
dc.subjectQwen
dc.subjectGPT
dc.subject.ddc004 Computer science
dc.titleTowards Uncertainty-Aware Low-Bit Quantized LLMs for On-Device Inference
dc.typeConference publication
dc.publisher.nameIEEE, Institute of Electrical and Electronics Engineers
dc.publisher.locationNew York, NY
dc.rights.accessRightsopenAccess
dc.relation.doihttps://doi.org/10.1109/BigData66926.2025.11400739
ulbbn.pubtypeSecondary publication
ulbbnediss.dissNotes.extern©2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ulbbn.relation.conference2025 IEEE International Conference on Big Data (BigData)
dc.versionacceptedVersion
