Towards Uncertainty-Aware Low-Bit Quantized LLMs for On-Device Inference

dc.contributor.authorSparrenberg, Lorenz
dc.contributor.authorSchneider, Tobias
dc.contributor.authorDeußer, Tobias
dc.contributor.authorBerger, Armin
dc.contributor.authorSifa, Rafet
dc.date.accessioned2026-03-20T15:19:56Z
dc.date.available2026-03-20T15:19:56Z
dc.date.issued06.03.2026
dc.identifier.urihttps://hdl.handle.net/20.500.11811/13993
dc.description.abstractQuantizing large language models (LLMs) significantly reduces memory usage and computational requirements, enabling efficient on-device inference. However, aggressive quantization can degrade model performance and exacerbate prediction uncertainty. To address this issue, we propose a logits-based calibration strategy in which the model is restricted to generating a single token from a limited set of predefined decision tokens. By applying a temperature-scaled softmax directly to the logits of these tokens, we obtain calibrated and interpretable probability distributions. Because this approach leverages deterministic logit values rather than stochastic methods such as top-k sampling, it reveals subtle behavioral shifts caused by quantization. Using Qwen-2.5 models ranging from 7B to 72B parameters at various quantization levels (2-, 4-, 6-, and 8-bit), we evaluate our method on four recently released benchmarks covering regression (README++, CompLex-ZH, GIRAI) and classification (DarkBench) tasks; their recency minimizes the risk that benchmark data leaked into pre-training corpora. Results indicate that moderate quantization (4-bit) is optimal, particularly when combined with minimal few-shot prompting, enabling quantized LLMs to closely match or surpass proprietary models such as GPT-4o and GPT-4.1 on certain tasks. Our open-source toolkit facilitates straightforward deployment of reliable, uncertainty-aware quantized LLMs for privacy-preserving, on-device inference, making them suitable for sensitive settings such as human-subject economic experiments and survey analysis.
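The calibration strategy described in the abstract can be illustrated with a minimal sketch: restrict attention to the logits of a predefined set of decision tokens and apply a temperature-scaled softmax over only those values, yielding a deterministic, interpretable distribution without any sampling. The function name, the toy logit vector, and the token indices below are hypothetical illustrations, not the authors' toolkit.

```python
import math

def calibrated_decision_probs(logits, decision_token_ids, temperature=1.0):
    """Temperature-scaled softmax over a restricted set of decision-token logits.

    `logits` is the full next-token logit vector (a list of floats) and
    `decision_token_ids` are the vocabulary indices of the predefined
    decision tokens (e.g. the tokens for "1".."5" on a rating scale).
    Only these logits enter the softmax, so the result is a probability
    distribution over the allowed answers, computed deterministically.
    """
    scaled = [logits[i] / temperature for i in decision_token_ids]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: a 100-entry logit vector where ids 10..14 are decision tokens.
vocab_logits = [0.0] * 100
vocab_logits[10:15] = [2.0, 1.0, 0.5, -1.0, 0.0]
probs = calibrated_decision_probs(vocab_logits, [10, 11, 12, 13, 14])
```

Raising the temperature flattens the distribution, which is how such calibration typically trades confidence against coverage; the logits themselves are read directly from the model head, so the result is reproducible across runs.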
dc.format.extent10
dc.language.isoeng
dc.rightsIn Copyright
dc.rights.urihttp://rightsstatements.org/vocab/InC/1.0/
dc.subjectlarge language models
dc.subjectLLM
dc.subjectquantization
dc.subjectregression
dc.subjectclassification
dc.subjectQwen
dc.subjectGPT
dc.subject.ddc004 Computer science
dc.titleTowards Uncertainty-Aware Low-Bit Quantized LLMs for On-Device Inference
dc.typeConference publication
dc.publisher.nameIEEE, Institute of Electrical and Electronics Engineers
dc.publisher.locationNew York, NY
dc.rights.accessRightsopenAccess
dc.relation.doihttps://doi.org/10.1109/BigData66926.2025.11400739
ulbbn.pubtypeSecondary publication
ulbbnediss.dissNotes.extern©2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ulbbn.relation.conference2025 IEEE International Conference on Big Data (BigData)
dc.versionacceptedVersion
