<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<title>Publikationen</title>
<link>https://hdl.handle.net/20.500.11811/1549</link>
<description/>
<pubDate>Fri, 10 Apr 2026 17:49:17 GMT</pubDate>
<dc:date>2026-04-10T17:49:17Z</dc:date>
<item>
<title>Towards Uncertainty-Aware Low-Bit Quantized LLMs for On-Device Inference</title>
<link>https://hdl.handle.net/20.500.11811/13993</link>
<description>Towards Uncertainty-Aware Low-Bit Quantized LLMs for On-Device Inference
Sparrenberg, Lorenz; Schneider, Tobias; Deußer, Tobias; Berger, Armin; Sifa, Rafet
Quantizing large language models (LLMs) significantly reduces memory usage and computational requirements, enabling efficient on-device inference. However, aggressive quantization can degrade model performance and exacerbate prediction uncertainty. To address this critical issue, we propose a logits-based calibration strategy in which the model is restricted to generating a single token from a limited set of predefined decision tokens. By applying a temperature-scaled softmax directly to the logits of these tokens, we obtain calibrated and interpretable probability distributions; because this leverages deterministic logit values rather than stochastic methods such as top-k sampling, it reveals subtle behavioral shifts caused by quantization. Using Qwen-2.5 models ranging from 7B to 72B parameters at various quantization levels (2-, 4-, 6-, and 8-bit), we evaluate our method across four recently released benchmarks encompassing regression (README++, CompLex-ZH, GIRAI) and classification (DarkBench) tasks; their recency minimizes the risk of leakage into the models' pre-training data. Results indicate that moderate quantization (4-bit) is optimal, particularly when combined with minimal few-shot prompting, enabling quantized LLMs to closely match or surpass proprietary models such as GPT-4o and GPT-4.1 on certain tasks. Our open-source toolkit facilitates straightforward deployment of reliable, uncertainty-aware quantized LLMs for privacy-preserving, on-device inference, making them suitable for sensitive settings such as human-subject economic experiments and survey analysis.
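A minimal sketch of the calibration step described above, assuming a Hugging Face-style causal LM interface; the model name, prompt, and decision tokens are illustrative and the paper's own toolkit may differ:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)

def calibrated_decision(prompt, decision_tokens, temperature=1.0):
    """Return a probability distribution over a fixed set of decision tokens.

    Rather than sampling (e.g. top-k), read the deterministic next-token
    logits, keep only those of the predefined decision tokens, and apply a
    temperature-scaled softmax over that restricted set.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    # Use the first sub-token of each decision token (single-token answers assumed).
    ids = [tokenizer.encode(t, add_special_tokens=False)[0] for t in decision_tokens]
    probs = torch.softmax(logits[ids] / temperature, dim=-1)
    return dict(zip(decision_tokens, probs.tolist()))

# Example: a binary decision with interpretable, calibrated probabilities.
print(calibrated_decision("Does the ad use manipulative framing? Answer yes or no: ", [" yes", " no"]))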
</description>
<pubDate>Fri, 06 Mar 2026 00:00:00 GMT</pubDate>
<guid isPermaLink="false">https://hdl.handle.net/20.500.11811/13993</guid>
<dc:date>2026-03-06T00:00:00Z</dc:date>
</item>
<item>
<title>Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp</title>
<link>https://hdl.handle.net/20.500.11811/13751</link>
<description>Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp
Sparrenberg, Lorenz; Deußer, Tobias; Berger, Armin; Sifa, Rafet
Large Language Models (LLMs) have demonstrated remarkable capabilities, but their significant computational and memory demands hinder widespread deployment, especially on resource-constrained devices. Quantization, the process of reducing the numerical precision of model parameters, has emerged as a critical technique for compressing LLMs and accelerating inference. This paper provides an overview of LLM quantization, with a particular focus on the Post-Training Quantization (PTQ) methods implemented within the popular llama.cpp framework and its GGUF file format. We begin by covering quantization fundamentals, including the distinction between PTQ and Quantization-Aware Training (QAT). We then describe the specific PTQ schemes employed by llama.cpp, including legacy methods, advanced K-quants, and recent IQ-quants, along with their underlying mathematical principles. The paper also discusses the impact of these techniques on model fidelity, hardware requirements, and inference speed, and traces the adoption of GGUF as a de facto standard in the open-source community. This work serves as a practical guide and comprehensive reference for researchers aiming to deploy LLMs on resource-constrained hardware. By systematically documenting and comparing the PTQ methods within llama.cpp, we provide the necessary insights to navigate the trade-offs between model fidelity, inference speed, and memory footprint. This enables informed decision-making for real-world applications, from local CPU-based inference to efficient edge deployment.
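A simplified sketch of the block-wise idea behind llama.cpp's legacy 4-bit scheme (Q4_0): weights are grouped into fixed-size blocks and each block stores low-bit integer codes plus one scale. The real format packs unsigned 4-bit codes and differs in storage details; this symmetric variant is for illustration only:

import numpy as np

BLOCK_SIZE = 32  # llama.cpp groups weights into blocks of 32

def quantize_blocks(weights):
    """Quantize a 1-D float array to signed 4-bit codes with one scale per block."""
    blocks = weights.reshape(-1, BLOCK_SIZE)
    # One scale per block, chosen so the largest magnitude maps into the 4-bit range.
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero in all-zero blocks
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_blocks(codes, scales):
    """Reconstruct approximate float weights from codes and block scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
codes, scales = quantize_blocks(w)
print("mean abs error:", np.mean(np.abs(w - dequantize_blocks(codes, scales))))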
</description>
<pubDate>Mon, 24 Nov 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">https://hdl.handle.net/20.500.11811/13751</guid>
<dc:date>2025-11-24T00:00:00Z</dc:date>
</item>
<item>
<title>SciREX: Scientific Relation Extraction</title>
<link>https://hdl.handle.net/20.500.11811/13574</link>
<description>SciREX: Scientific Relation Extraction
Karar, Sayanta; Altahan, Zyad; Aloradi, Sulaeman; Elshennawy, Abdelwahab
Labonté, Frederik
The rapid growth of biomedical literature makes it increasingly difficult to identify and organize meaningful knowledge. This project addresses the problem by focusing on relation extraction (RE), i.e., detecting and classifying semantic relationships between biomedical entities within scientific abstracts.
Our objective is to evaluate and compare multiple paradigms for scientific relation extraction on the BioRED dataset, a manually annotated benchmark of PubMed abstracts with diverse entities and relation types. Specifically, we investigate three complementary approaches: a classification-based model using BioBERT, a question-answering formulation (QA4RE), and, lastly, the generative models SciFive and REBEL.
The central research question guiding our study is: Which modeling paradigm offers the most effective and generalizable solution for biomedical relation extraction under the constraints of the BioRED dataset?
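A minimal sketch of the classification-based paradigm, assuming a BioBERT encoder with a sequence-classification head; the entity markers, label subset, and example are illustrative, and the head must first be fine-tuned on BioRED for the prediction to be meaningful:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "dmis-lab/biobert-base-cased-v1.2"
LABELS = ["None", "Association", "Positive_Correlation", "Negative_Correlation"]  # subset of BioRED relation types

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=len(LABELS))

def classify_relation(abstract, head, tail):
    """Classify the relation between an entity pair mentioned in an abstract."""
    # Mark the entity pair so the encoder can attend to it explicitly.
    text = abstract.replace(head, f"[E1] {head} [/E1]").replace(tail, f"[E2] {tail} [/E2]")
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    return LABELS[int(torch.argmax(logits))]

print(classify_relation("Metformin use was associated with a lower risk of hepatocellular carcinoma.",
                        "Metformin", "hepatocellular carcinoma"))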
</description>
<pubDate>Tue, 23 Sep 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">https://hdl.handle.net/20.500.11811/13574</guid>
<dc:date>2025-09-23T00:00:00Z</dc:date>
</item>
</channel>
</rss>
