<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
<channel>
<title>Publikationen</title>
<link>https://hdl.handle.net/20.500.11811/1549</link>
<description/>
<pubDate>Fri, 10 Apr 2026 17:49:17 GMT</pubDate>
<dc:date>2026-04-10T17:49:17Z</dc:date>
<item>
<title>Towards Uncertainty-Aware Low-Bit Quantized LLMs for On-Device Inference</title>
<link>https://hdl.handle.net/20.500.11811/13993</link>
<description>Towards Uncertainty-Aware Low-Bit Quantized LLMs for On-Device Inference
Sparrenberg, Lorenz; Schneider, Tobias; Deußer, Tobias; Berger, Armin; Sifa, Rafet
Quantizing large language models (LLMs) significantly reduces memory usage and computational requirements, enabling efficient on-device inference. However, aggressive quantization can degrade model performance and exacerbate prediction uncertainty. To address this critical issue, we propose a logits-based calibration strategy in which the model is restricted to generating a single token from a limited set of predefined decision tokens. By applying a temperature-scaled softmax directly to the logits of these tokens, we obtain calibrated and interpretable probability distributions; because this leverages deterministic logit values rather than stochastic methods such as top-k sampling, it reveals subtle behavioral shifts caused by quantization. Using Qwen-2.5 models ranging from 7B to 72B parameters at various quantization levels (2-, 4-, 6-, and 8-bit), we evaluate our method across four recently released benchmarks encompassing regression (README++, CompLex-ZH, GIRAI) and classification (DarkBench) tasks; their recency minimizes the risk of leakage into the models' pre-training data. Results indicate that moderate quantization (4-bit) is optimal, particularly when combined with minimal few-shot prompting, enabling quantized LLMs to closely match or surpass proprietary models such as GPT-4o and GPT-4.1 on certain tasks. Our open-source toolkit facilitates straightforward deployment of reliable, uncertainty-aware quantized LLMs for privacy-preserving, on-device inference, making them suitable for sensitive settings such as human-subject economic experiments and survey analysis.
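A minimal sketch of the calibration step described above, assuming a Hugging Face-style causal LM interface; the model name, prompt, and decision tokens are illustrative and the paper's own toolkit may differ:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)

def calibrated_decision(prompt, decision_tokens, temperature=1.0):
    """Return a probability distribution over a fixed set of decision tokens.

    Rather than sampling (e.g. top-k), read the deterministic next-token
    logits, keep only those of the predefined decision tokens, and apply a
    temperature-scaled softmax over that restricted set.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    # Use the first sub-token of each decision token (single-token answers assumed).
    ids = [tokenizer.encode(t, add_special_tokens=False)[0] for t in decision_tokens]
    probs = torch.softmax(logits[ids] / temperature, dim=-1)
    return dict(zip(decision_tokens, probs.tolist()))

# Example: a binary decision with interpretable, calibrated probabilities.
print(calibrated_decision("Does the ad use manipulative framing? Answer yes or no: ", [" yes", " no"]))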
</description>
<pubDate>Fri, 06 Mar 2026 00:00:00 GMT</pubDate>
<guid isPermaLink="false">https://hdl.handle.net/20.500.11811/13993</guid>
<dc:date>2026-03-06T00:00:00Z</dc:date>
</item>
<item>
<title>Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp</title>
<link>https://hdl.handle.net/20.500.11811/13751</link>
<description>Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp
Sparrenberg, Lorenz; Deußer, Tobias; Berger, Armin; Sifa, Rafet
Large Language Models (LLMs) have demonstrated remarkable capabilities, but their significant computational and memory demands hinder widespread deployment, especially on resource-constrained devices. Quantization, the process of reducing the numerical precision of model parameters, has emerged as a critical technique for compressing LLMs and accelerating inference. This paper provides an overview of LLM quantization, with a particular focus on the Post-Training Quantization (PTQ) methods implemented within the popular llama.cpp framework and its GGUF file format. We begin by covering quantization fundamentals, including the distinction between PTQ and Quantization-Aware Training (QAT). We then describe the specific PTQ schemes employed by llama.cpp, including legacy methods, advanced K-quants, and recent IQ-quants, along with their underlying mathematical principles. The paper also discusses the impact of these techniques on model fidelity, hardware requirements, and inference speed, and traces the adoption of GGUF as a de facto standard in the open-source community. This work serves as a practical guide and comprehensive reference for researchers aiming to deploy LLMs on resource-constrained hardware. By systematically documenting and comparing the PTQ methods within llama.cpp, we provide the necessary insights to navigate the trade-offs between model fidelity, inference speed, and memory footprint. This enables informed decision-making for real-world applications, from local CPU-based inference to efficient edge deployment.
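A simplified sketch of the block-wise idea behind llama.cpp's legacy 4-bit scheme (Q4_0): weights are grouped into fixed-size blocks and each block stores low-bit integer codes plus one scale. The real format packs unsigned 4-bit codes and differs in storage details; this symmetric variant is for illustration only:

import numpy as np

BLOCK_SIZE = 32  # llama.cpp groups weights into blocks of 32

def quantize_blocks(weights):
    """Quantize a 1-D float array to signed 4-bit codes with one scale per block."""
    blocks = weights.reshape(-1, BLOCK_SIZE)
    # One scale per block, chosen so the largest magnitude maps into the 4-bit range.
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero in all-zero blocks
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_blocks(codes, scales):
    """Reconstruct approximate float weights from codes and block scales."""
    return (codes.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
codes, scales = quantize_blocks(w)
print("mean abs error:", np.mean(np.abs(w - dequantize_blocks(codes, scales))))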
</description>
<pubDate>Mon, 24 Nov 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">https://hdl.handle.net/20.500.11811/13751</guid>
<dc:date>2025-11-24T00:00:00Z</dc:date>
</item>
<item>
<title>SciREX: Scientific Relation Extraction</title>
<link>https://hdl.handle.net/20.500.11811/13574</link>
<description>SciREX: Scientific Relation Extraction
Karar, Sayanta; Altahan, Zyad; Aloradi, Sulaeman; Elshennawy, Abdelwahab
Labonté, Frederik
The rapid growth of biomedical literature makes it increasingly difficult to identify and organize meaningful knowledge. This project addresses the problem by focusing on relation extraction (RE), i.e., detecting and classifying semantic relationships between biomedical entities within scientific abstracts.
Our objective is to evaluate and compare multiple paradigms for scientific relation extraction on the BioRED dataset, a manually annotated benchmark of PubMed abstracts with diverse entities and relation types. Specifically, we investigate three complementary approaches: a classification-based model using BioBERT, a question-answering formulation (QA4RE), and, lastly, the generative models SciFive and REBEL.
The central research question guiding our study is: Which modeling paradigm offers the most effective and generalizable solution for biomedical relation extraction under the constraints of the BioRED dataset?
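A minimal sketch of the classification-based paradigm, assuming a BioBERT encoder with a sequence-classification head; the entity markers, label subset, and example are illustrative, and the head must first be fine-tuned on BioRED for the prediction to be meaningful:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "dmis-lab/biobert-base-cased-v1.2"
LABELS = ["None", "Association", "Positive_Correlation", "Negative_Correlation"]  # subset of BioRED relation types

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=len(LABELS))

def classify_relation(abstract, head, tail):
    """Classify the relation between an entity pair mentioned in an abstract."""
    # Mark the entity pair so the encoder can attend to it explicitly.
    text = abstract.replace(head, f"[E1] {head} [/E1]").replace(tail, f"[E2] {tail} [/E2]")
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    return LABELS[int(torch.argmax(logits))]

print(classify_relation("Metformin use was associated with a lower risk of hepatocellular carcinoma.",
                        "Metformin", "hepatocellular carcinoma"))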
</description>
<pubDate>Tue, 23 Sep 2025 00:00:00 GMT</pubDate>
<guid isPermaLink="false">https://hdl.handle.net/20.500.11811/13574</guid>
<dc:date>2025-09-23T00:00:00Z</dc:date>
</item>
</channel>
</rss>
