Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp

| Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Sparrenberg, Lorenz | |
| dc.contributor.author | Deußer, Tobias | |
| dc.contributor.author | Berger, Armin | |
| dc.contributor.author | Sifa, Rafet | |
| dc.date.accessioned | 2025-12-15T16:50:13Z | |
| dc.date.available | 2025-12-15T16:50:13Z | |
| dc.date.issued | 2025-11-24 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.11811/13751 | |
| dc.description.abstract | Large Language Models (LLMs) have demonstrated remarkable capabilities, but their significant computational and memory demands hinder widespread deployment, especially on resource-constrained devices. Quantization, the process of reducing the numerical precision of model parameters, has emerged as a critical technique for compressing LLMs and accelerating inference. This paper provides an overview of LLM quantization, with a particular focus on the Post-Training Quantization (PTQ) methods implemented within the popular llama.cpp framework and its GGUF file format. We begin by covering quantization fundamentals, including the distinction between PTQ and Quantization-Aware Training (QAT). We then describe the specific PTQ schemes employed by llama.cpp, including legacy methods, advanced K-quants, and recent IQ-quants, along with their underlying mathematical principles. The paper also discusses the impact of these techniques on model fidelity, hardware requirements, and inference speed, and traces the adoption of GGUF as a de facto standard in the open-source community. This work serves as a practical guide and comprehensive reference for researchers aiming to deploy LLMs on resource-constrained hardware. By systematically documenting and comparing the PTQ methods within llama.cpp, we provide the necessary insights to navigate the trade-offs between model fidelity, inference speed, and memory footprint. This enables informed decision-making for real-world applications, from local CPU-based inference to efficient edge deployment. | en |
| dc.format.extent | 10 | |
| dc.language.iso | eng | |
| dc.rights | In Copyright | |
| dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | |
| dc.subject | Large Language Models | |
| dc.subject | LLM | |
| dc.subject | Quantization | |
| dc.subject | Model Compression | |
| dc.subject | Post-Training Quantization | |
| dc.subject | llama.cpp | |
| dc.subject | GGUF | |
| dc.subject | K-quants | |
| dc.subject | Inference Efficiency | |
| dc.subject.ddc | 004 Computer science | |
| dc.title | Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp | |
| dc.type | Conference publication | |
| dc.publisher.name | IEEE, Institute of Electrical and Electronics Engineers | |
| dc.publisher.location | New York, NY | |
| dc.rights.accessRights | openAccess | |
| dc.relation.doi | https://doi.org/10.1109/DSAA65442.2025.11247985 | |
| ulbbn.pubtype | Secondary publication | |
| ulbbnediss.dissNotes.extern | © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | |
| ulbbn.relation.conference | 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA) | |
| dc.version | publishedVersion | |