Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp

Sparrenberg, Lorenz; Deußer, Tobias; Berger, Armin; Sifa, Rafet

dc.contributor.author	Sparrenberg, Lorenz
dc.contributor.author	Deußer, Tobias
dc.contributor.author	Berger, Armin
dc.contributor.author	Sifa, Rafet
dc.date.accessioned	2025-12-15T16:50:13Z
dc.date.available	2025-12-15T16:50:13Z
dc.date.issued	2025-11-24
dc.identifier.uri	https://hdl.handle.net/20.500.11811/13751
dc.description.abstract	Large Language Models (LLMs) have demonstrated remarkable capabilities, but their significant computational and memory demands hinder widespread deployment, especially on resource-constrained devices. Quantization, the process of reducing the numerical precision of model parameters, has emerged as a critical technique for compressing LLMs and accelerating inference. This paper provides an overview of LLM quantization, with a particular focus on the Post-Training Quantization (PTQ) methods implemented within the popular llama.cpp framework and its GGUF file format. We begin by covering quantization fundamentals, including the distinction between PTQ and Quantization-Aware Training (QAT). We then describe the specific PTQ schemes employed by llama.cpp, including legacy methods, advanced K-quants, and recent IQ-quants, along with their underlying mathematical principles. The paper also discusses the impact of these techniques on model fidelity, hardware requirements, and inference speed, and traces the adoption of GGUF as a de facto standard in the open-source community. This work serves as a practical guide and comprehensive reference for researchers aiming to deploy LLMs on resource-constrained hardware. By systematically documenting and comparing the PTQ methods within llama.cpp, we provide the necessary insights to navigate the trade-offs between model fidelity, inference speed, and memory footprint. This enables informed decision-making for real-world applications, from local CPU-based inference to efficient edge deployment.	en
dc.format.extent	10
dc.language.iso	eng
dc.rights	In Copyright
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/
dc.subject	Large Language Models
dc.subject	LLM
dc.subject	Quantization
dc.subject	Model Compression
dc.subject	Post-Training Quantization
dc.subject	llama.cpp
dc.subject	GGUF
dc.subject	K-quants
dc.subject	Inference Efficiency
dc.subject.ddc	004 Computer science
dc.title	Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp
dc.type	Conference publication
dc.publisher.name	IEEE, Institute of Electrical and Electronics Engineers
dc.publisher.location	New York, NY
dc.rights.accessRights	openAccess
dc.relation.doi	https://doi.org/10.1109/DSAA65442.2025.11247985
ulbbn.pubtype	Secondary publication
ulbbnediss.dissNotes.extern	© 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ulbbn.relation.conference	2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA)
dc.version	publishedVersion

Files in this item

Name: 2025_LLM_Hardware.pdf
Size: 496.2 KB
Format: PDF

This item appears in the following Collection(s)

Publications (3)