Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp

dc.contributor.author: Sparrenberg, Lorenz
dc.contributor.author: Deußer, Tobias
dc.contributor.author: Berger, Armin
dc.contributor.author: Sifa, Rafet
dc.date.accessioned: 2025-12-15T16:50:13Z
dc.date.available: 2025-12-15T16:50:13Z
dc.date.issued: 24.11.2025
dc.identifier.uri: https://hdl.handle.net/20.500.11811/13751
dc.description.abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities but their significant computational and memory demands hinder widespread deployment, especially on resource-constrained devices. Quantization, the process of reducing the numerical precision of model parameters, has emerged as a critical technique for compressing LLMs and accelerating inference. This paper provides an overview of LLM quantization, with a particular focus on the Post-Training Quantization (PTQ) methods implemented within the popular llama.cpp framework and its GGUF file format. We begin by covering quantization fundamentals, including the distinction between PTQ and Quantization-Aware Training (QAT). We then describe the specific PTQ schemes employed by llama.cpp, including legacy methods, advanced K-quants, and recent IQ-quants, along with their underlying mathematical principles. The paper also discusses the impact of these techniques on model fidelity, hardware requirements, and inference speed, and traces the adoption of GGUF as a de facto standard in the open-source community. This work serves as a practical guide and comprehensive reference for researchers aiming to deploy LLMs on resource-constrained hardware. By systematically documenting and comparing the PTQ methods within llama.cpp, we provide the necessary insights to navigate the trade-offs between model fidelity, inference speed, and memory footprint. This enables informed decision-making for real-world applications, from local CPU-based inference to efficient edge deployment. [en]
dc.format.extent: 10
dc.language.iso: eng
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Large Language Models
dc.subject: LLM
dc.subject: Quantization
dc.subject: Model Compression
dc.subject: Post-Training Quantization
dc.subject: llama.cpp
dc.subject: GGUF
dc.subject: K-quants
dc.subject: Inference Efficiency
dc.subject.ddc: 004 Computer science
dc.title: Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp
dc.type: Conference publication
dc.publisher.name: IEEE, Institute of Electrical and Electronics Engineers
dc.publisher.location: New York, NY
dc.rights.accessRights: openAccess
dc.relation.doi: https://doi.org/10.1109/DSAA65442.2025.11247985
ulbbn.pubtype: Secondary publication
ulbbnediss.dissNotes.extern: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ulbbn.relation.conference: 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA)
dc.version: publishedVersion


The following license files are associated with this item:

In Copyright