Towards Uncertainty-Aware Low-Bit Quantized LLMs for On-Device Inference
(2026-03-06)
Quantizing large language models (LLMs) significantly reduces memory usage and computational requirements, enabling efficient on-device inference. However, aggressive quantization can degrade model performance and exacerbate ...
Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp
(2025-11-24)
Large Language Models (LLMs) have demonstrated remarkable capabilities, but their significant computational and memory demands hinder widespread deployment, especially on resource-constrained devices. Quantization, the ...