A guide to model quantization in fine-tuning (and how to pick the right GGUF)
About this post
Fine-tuning with Unsloth and Axolotl is, on the whole, a well thought-out experience where a lot of the complexity is handled for you. However, one area that consistently trips people up is quantization, specifically which quant to pick when you export and save a newly trained model.
The aim of this post is to give you a simple mental model you can reuse when you have to make a quantization decision, along with a short overview of the common quant methods in Unsloth and how “Hugging Face models”, merged 16-bit checkpoints, and GGUF exports relate to each other.
While I'm using Unsloth in many of the examples, this logic applies to any LoRA fine-tune (Axolotl, TRL, etc).
Only here for the quants?
The next few sections are a relatively low-level explanation of quantization. If you are just looking for details on the available quants and how to pick one then jump to the Common GGUF quantization options section or Putting it all together section.
Note on Server-Side Deployment (vLLM/TGI)
This guide focuses on GGUF, which is the standard for local deployment (Ollama, LM Studio, Laptops).
If you are deploying to high-end NVIDIA GPUs via vLLM, you typically have two choices:
- FP16: The merged checkpoint discussed below. Best quality, highest VRAM usage.
- AWQ/GPTQ: Specialised formats for NVIDIA GPUs. If you need these, the "Mental Model" section below still applies (Training → Merged → Quant), but you will swap GGUF for AWQ in the final step.
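For the FP16 option, serving the merged checkpoint with vLLM looks roughly like this. This is a minimal sketch, not a production config: the model path is a placeholder and defaults vary between vLLM versions.

```python
# Minimal sketch: serving a merged FP16 checkpoint with vLLM.
# "outputs/merged-fp16" is a placeholder for your merged folder or Hub repo id.
from vllm import LLM, SamplingParams

llm = LLM(model="outputs/merged-fp16", dtype="float16")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarise what quantization does."], params)
print(outputs[0].outputs[0].text)
```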
What is quantization?
Quantization is the act of compressing model weights to fewer bits to save memory and bandwidth, at some quality cost. In practice, you take weights that are stored in high-precision floats, for example 16 or 32-bit, and map them into a much smaller set of allowed values, for example 8, 6, or 4-bit integers. Think of it like rounding from microseconds to milliseconds. You keep the overall timing but lose the tiny details that don't often matter.
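As a toy illustration (plain NumPy, not any specific GGUF scheme), symmetric 8-bit quantization maps each float onto an integer grid via a single scale factor and reconstructs an approximation at load time:

```python
import numpy as np

# A handful of FP32 "weights".
w = np.array([0.12, -0.53, 0.91, -0.07, 0.33], dtype=np.float32)

# Symmetric 8-bit quantization: one scale maps floats onto integers in [-127, 127].
scale = np.abs(w).max() / 127
q = np.round(w / scale).astype(np.int8)   # what gets stored (8 bits per weight)
w_hat = q.astype(np.float32) * scale      # what the runtime reconstructs

print(q)                         # e.g. [ 17 -74 127 -10  46]
print(np.abs(w - w_hat).max())   # tiny rounding error: the "lost microseconds"
```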
What quantization changes under the hood
Quantization reduces the precision of the model weights so the runtime can load fewer bytes and perform cheaper operations. A GGUF file is essentially the same set of tensors the model was trained with, but each tensor has been compressed into a lower-bit representation with additional scaling metadata so the runtime can reconstruct approximate values at inference time.
Weight quantization
Transformer layers contain large FP16 (16-bit floating point) matrices. Quantization converts each matrix into:
- Low-bit integer blocks (e.g. 4-bit, 6-bit, 8-bit values)
- Per-row or per-channel scale factors (tiny volume knobs that rescale each row or channel back to the right loudness after compression) that let the runtime map integers back to approximate floating-point ranges.
- Optional clustering codebooks (the "K" schemes) that store centroids and index codes, improving fidelity at lower bit-rates.
Lower bit-rates shrink the file and increase throughput, but introduce quantization noise - small distortions in the weight matrices. Bigger models absorb this noise more gracefully whereas small models lose capacity quickly.
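Here is a rough NumPy sketch of that block-plus-scale idea (purely illustrative, not the real GGUF k-quant layout): each block of weights gets its own scale, and the reconstruction error grows as the bit width shrinks.

```python
import numpy as np

def blockwise_quantize(weights, bits=4, block_size=32):
    """Quantize a 1D weight array block by block, one scale per block (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1                             # 127 for 8-bit, 31 for 6-bit, 7 for 4-bit
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    q = np.round(blocks / scales)                          # low-bit integer codes
    return (q * scales).reshape(-1)                        # dequantized approximation

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)      # a fake weight row

for bits in (8, 6, 4):
    err = np.abs(w - blockwise_quantize(w, bits=bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.6f}")          # noise grows as bits shrink
```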
What quantization does not change
- Only the weights are quantized.
- Activations (the temporary values a model produces while thinking, like notes written on scrap paper during a calculation) and KV caches remain FP16/FP32 unless the runtime adds separate KV-cache quantization (a different optimisation entirely).
- The model's architecture, vocabulary, context window, and training data remain unchanged.
Quantization is therefore a deployment-time tradeoff: we accept approximation error in return for lower memory, lower latency, and cheaper inference.
A note on KV caches
During inference, transformer models build a key–value cache: a running memory of all previous tokens. This cache is stored at full precision (FP16/FP32) in almost every runtime because it changes on every new token. Think of it as the model’s scratchpad while it generates text. Quantization in GGUF does not reduce the size of this cache, because only the static model weights are compressed. Some runtimes offer separate KV-cache quantization, which trades a small drop in generation quality for lower memory use and faster long-sequence decoding, but this is independent of weight quantization.
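To get a feel for the numbers, here is a back-of-the-envelope cache size estimate. It assumes an FP16 cache and a grouped-query-attention layout; the config values below are roughly those of a Llama-3-8B-class model and are only an example.

```python
# Rough KV-cache size estimate: 2 (K and V) x layers x KV heads x head_dim x tokens x bytes.
def kv_cache_gb(layers, kv_heads, head_dim, context_len, batch=1, bytes_per_value=2):
    total_bytes = 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per_value
    return total_bytes / 1024**3

# Example: 32 layers, 8 KV heads, head_dim 128, 8k context, batch 1.
print(f"{kv_cache_gb(32, 8, 128, 8192):.2f} GB")   # ~1 GB of cache on top of the weights
```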
Quantization in Unsloth
When exporting or saving models with Unsloth, you will encounter several seemingly unrelated options:
- 4-bit QLoRA
- save_pretrained_merged(..., "merged_16bit")
- GGUF quant methods such as q8_0 or q4_k_m
These sit at different points in the model lifecycle and are easy to confuse. Unsloth intentionally keeps the underlying theory out of scope, which makes the surface area appear more abstract than it is.
The rest of this post untangles these formats, explains how they relate, and provides a simple mental model for choosing the right quant depending on model size, hardware, and workload.
A simple mental model for model formats
You can think in three stages when you're training and fine-tuning a model:
Training -> merged Hugging Face checkpoint -> deployment quant
Training
During the training phase, we are not choosing our deployment quants, but we are choosing how the base model is loaded into memory. In practice this means choosing the data type (dtype) for the weights, for example 4-bit, 8-bit or 16-bit.
In Unsloth this is configured with the load_in_4bit and load_in_8bit options:
```python
# Loading a model in Unsloth
from unsloth import FastModel
import torch

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-12b-it",
    max_seq_length = 4096,
    load_in_4bit = True,
    load_in_8bit = False,
)
```
Both of these options control how the base weights are represented in GPU memory during training.
load_in_4bit
This loads the base model in 4-bit quantized format (QLoRA style). This cuts VRAM use significantly, so you can fine-tune larger models on GPUs with less VRAM, while keeping the base weights frozen and only training LoRA adapters.
load_in_8bit
This loads the base model in 8-bit. This uses more VRAM than 4-bit, but has higher precision and can be a good choice if you have enough memory.
Which one do I choose?
You normally pick one of these, or neither (both False) if you want the base model to be loaded in FP16 (Floating-point 16). The choice only affects training-time memory and compute, not the final exported and saved model.
Later, when you call the following code, Unsloth writes out a standard merged FP16 Hugging Face model - think of this as the base model with the LoRA adapters baked in.
```python
model.save_pretrained_merged(
    merged_dir,
    tokenizer,
    save_method="merged_16bit",
)
```
Any GGUF exports you create afterwards are derived from that merged FP16 checkpoint, not from the 4-bit or 8-bit training representation.
An example of this would be:
Load in 8-bit representation of base model for training -> save merged FP16 checkpoint (base + LoRA) -> export to GGUF with 4-bit quant to get a smaller deployment representation of the trained model
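In Unsloth, that flow is two calls once training finishes. This is a sketch built from the save_pretrained_merged call shown above and Unsloth's save_pretrained_gguf export helper (mentioned again in the advanced section at the end); directory names are placeholders.

```python
# Sketch of the flow: training dtype -> merged FP16 checkpoint -> GGUF deployment quant.
# Assumes `model` and `tokenizer` are the objects from the training code earlier.

# 1. Bake the LoRA adapters into the base weights and save an FP16 checkpoint.
model.save_pretrained_merged(
    "outputs/merged-fp16",          # placeholder directory
    tokenizer,
    save_method="merged_16bit",
)

# 2. Derive a smaller deployment quant from that checkpoint.
model.save_pretrained_gguf(
    "outputs/gguf",                 # placeholder directory
    tokenizer,
    quantization_method="q4_k_m",
)
```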
Hugging Face model and merged_16bit
You can think of a Hugging Face model as a folder (or Hub repo) that contains at least:
- config.json – architecture and hyperparameters
- weights – usually one or more *.safetensors files
- tokenizer files – tokenizer.json, tokenizer_config.json, special_tokens_map.json, etc.
When you call from_pretrained, the following roughly happens:
- Resolve the model name or Hub URL to a folder, read config.json to decide which transformers class to instantiate (for example LlamaForCausalLM) and with which sizes, then create an empty model of that shape.
- Load the weights from the *.safetensors (or pytorch_model.bin) files into that model.
- Load the tokenizer from the tokenizer files.
After that, you have an in-memory model and tokenizer that you can train or run inference with.
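In plain transformers code, those steps look roughly like this (a minimal sketch; the folder name is a placeholder for any Hugging Face model directory or Hub repo id):

```python
# Minimal sketch of the from_pretrained flow described above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "outputs/merged-fp16"                         # placeholder path or repo id

model = AutoModelForCausalLM.from_pretrained(model_dir)   # reads config.json + *.safetensors
tokenizer = AutoTokenizer.from_pretrained(model_dir)      # reads the tokenizer files

inputs = tokenizer("Hello", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```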
When you call the following, Unsloth does a couple of extra steps before writing that folder.
```python
model.save_pretrained_merged(
    merged_dir,
    tokenizer,
    save_method="merged_16bit",
)
```
- Take the base model weights and apply the LoRA adapters from your fine-tune to those weights (merge the deltas into the base).
- Save the merged weights to disk in 16-bit floating point (FP16), alongside the config and tokenizer files, as a standard Hugging Face model directory.
So you can think of merged_16bit as:
base model + LoRA adapters → one FP16 checkpoint in standard Hugging Face format.
Treat this merged FP16 checkpoint as your "canonical high-fidelity checkpoint". All later actions you take on the model, such as GGUF exports, further fine-tunes, and comparisons with quantized versions, should conceptually start from this FP16 checkpoint. If you ever regret a quant choice, you come back to this artefact and quantize again.
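For example, if a q4_k_m export turns out to be too lossy, you reload the merged folder and export a different quant. This is a sketch using the same Unsloth calls as above; the paths are placeholders.

```python
# Sketch: re-quantize from the canonical FP16 checkpoint, not from an earlier GGUF.
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained("outputs/merged-fp16")   # placeholder path

model.save_pretrained_gguf(
    "outputs/gguf-q6_k",
    tokenizer,
    quantization_method="q6_k",
)
```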
Quick recap
- The training data type (dtype) is an implementation and configuration detail.
- The merged FP16 checkpoint is the artefact that all future actions branch from.
- The deployment quant is how you choose to save and export your model for serving.
Common GGUF quantization options
Here are the common GGUF quant options and when it makes sense to use them.
| Quant | Memory usage | Quality vs FP16 | Typical use |
|---|---|---|---|
| q8_0 | High | Very close / “safe” | High-quality reference, evals, when VRAM ok |
| q6_k | Medium | Close | Balanced default for 7B–14B on decent GPUs |
| q5_k_m | Low | Slight drop | When VRAM is tight but quality still matters |
| q4_k_m | Very low | Noticeable drop | Helpers, CPUs, very tight VRAM, laptop demos |
q8_0
q8_0 is a near-lossless, 8-bit quant very close to FP16. Use this as a reference export for evals and a safe choice when you are unsure and have enough VRAM (we will cover what enough is in a later section).
q6_k
q6_k is a 6-bit quant method that compresses the model more than q8_0, while usually staying close in quality. The “k” just refers to a particular blockwise scheme used in GGUF and is beyond the scope of this post.
In practice, q6_k is a good balanced default when you are serving 7B–14B models on 24–48 GB GPUs. You save a decent amount of VRAM compared to q8_0, but for most chat and assistant workloads the behaviour is still very similar to your FP16 / q8_0 reference.
q5_k_m
q5_k_m is a 5-bit quant method that pushes compression a bit further than q6_k. It uses a similar blockwise scheme and typically trades a small amount of quality for lower VRAM and higher throughput.
In practice, q5_k_m is a good option when you are hosting 7B–14B models and are a bit tighter on VRAM or cost. For many chat and assistant-style workloads you will not notice much difference from q6_k, but you free up more memory for longer context windows, larger batches, or extra models on the same GPU.
q4_k_m
q4_k_m is a 4-bit quant method and one of the more aggressive options you will see. It gives you a big reduction in model size compared to q8_0 and q6_k, but with a more noticeable quality drop, especially on harder reasoning and code generation tasks.
In practice, q4_k_m is best used when hardware is the main constraint: CPU-only deployments, small GPUs, or situations where you need to squeeze a larger model onto limited VRAM. It can work very well for helper models (routers, classifiers, retrieval helpers) and lightweight chat, but for small base models (for example 1B–3B) or safety-critical assistants it is usually worth starting with a less aggressive quant like q6_k or q8_0.
Other options
There are many more GGUF quant names in the wild (fast_quantized, q3_k_m, iq2_xxs, etc). Most of them are either presets that pick one of the schemes above for you, or more extreme 2–3 bit formats for very constrained hardware.
If you are just getting started, you can safely ignore them and focus on the four we've covered.
How to choose a GGUF quant
Step 0: check model size
Model size matters when choosing a quant. Smaller models are generally less capable, so heavily compressing them increases the risk of a noticeable quality drop. They also need far less memory and compute, so you can usually afford a stronger, less aggressive quant as your default.
Here are some general guidelines:
≤ 3B
Prefer q8_0 or q6_k as your main export. Only use q4_k_m for very simple classifiers or routers, and only after checking a small eval set.
7–8B
Default to q6_k, but experiment with q5_k_m if memory is tight and your evals still look good.
≥ 12–13B
Choose between q6_k and q5_k_m based on VRAM and cost. q4_k_m is acceptable for non-critical workloads or helper models when you need to squeeze into limited hardware.
Does FP16 vs GGUF affect this?
Not really. The size rules above are about how aggressively you quantize a given base model, based on its parameter count. Your merged FP16 Hugging Face checkpoint is the “truth” model; a GGUF quant is just a compressed version of that checkpoint. The model-size decision is about bits-per-weight × number of parameters, regardless of whether those weights live in a HF folder or a GGUF file.
What about MoE models?
For MoE (Mixture of Experts) models, be conservative, as they are a trap for VRAM planning. They are sparse in compute but dense in VRAM.
While they run fast (they only use a fraction of their parameters per token), you must still load every single expert into GPU memory for them to run.
Ignore the active parameter count and size them by their total parameter count.
For quantization, it's important to remember that MoEs are already sparse (a form of compression in themselves), so they can be more sensitive to aggressive quantization. Avoid q4_k_m and prefer q5_k_m or above to ensure the experts chosen for each token retain a useful level of precision.
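As a quick example (rough, nominal bits per weight, ignoring quant metadata): a Mixtral-8x7B-style MoE has roughly 47B total parameters but only about 13B active per token, and it is the 47B figure that sets your VRAM budget.

```python
# Rough MoE sizing: plan VRAM by total parameters, not active parameters.
total_params = 47e9    # ~Mixtral-8x7B total parameter count (approximate)
active_params = 13e9   # ~parameters active per token (approximate) - irrelevant for sizing

for name, bits in [("q8_0", 8), ("q5_k_m", 5), ("q4_k_m", 4)]:
    weight_gb = total_params * bits / 8 / 1e9
    print(f"{name}: ~{weight_gb:.0f} GB of weights")
```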
Thinking vs instruct models
For "thinking" models or reasoning-focused models that generate longer internal chains of thought, treat them as complex, high-error-cost tasks even if they are not customer-facing. Long reasoning chains tend to amplify small numeric differences, so it is safer to stay with q6_k or q8_0. Short instruct-style answers, classification and routing are usually more tolerant of q5_k_m or q4_k_m.
Step 1: check your hardware
Next, sanity-check what your hardware can actually hold.
As a rough guide for weights only:
- 7B in FP16 ≈ 14 GB, in q8_0 ≈ 7 GB, in q4_k_m ≈ 3.5 GB
- 12B in FP16 ≈ 24 GB, in q8_0 ≈ 12 GB, in q4_k_m ≈ 6 GB
On top of the weights you also need VRAM for KV cache (context length × batch size × layers) and activations, so keep at least 30–50% headroom.
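If you want to sanity-check a specific model and card, a rough estimate from parameter count and nominal bits per weight gets you close enough (a sketch; real GGUF files are somewhat larger because of scales and metadata):

```python
# Rough fit check: nominal weight size plus ~40% headroom for KV cache and activations.
def fit_check(params_billion, bits, vram_gb, headroom=0.4):
    weight_gb = params_billion * bits / 8          # e.g. 12B at 8-bit -> ~12 GB
    return weight_gb, weight_gb * (1 + headroom) <= vram_gb

for bits in (16, 8, 6, 5, 4):
    weight_gb, ok = fit_check(params_billion=12, bits=bits, vram_gb=24)
    print(f"12B @ {bits}-bit: ~{weight_gb:.1f} GB weights -> {'fits' if ok else 'too tight'} on 24 GB")
```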
With that in mind:
24–48 GB GPU
- For 7B–14B models, q6_k is a very comfortable default.
- Keep one q8_0 export around as a high-fidelity reference.
- Use q5_k_m if you want longer context or multiple models on the same card.
8–16 GB GPU or mixed CPU/GPU
- Use q6_k for the main chat model if it fits with some margin; otherwise drop to q5_k_m.
- Use q4_k_m / q5_k_m for helper models (routers, classifiers, retrieval helpers).
CPU-only / edge / laptop
- You will almost always want q4_k_m or more aggressive quants here, especially for anything above 3B.
- Reserve q6_k / q8_0 for very small models or offline batch jobs where latency is not critical.
Step 2: check the task
The best approach to identifying the right quant for your task is to work back from the following:
- How expensive are my mistakes?
- How hard is the task?
The general rules to apply here are:
High error cost or complex tasks
Use q6_k or q8_0 even on bigger models.
Examples: regulated industries, safety-sensitive or customer-facing assistants, long-form reasoning, code generation, multi-step flows.
Low error cost and simple tasks
q4_k_m / q5_k_m are usually fine.
Examples: retrieval strategy classifiers, router models, intent classifiers.
Putting it all together
When you are staring at a list of GGUF quants, you do not need to remember every detail from this post. You can usually get to a sensible choice by running this checklist:
Check model size
- ≤ 3B: start from q8_0 or q6_k.
- 7–8B: start from q6_k, consider q5_k_m if needed.
- ≥ 12–13B (or MoE): choose between q6_k and q5_k_m, keep q4_k_m for helpers and low-risk use.
Check your hardware
- 24–48 GB GPU: q6_k as default, keep one q8_0 export.
- 8–16 GB GPU: q6_k if it fits, otherwise q5_k_m; use q4_k_m for helpers.
- CPU / edge / laptop: usually q4_k_m or more aggressive, especially above 3B.
Check the task
- High error cost or complex reasoning (regulated chat, code, agents, "thinking" models): favour q6_k or q8_0.
- Low error cost and simple tasks (routers, intent classifiers, retrieval helpers): q4_k_m / q5_k_m are usually fine.
In practice, it helps to export two variants the first time you deploy a new model, for example:
```python
# Example export loop
# Ensure you define hf_token or login via huggingface-cli
quant_methods = ["q8_0", "q6_k"]

for quant in quant_methods:
    print(f"Exporting {quant}...")
    model.push_to_hub_gguf(
        f"{base_repo_id}-{quant}",
        tokenizer,
        quantization_method=quant,
        token=hf_token,
    )
```
Run a small golden set of prompts through both and pick the cheapest quant that hits your quality bar. Keep the q8_0 (or merged FP16) around as your high-fidelity reference, and use the more compressed quant for day-to-day serving.
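If you run your GGUFs locally, llama-cpp-python is one convenient way to do this comparison. The sketch below assumes you have both files on disk; the file names and prompts are placeholders for your own golden set.

```python
# Sketch: run a small golden set through two quants and compare the outputs by hand.
from llama_cpp import Llama

golden_prompts = [
    "Summarise our refund policy in two sentences.",                  # placeholder prompts -
    "Classify this ticket as billing, technical, or other: ...",      # swap in your own golden set
]

for gguf_path in ["model-q8_0.gguf", "model-q6_k.gguf"]:              # placeholder file names
    llm = Llama(model_path=gguf_path, n_ctx=4096, verbose=False)
    print(f"=== {gguf_path} ===")
    for prompt in golden_prompts:
        out = llm(prompt, max_tokens=128, temperature=0.0)            # keep sampling deterministic
        print(out["choices"][0]["text"].strip())
```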
Over time you will build your own defaults per project, but this simple three-step loop should be enough to stop "which quant do I pick?" from blocking you every time you export a model.
Advanced: What about QAT and Dynamic Quants?
You might see terms like QAT (Quantization Aware Training) or Dynamic GGUFs in the Unsloth docs. Here is how they fit into this mental model:
Dynamic GGUFs (The new standard)
When you run model.save_pretrained_gguf, Unsloth now defaults to "Dynamic" quantization. This doesn't change your workflow. It just means the "Quant" step is smarter—it keeps sensitive layers in higher precision (e.g., 6-bit) and drops robust layers to lower precision (e.g., 4-bit), giving you better quality at the same file size.
QAT (The "Pro" lane)
This is a specific training technique where you simulate quantization errors during training so the model learns to adapt to them. It effectively merges the "Training" and "Quant" steps. This is powerful (recovering up to 70% of lost accuracy) but requires a different training setup using torchao. For now, treat this as a specialised tool for when standard quants aren't good enough.