A guide to model quantization in fine-tuning (and how to pick the right GGUF)
About this post
Fine-tuning with Unsloth and Axolotl is, on the whole, a well thought-out experience where a lot of the complexity is handled for you. However, one area that consistently trips people up is quantization, specifically which quant to pick when you export and save a newly trained model.
The aim of this post is to give you a simple mental model you can reuse when you have to make a quantization decision, along with a short overview of the common quant methods in Unsloth and how “Hugging Face models”, merged 16-bit checkpoints, and GGUF exports relate to each other.
While I'm using Unsloth in many of the examples, this logic applies to any LoRA fine-tune (Axolotl, TRL, etc).
Only here for the quants?
The next few sections are a relatively low-level explanation of quantization. If you are just looking for details on the available quants and how to pick one then jump to the Common GGUF quantization options section or Putting it all together section.
Note on Server-Side Deployment (vLLM/TGI)
This guide focuses on GGUF, which is the standard for local deployment (Ollama, LM Studio, Laptops).
If you are deploying to high-end NVIDIA GPUs via vLLM, you typically have two choices:
- FP16: The merged checkpoint discussed below. Best quality, highest VRAM usage.
- AWQ/GPTQ: Specialised formats for NVIDIA GPUs. If you need these, the "Mental Model" section below still applies (Training → Merged → Quant), but you will swap GGUF for AWQ in the final step.
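For the FP16 option, serving the merged checkpoint with vLLM looks roughly like this. This is a minimal sketch, not a production config: the model path is a placeholder and defaults vary between vLLM versions.

```python
# Minimal sketch: serving a merged FP16 checkpoint with vLLM.
# "outputs/merged-fp16" is a placeholder for your merged folder or Hub repo id.
from vllm import LLM, SamplingParams

llm = LLM(model="outputs/merged-fp16", dtype="float16")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarise what quantization does."], params)
print(outputs[0].outputs[0].text)
```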
What is quantization?
Quantization is the act of compressing model weights to fewer bits to save memory and bandwidth, at some quality cost. In practice, you take weights that are stored in high-precision floats, for example 16 or 32-bit, and map them into a much smaller set of allowed values, for example 8, 6, or 4-bit integers. Think of it like rounding from microseconds to milliseconds. You keep the overall timing but lose the tiny details that don't often matter.
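As a toy illustration (plain NumPy, not any specific GGUF scheme), symmetric 8-bit quantization maps each float onto an integer grid via a single scale factor and reconstructs an approximation at load time:

```python
import numpy as np

# A handful of FP32 "weights".
w = np.array([0.12, -0.53, 0.91, -0.07, 0.33], dtype=np.float32)

# Symmetric 8-bit quantization: one scale maps floats onto integers in [-127, 127].
scale = np.abs(w).max() / 127
q = np.round(w / scale).astype(np.int8)   # what gets stored (8 bits per weight)
w_hat = q.astype(np.float32) * scale      # what the runtime reconstructs

print(q)                         # e.g. [ 17 -74 127 -10  46]
print(np.abs(w - w_hat).max())   # tiny rounding error: the "lost microseconds"
```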
What quantization changes under the hood
Quantization reduces the precision of the model weights so the runtime can load fewer bytes and perform cheaper operations. A GGUF file is essentially the same set of tensors the model was trained with, but each tensor has been compressed into a lower-bit representation with additional scaling metadata so the runtime can reconstruct approximate values at inference time.
Weight quantization
Transformer layers contain large FP16 (16-bit floating point) matrices. Quantization converts each matrix into:
- Low-bit integer blocks (e.g. 4-bit, 6-bit, 8-bit values)
- Per-row or per-channel scale factors (tiny volume knobs that rescale each row or channel back to the right loudness after compression) that let the runtime map integers back to approximate floating-point ranges.
- Optional clustering codebooks (the "K" schemes) that store centroids and index codes, improving fidelity at lower bit-rates.
Lower bit-rates shrink the file and increase throughput, but introduce quantization noise - small distortions in the weight matrices. Bigger models absorb this noise more gracefully whereas small models lose capacity quickly.
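Here is a rough NumPy sketch of that block-plus-scale idea (purely illustrative, not the real GGUF k-quant layout): each block of weights gets its own scale, and the reconstruction error grows as the bit width shrinks.

```python
import numpy as np

def blockwise_quantize(weights, bits=4, block_size=32):
    """Quantize a 1D weight array block by block, one scale per block (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1                             # 127 for 8-bit, 31 for 6-bit, 7 for 4-bit
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    q = np.round(blocks / scales)                          # low-bit integer codes
    return (q * scales).reshape(-1)                        # dequantized approximation

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)      # a fake weight row

for bits in (8, 6, 4):
    err = np.abs(w - blockwise_quantize(w, bits=bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.6f}")          # noise grows as bits shrink
```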
What quantization does not change
- Only the weights are quantized.
- Activations (the temporary values a model produces while thinking, like notes written on scrap paper during a calculation) and KV caches remain FP16/FP32 unless the runtime adds separate KV-cache quantization (a different optimisation entirely).
- The model's architecture, vocabulary, context window, and training data remain unchanged.
Quantization is therefore a deployment-time tradeoff: we accept approximation error in return for lower memory, lower latency, and cheaper inference.
A note on KV caches
During inference, transformer models build a key–value cache: a running memory of all previous tokens. This cache is stored at full precision (FP16/FP32) in almost every runtime because it changes on every new token. Think of it as the model’s scratchpad while it generates text. Quantization in GGUF does not reduce the size of this cache, because only the static model weights are compressed. Some runtimes offer separate KV-cache quantization, which trades a small drop in generation quality for lower memory use and faster long-sequence decoding, but this is independent of weight quantization.
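To get a feel for the numbers, here is a back-of-the-envelope cache size estimate. It assumes an FP16 cache and a grouped-query-attention layout; the config values below are roughly those of a Llama-3-8B-class model and are only an example.

```python
# Rough KV-cache size estimate: 2 (K and V) x layers x KV heads x head_dim x tokens x bytes.
def kv_cache_gb(layers, kv_heads, head_dim, context_len, batch=1, bytes_per_value=2):
    total_bytes = 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per_value
    return total_bytes / 1024**3

# Example: 32 layers, 8 KV heads, head_dim 128, 8k context, batch 1.
print(f"{kv_cache_gb(32, 8, 128, 8192):.2f} GB")   # ~1 GB of cache on top of the weights
```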
Quantization in Unsloth
When exporting or saving models with Unsloth, you will encounter several seemingly unrelated options:
- 4-bit QLoRA
- save_pretrained_merged(..., "merged_16bit")
- GGUF quant methods such as q8_0 or q4_k_m
These sit at different points in the model lifecycle and are easy to confuse. Unsloth intentionally keeps the underlying theory out of scope, which makes the surface area appear more abstract than it is.
The rest of this post untangles these formats, explains how they relate, and provides a simple mental model for choosing the right quant depending on model size, hardware, and workload.
A simple mental model for model formats
You can think in three stages when you're training and fine-tuning a model:
Training -> merged Hugging Face checkpoint -> deployment quant
Training
During the training phase, we are not choosing our deployment quants, but we are choosing how the base model is loaded into memory. In practice this means choosing the data type (dtype) for the weights, for example 4-bit, 8-bit or 16-bit.
In Unsloth this is configured with the load_in_4bit and load_in_8bit options:
```python
# Loading a model in Unsloth
from unsloth import FastModel
import torch

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-12b-it",
    max_seq_length = 4096,
    load_in_4bit = True,
    load_in_8bit = False,
)
```
Both of these options control how the base weights are represented in GPU memory during training.
load_in_4bit
This loads the base model in 4-bit quantized format (QLoRA style). This cuts VRAM use significantly, so you can fine-tune larger models on GPUs with less VRAM, while keeping the base weights frozen and only training LoRA adapters.
load_in_8bit
This loads the base model in 8-bit. This uses more VRAM than 4-bit, but has higher precision and can be a good choice if you have enough memory.
Which one do I choose?
You normally pick one of these, or neither (both False) if you want the base model to be loaded in FP16 (Floating-point 16). The choice only affects training-time memory and compute, not the final exported and saved model.
Later, when you call the following code, Unsloth writes out a standard merged FP16 Hugging Face model - think of this as the base model with the LoRA adapters baked in.
```python
model.save_pretrained_merged(
    merged_dir,
    tokenizer,
    save_method="merged_16bit",
)
```
Any GGUF exports you create afterwards are derived from that merged FP16 checkpoint, not from the 4-bit or 8-bit training representation.
An example of this would be:
Load in 8-bit representation of base model for training -> save merged FP16 checkpoint (base + LoRA) -> export to GGUF with 4-bit quant to get a smaller deployment representation of the trained model
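In Unsloth, that flow is two calls once training finishes. This is a sketch built from the save_pretrained_merged call shown above and Unsloth's save_pretrained_gguf export helper (mentioned again in the advanced section at the end); directory names are placeholders.

```python
# Sketch of the flow: training dtype -> merged FP16 checkpoint -> GGUF deployment quant.
# Assumes `model` and `tokenizer` are the objects from the training code earlier.

# 1. Bake the LoRA adapters into the base weights and save an FP16 checkpoint.
model.save_pretrained_merged(
    "outputs/merged-fp16",          # placeholder directory
    tokenizer,
    save_method="merged_16bit",
)

# 2. Derive a smaller deployment quant from that checkpoint.
model.save_pretrained_gguf(
    "outputs/gguf",                 # placeholder directory
    tokenizer,
    quantization_method="q4_k_m",
)
```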
Hugging Face model and merged_16bit
You can think of a Hugging Face model as a folder (or Hub repo) that contains at least:
- config.json – architecture and hyperparameters
- weights – usually one or more *.safetensors files
- tokenizer files – tokenizer.json, tokenizer_config.json, special_tokens_map.json, etc.
When you call from_pretrained, the following roughly happens:
- Resolve the model name or Hub URL to a folder, read config.json to decide which transformers class to instantiate (for example LlamaForCausalLM) and with which sizes, then create an empty model of that shape.
- Load the weights from the *.safetensors (or pytorch_model.bin) files into that model.
- Load the tokenizer from the tokenizer files.
After that, you have an in-memory model and tokenizer that you can train or run inference with.
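In plain transformers code, those steps look roughly like this (a minimal sketch; the folder name is a placeholder for any Hugging Face model directory or Hub repo id):

```python
# Minimal sketch of the from_pretrained flow described above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "outputs/merged-fp16"                         # placeholder path or repo id

model = AutoModelForCausalLM.from_pretrained(model_dir)   # reads config.json + *.safetensors
tokenizer = AutoTokenizer.from_pretrained(model_dir)      # reads the tokenizer files

inputs = tokenizer("Hello", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```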
When you call the following, Unsloth does a couple of extra steps before writing that folder.
```python
model.save_pretrained_merged(
    merged_dir,
    tokenizer,
    save_method="merged_16bit",
)
```
- Take the base model weights and apply the LoRA adapters from your fine-tune to those weights (merge the deltas into the base).
- Save the merged weights to disk in 16-bit floating point (FP16), alongside the config and tokenizer files, as a standard Hugging Face model directory.
So you can think of merged_16bit as:
base model + LoRA adapters → one FP16 checkpoint in standard Hugging Face format.
Treat this merged FP16 checkpoint as your "canonical high-fidelity checkpoint". All later actions you take on the model, such as GGUF exports, further fine-tunes, and comparisons with quantized versions, should conceptually start from this FP16 checkpoint. If you ever regret a quant choice, you come back to this artefact and quantize again.
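For example, if a q4_k_m export turns out to be too lossy, you reload the merged folder and export a different quant. This is a sketch using the same Unsloth calls as above; the paths are placeholders.

```python
# Sketch: re-quantize from the canonical FP16 checkpoint, not from an earlier GGUF.
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained("outputs/merged-fp16")   # placeholder path

model.save_pretrained_gguf(
    "outputs/gguf-q6_k",
    tokenizer,
    quantization_method="q6_k",
)
```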
Quick recap
- The training data type (dtype) is an implementation and configuration detail.
- The merged FP16 checkpoint is the artefact that all future actions branch from.
- The deployment quant is how you choose to save and export your model for serving.
Common GGUF quantization options
Here are the common GGUF quant options and when it makes sense to use them.
| Quant | Memory usage | Quality vs FP16 | Typical use |
|---|---|---|---|
| q8_0 | High | Very close / “safe” | High-quality reference, evals, when VRAM ok |
| q6_k | Medium | Close | Balanced default for 7B–14B on decent GPUs |
| q5_k_m | Low | Slight drop | When VRAM is tight but quality still matters |
| q4_k_m | Very low | Noticeable drop | Helpers, CPUs, very tight VRAM, laptop demos |
q8_0
q8_0 is a near-lossless, 8-bit quant very close to FP16. Use this as a reference export for evals and a safe choice when you are unsure and have enough VRAM (we will cover what enough is in a later section).
q6_k
q6_k is a 6-bit quant method that compresses the model more than q8_0, while usually staying close in quality. The “k” just refers to a particular blockwise scheme used in GGUF and is beyond the scope of this post.
In practice, q6_k is a good balanced default when you are serving 7B–14B models on 24–48 GB GPUs. You save a decent amount of VRAM compared to q8_0, but for most chat and assistant workloads the behaviour is still very similar to your FP16 / q8_0 reference.
q5_k_m
q5_k_m is a 5-bit quant method that pushes compression a bit further than q6_k. It uses a similar blockwise scheme and typically trades a small amount of quality for lower VRAM and higher throughput.
In practice, q5_k_m is a good option when you are hosting 7B–14B models and are a bit tighter on VRAM or cost. For many chat and assistant-style workloads you will not notice much difference from q6_k, but you free up more memory for longer context windows, larger batches, or extra models on the same GPU.
q4_k_m
q4_k_m is a 4-bit quant method and one of the more aggressive options you will see. It gives you a big reduction in model size compared to q8_0 and q6_k, but with a more noticeable quality drop, especially on harder reasoning and code generation tasks.
In practice, q4_k_m is best used when hardware is the main constraint: CPU-only deployments, small GPUs, or situations where you need to squeeze a larger model onto limited VRAM. It can work very well for helper models (routers, classifiers, retrieval helpers) and lightweight chat, but for small base models (for example 1B–3B) or safety-critical assistants it is usually worth starting with a less aggressive quant like q6_k or q8_0.
Other options
There are many more GGUF quant names in the wild (fast_quantized, q3_k_m, iq2_xxs, etc). Most of them are either presets that pick one of the schemes above for you, or more extreme 2–3 bit formats for very constrained hardware.
If you are just getting started, you can safely ignore them and focus on the four we've covered.
How to choose a GGUF quant
Step 0: check model size
Model size matters when choosing a quant. Smaller models are generally less capable, so heavily compressing them increases the risk of a noticeable quality drop. They also need far less memory and compute, so you can usually afford a stronger, less aggressive quant as your default.
Here are some general guidelines:
≤ 3B
Prefer q8_0 or q6_k as your main export. Only use q4_k_m for very simple classifiers or routers, and only after checking a small eval set.
7–8B
Default to q6_k, but experiment with q5_k_m if memory is tight and your evals still look good.
≥ 12–13B
Choose between q6_k and q5_k_m based on VRAM and cost. q4_k_m is acceptable for non-critical workloads or helper models when you need to squeeze into limited hardware.
Does FP16 vs GGUF affect this?
Not really. The size rules above are about how aggressively you quantize a given base model, based on its parameter count. Your merged FP16 Hugging Face checkpoint is the “truth” model; a GGUF quant is just a compressed version of that checkpoint. The model-size decision is about bits-per-weight × number of parameters, regardless of whether those weights live in a HF folder or a GGUF file.
What about MoE models?
For MoE (Mixture of Experts) models, be conservative, as they are a trap for VRAM planning. They are sparse in compute but dense in VRAM.
While they run fast (they only use a fraction of their parameters per token), you must still load every single expert into GPU memory for them to run.
Ignore the active parameter count and size them by their total parameter count.
For quantization, it's important to remember that MoEs are already sparse (a form of compression in themselves), so they can be more sensitive to aggressive quantization. Avoid q4_k_m and prefer q5_k_m or above to ensure the experts chosen for each token retain a useful level of precision.
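As a quick example (rough, nominal bits per weight, ignoring quant metadata): a Mixtral-8x7B-style MoE has roughly 47B total parameters but only about 13B active per token, and it is the 47B figure that sets your VRAM budget.

```python
# Rough MoE sizing: plan VRAM by total parameters, not active parameters.
total_params = 47e9    # ~Mixtral-8x7B total parameter count (approximate)
active_params = 13e9   # ~parameters active per token (approximate) - irrelevant for sizing

for name, bits in [("q8_0", 8), ("q5_k_m", 5), ("q4_k_m", 4)]:
    weight_gb = total_params * bits / 8 / 1e9
    print(f"{name}: ~{weight_gb:.0f} GB of weights")
```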
Thinking vs instruct models
For "thinking" models or reasoning-focused models that generate longer internal chains of thought, treat them as complex, high-error-cost tasks even if they are not customer-facing. Long reasoning chains tend to amplify small numeric differences, so it is safer to stay with q6_k or q8_0. Short instruct-style answers, classification and routing are usually more tolerant of q5_k_m or q4_k_m.
Step 1: check your hardware
Next, sanity-check what your hardware can actually hold.
As a rough guide for weights only:
- 7B in FP16 ≈ 14 GB, in q8_0 ≈ 7 GB, in q4_k_m ≈ 3.5 GB
- 12B in FP16 ≈ 24 GB, in q8_0 ≈ 12 GB, in q4_k_m ≈ 6 GB
On top of the weights you also need VRAM for KV cache (context length × batch size × layers) and activations, so keep at least 30–50% headroom.
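If you want to sanity-check a specific model and card, a rough estimate from parameter count and nominal bits per weight gets you close enough (a sketch; real GGUF files are somewhat larger because of scales and metadata):

```python
# Rough fit check: nominal weight size plus ~40% headroom for KV cache and activations.
def fit_check(params_billion, bits, vram_gb, headroom=0.4):
    weight_gb = params_billion * bits / 8          # e.g. 12B at 8-bit -> ~12 GB
    return weight_gb, weight_gb * (1 + headroom) <= vram_gb

for bits in (16, 8, 6, 5, 4):
    weight_gb, ok = fit_check(params_billion=12, bits=bits, vram_gb=24)
    print(f"12B @ {bits}-bit: ~{weight_gb:.1f} GB weights -> {'fits' if ok else 'too tight'} on 24 GB")
```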
With that in mind:
24–48 GB GPU
- For 7B–14B models, q6_k is a very comfortable default.
- Keep one q8_0 export around as a high-fidelity reference.
- Use q5_k_m if you want longer context or multiple models on the same card.
8–16 GB GPU or mixed CPU/GPU
- Use q6_k for the main chat model if it fits with some margin; otherwise drop to q5_k_m.
- Use q4_k_m / q5_k_m for helper models (routers, classifiers, retrieval helpers).
CPU-only / edge / laptop
- You will almost always want q4_k_m or more aggressive quants here, especially for anything above 3B.
- Reserve q6_k / q8_0 for very small models or offline batch jobs where latency is not critical.
Step 2: check the task
The best approach to identifying the right quant for your task is to work back from the following:
- How expensive are my mistakes?
- How hard is the task?
The general rules to apply here are:
High error cost or complex tasks
Use q6_k or q8_0 even on bigger models.
Examples: regulated industries, safety-sensitive or customer-facing assistants, long-form reasoning, code generation, multi-step flows.
Low error cost and simple tasks
q4_k_m / q5_k_m are usually fine.
Examples: retrieval strategy classifiers, router models, intent classifiers.
Putting it all together
When you are staring at a list of GGUF quants, you do not need to remember every detail from this post. You can usually get to a sensible choice by running this checklist:
Check model size
- ≤ 3B: start from q8_0 or q6_k.
- 7–8B: start from q6_k, consider q5_k_m if needed.
- ≥ 12–13B (or MoE): choose between q6_k and q5_k_m, keep q4_k_m for helpers and low-risk use.
Check your hardware
- 24–48 GB GPU: q6_k as default, keep one q8_0 export.
- 8–16 GB GPU: q6_k if it fits, otherwise q5_k_m; use q4_k_m for helpers.
- CPU / edge / laptop: usually q4_k_m or more aggressive, especially above 3B.
Check the task
- High error cost or complex reasoning (regulated chat, code, agents, "thinking" models): favour q6_k or q8_0.
- Low error cost and simple tasks (routers, intent classifiers, retrieval helpers): q4_k_m / q5_k_m are usually fine.
In practice, it helps to export two variants the first time you deploy a new model, for example:
```python
# Example export loop
# Ensure you define hf_token or login via huggingface-cli
quant_methods = ["q8_0", "q6_k"]

for quant in quant_methods:
    print(f"Exporting {quant}...")
    model.push_to_hub_gguf(
        f"{base_repo_id}-{quant}",
        tokenizer,
        quantization_method=quant,
        token=hf_token,
    )
```
Run a small golden set of prompts through both and pick the cheapest quant that hits your quality bar. Keep the q8_0 (or merged FP16) around as your high-fidelity reference, and use the more compressed quant for day-to-day serving.
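If you run your GGUFs locally, llama-cpp-python is one convenient way to do this comparison. The sketch below assumes you have both files on disk; the file names and prompts are placeholders for your own golden set.

```python
# Sketch: run a small golden set through two quants and compare the outputs by hand.
from llama_cpp import Llama

golden_prompts = [
    "Summarise our refund policy in two sentences.",                  # placeholder prompts -
    "Classify this ticket as billing, technical, or other: ...",      # swap in your own golden set
]

for gguf_path in ["model-q8_0.gguf", "model-q6_k.gguf"]:              # placeholder file names
    llm = Llama(model_path=gguf_path, n_ctx=4096, verbose=False)
    print(f"=== {gguf_path} ===")
    for prompt in golden_prompts:
        out = llm(prompt, max_tokens=128, temperature=0.0)            # keep sampling deterministic
        print(out["choices"][0]["text"].strip())
```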
Over time you will build your own defaults per project, but this simple three-step loop should be enough to stop "which quant do I pick?" from blocking you every time you export a model.
Advanced: What about QAT and Dynamic Quants?
You might see terms like QAT (Quantization Aware Training) or Dynamic GGUFs in the Unsloth docs. Here is how they fit into this mental model:
Dynamic GGUFs (The new standard)
When you run model.save_pretrained_gguf, Unsloth now defaults to "Dynamic" quantization. This doesn't change your workflow. It just means the "Quant" step is smarter—it keeps sensitive layers in higher precision (e.g., 6-bit) and drops robust layers to lower precision (e.g., 4-bit), giving you better quality at the same file size.
QAT (The "Pro" lane)
This is a specific training technique where you simulate quantization errors during training so the model learns to adapt to them. It effectively merges the "Training" and "Quant" steps. This is powerful (recovering up to 70% of lost accuracy) but requires a different training setup using torchao. For now, treat this as a specialised tool for when standard quants aren't good enough.