How to Fine-Tune Kimi K2.5 on Your Local Machine — A Practical Guide
Fine-tuning a 1-trillion-parameter model on consumer GPUs? It's possible — here's how.
What is Kimi K2.5?
Released January 27, 2026 by Moonshot AI, Kimi K2.5 is the most powerful open-source multimodal model available. It's a 1-trillion-parameter Mixture-of-Experts (MoE) model — but thanks to its architecture, only 32 billion parameters activate per token.
Why fine-tune it?
The base model already scores 92.3% on OCRBench (beating GPT-5.2), leads in agentic search benchmarks, and matches frontier models in coding. But fine-tuning lets you specialize it for your domain — legal documents, medical records, customer support, internal tooling — and get dramatically better results on your specific use case.
Key specs:
| Spec | Value |
|---|---|
| Total Parameters | 1 Trillion (MoE) |
| Active Parameters | 32B per token |
| Context Length | 256K tokens |
| Architecture | Modified DeepSeek V3 MoE |
| Vision Encoder | MoonViT (400M params) |
| License | MIT (with attribution) |
| Native Quantization | INT4 |
The Reality Check: Hardware Requirements
Let's be upfront — this is a big model. But MoE architecture + quantization + LoRA make fine-tuning feasible on surprisingly accessible hardware.
For Fine-Tuning (LoRA SFT)
| Component | Minimum | Recommended |
|---|---|---|
| GPU | 2× RTX 4090 (48GB total) | 4× RTX 4090 (96GB total) |
| CPU | x86 with AMX support | Intel Sapphire Rapids |
| RAM | 1TB+ (swap OK) | 2TB system memory |
| Storage | 600GB+ NVMe SSD | 1TB+ NVMe SSD |
For Inference After Fine-Tuning
| Component | Minimum |
|---|---|
| GPU | 2× RTX 4090 (48GB total) |
| CPU | x86 with AVX512F support |
| RAM | 600GB+ |
The trick: KTransformers offloads MoE expert layers to CPU/RAM while keeping attention layers on GPU. This is what makes consumer-grade fine-tuning possible. The official benchmark shows 44.55 tokens/s throughput for LoRA SFT on 2× RTX 4090 + Intel 8488C.
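The RAM numbers dwarf the VRAM numbers because the expert weights live in system memory rather than on the GPU. A rough back-of-envelope in Python (my own estimate from the parameter count alone, ignoring activations, KV cache, and optimizer state) shows where the table values come from:
# Approximate weight footprint at different precisions (weights only).
TOTAL_PARAMS = 1e12  # ~1 trillion parameters, dominated by the MoE experts

for label, bytes_per_param in [("INT4", 0.5), ("INT8", 1.0), ("BF16", 2.0)]:
    gb = TOTAL_PARAMS * bytes_per_param / 1e9
    print(f"{label}: ~{gb:,.0f} GB of weights")

# INT4  (~500 GB)   -> roughly the ~600GB download and the inference RAM floor
# BF16 (~2,000 GB)  -> why LoRA SFT recommends ~2TB of system memory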
Don't have this hardware? You can also:
- Run quantized inference (1.58-bit) with just 240GB combined RAM+VRAM via llama.cpp
- Use the Kimi K2.5 API ($0.60/M input tokens, $3/M output tokens) for inference
- Fine-tune on cloud GPUs (AWS p5.48xlarge ~$40-60/hour)
Step 1: Set Up Your Environment
You'll need two separate conda environments — one for training and one for inference. This avoids dependency conflicts.
Training Environment
# Create training environment
conda create -n kt-sft python=3.11
conda activate kt-sft
# Install LLaMA-Factory (training framework)
git clone https://github.com/hiyouga/LlamaFactory.git
cd LlamaFactory
pip install -e .
# Install CUDA and compiler dependencies
conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime
# Install KTransformers (CPU+GPU hybrid backend)
# Get matching wheels from: https://github.com/kvcache-ai/ktransformers/releases
pip install ktransformers-<matching-version>.whl
pip install flash_attn-<matching-version>.whl
Inference Environment
# Create inference environment
conda create -n kt-kernel python=3.11
conda activate kt-kernel
# Install KTransformers
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git checkout kimi_k2.5
git submodule update --init --recursive
cd kt-kernel && ./install.sh
# Install SGLang (for serving)
git clone https://github.com/kvcache-ai/sglang.git
cd sglang
git checkout kimi_k2.5
pip install -e "python[all]"
Step 2: Download the Model
# Download Kimi-K2.5 (INT4 format — ~600GB)
huggingface-cli download moonshotai/Kimi-K2.5 \
--local-dir /path/to/kimi-k2.5
Storage note: The full model is ~600GB. Make sure you have enough space on a fast NVMe drive.
Convert INT4 → BF16 for Training
KTransformers requires BF16 weights for LoRA SFT. The model ships in INT4, so you need to convert:
# Convert to BF16 (required for fine-tuning)
# Follow KTransformers documentation for the conversion script
python convert_weights.py \
--input /path/to/kimi-k2.5 \
--output /path/to/kimi-k2.5-bf16 \
--format bf16
Step 3: Prepare Your Training Data
Kimi K2.5 expects conversation-format data in JSONL. Each line is a training example with roles and content.
Dataset Format
{"conversations": [
{"role": "system", "content": "You are a legal document analyst specializing in contract review."},
{"role": "user", "content": "Review this contract clause and identify potential risks: [clause text]"},
{"role": "assistant", "content": "I've identified three key risks in this clause:\n\n1. **Liability limitation** — The cap of $10,000 is unusually low for...\n2. **Termination clause** — The 30-day notice period only applies to...\n3. **IP assignment** — The broad language 'all work product' could include..."}
]}
For Vision/Document Tasks
If you're fine-tuning for document understanding, include image references:
{"conversations": [
{"role": "user", "content": [
{"type": "text", "text": "Extract all line items from this invoice."},
{"type": "image_url", "image_url": {"url": "file:///path/to/invoice_001.png"}}
]},
{"role": "assistant", "content": "| Item | Quantity | Unit Price | Total |\n|------|----------|------------|-------|\n| Widget A | 100 | $5.00 | $500.00 |\n| Widget B | 50 | $12.00 | $600.00 |\n\n**Subtotal:** $1,100.00\n**Tax (8.5%):** $93.50\n**Total:** $1,193.50"}
]}
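If you generate training data programmatically, a small helper keeps the structure consistent. One caveat: in the actual .jsonl file each record must occupy a single line; the examples above are pretty-printed here for readability. Here's a minimal Python sketch (the helper functions are my own, not part of LLaMA-Factory or the Kimi tooling):
import json

def make_example(system, user, assistant):
    """One text-only record in the 'conversations' layout shown above."""
    return {"conversations": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}

def make_vision_example(prompt, image_path, assistant):
    """One record whose user turn mixes text and a local image reference."""
    return {"conversations": [
        {"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"file://{image_path}"}},
        ]},
        {"role": "assistant", "content": assistant},
    ]}

records = [
    make_example(
        "You are a legal document analyst specializing in contract review.",
        "Review this contract clause and identify potential risks: [clause text]",
        "I've identified three key risks in this clause: ..."),
    make_vision_example(
        "Extract all line items from this invoice.",
        "/path/to/invoice_001.png",
        "| Item | Quantity | Unit Price | Total |\n|---|---|---|---|\n..."),
]

# One JSON object per line -- the JSONL convention the trainer expects.
with open("my_training_data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")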
How Much Data?
- Minimum: 100-200 high-quality examples for basic domain adaptation
- Recommended: 500-2,000 examples for strong specialization
- Quality over quantity — clean, accurate examples matter more than volume
LLaMA-Factory supports several dataset formats. Register your dataset in data/dataset_info.json; the tags block maps the role/content keys used above onto LLaMA-Factory's ShareGPT loader, which otherwise expects from/value keys:
{
  "my_custom_dataset": {
    "file_name": "my_training_data.jsonl",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant",
      "system_tag": "system"
    }
  }
}
Step 4: Configure LoRA Fine-Tuning
Create your training YAML configuration. LoRA (Low-Rank Adaptation) only trains a small fraction of the model's parameters while preserving the base model's knowledge.
Training Config
Create examples/train_lora/kimik2_lora_sft_kt.yaml:
# Model
model_name_or_path: /path/to/kimi-k2.5-bf16
# Training method
stage: sft
finetuning_type: lora
bf16: true
# LoRA configuration
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.1
lora_target: q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj
# KTransformers backend (CPU+GPU hybrid)
use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192
# Dataset
dataset: my_custom_dataset
template: kimi
# Training parameters
output_dir: ./output/kimi-k2.5-lora
num_train_epochs: 3
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 2e-5
warmup_steps: 100
logging_steps: 10
save_steps: 200
# Memory optimization
gradient_checkpointing: true
What Each LoRA Parameter Does
| Parameter | Value | Why |
|---|---|---|
| lora_rank | 16 | Higher = more capacity, more VRAM. 16 is a good balance. |
| lora_alpha | 32 | Scaling factor. Rule of thumb: 2× the rank. |
| lora_dropout | 0.1 | Prevents overfitting on small datasets. |
| lora_target | attention + MLP | Targets the layers that matter most for adaptation. |
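To see why the resulting adapter stays in the hundreds-of-megabytes range, count what LoRA actually adds: a rank-r adapter on a d_in × d_out weight matrix contributes r × (d_in + d_out) trainable parameters (the two low-rank factors). The shapes and layer count below are illustrative placeholders, not Kimi K2.5's published dimensions, and the MoE expert MLPs are treated as a single dense block for simplicity:
# LoRA parameter count under placeholder dimensions (NOT the real model shapes).
rank = 16
num_layers = 60                              # placeholder layer count
hidden = 7168                                # placeholder hidden size
proj_shapes = [(hidden, hidden)] * 4         # q/k/v/o projections (placeholder)
proj_shapes += [(hidden, 4 * hidden)] * 2    # gate/up projections (placeholder)
proj_shapes += [(4 * hidden, hidden)]        # down projection (placeholder)

per_layer = sum(rank * (d_in + d_out) for d_in, d_out in proj_shapes)
total = per_layer * num_layers
print(f"~{total / 1e6:.0f}M trainable params, ~{total * 2 / 1e6:.0f} MB as a BF16 adapter")
Even under these generous assumptions the adapter lands in the 100-500MB range mentioned in the next step, a tiny fraction of the 1T-parameter base.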
Step 5: Run Fine-Tuning
conda activate kt-sft
cd LlamaFactory
USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
Your LoRA adapter will be saved to ./output/kimi-k2.5-lora.
What to Expect
- Training speed: ~44.55 tokens/s on 2× RTX 4090
- Time: Depends on dataset size. 500 examples × 3 epochs ≈ a few hours (see the rough estimate after this list)
- Output size: LoRA adapters are small (typically 100-500MB vs 600GB base model)
- Watch for: Loss curves in your logs — if loss stops decreasing after epoch 1, you may be overfitting
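To turn "a few hours" into a number for your own dataset, divide total training tokens by the reported throughput. A rough sketch, where the average example length is an assumption you should replace with your dataset's real figure:
# Back-of-envelope training time from the reported LoRA SFT throughput.
examples = 500
epochs = 3
avg_tokens_per_example = 500      # assumed average prompt + response length
throughput_tok_s = 44.55          # reported on 2x RTX 4090 + Intel 8488C

total_tokens = examples * epochs * avg_tokens_per_example
hours = total_tokens / throughput_tok_s / 3600
print(f"{total_tokens:,} tokens -> about {hours:.1f} hours")  # ~4.7 hours

# Longer examples scale this linearly: 2,000-token examples would take ~4x as long.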
Step 6: Verify Your Fine-Tuned Model
Before deploying, run a quick sanity check:
conda activate kt-sft
cd LlamaFactory
llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
This launches an interactive chat where you can test your fine-tuned model. Try prompts from your domain and compare with the base model's responses.
Step 7: Deploy for Production
Convert LoRA for SGLang Serving
# Convert LoRA adapter for SGLang compatibility
python ktransformers/kt-kernel/scripts/convert_lora.py \
--base_path /path/to/kimi-k2.5 \
--lora_path ./output/kimi-k2.5-lora \
--output_path ./output/lora_converted
Optional: Compress CPU Weights to INT8
This reduces memory usage for inference:
python ktransformers/kt-kernel/scripts/convert_cpu_weights.py \
--base_path /path/to/kimi-k2.5 \
--output_dir /path/to/kimi-k2.5-int8
Launch the Server
conda activate kt-kernel
python -m sglang.launch_server \
--enable-lora \
--lora-paths my_adapter=/path/to/lora_converted \
--lora-backend triton \
--model-path /path/to/kimi-k2.5 \
--tp 2 \
--trust-remote-code \
--context-length 4096 \
--kt-weight-path /path/to/kimi-k2.5-int8/int8 \
--mem-fraction-static 0.9
Your fine-tuned model is now served as an OpenAI-compatible API at http://localhost:30000.
Test the API
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="default",
messages=[
{"role": "user", "content": "Your domain-specific prompt here"}
],
temperature=1.0,
top_p=0.95,
)
print(response.choices[0].message.content)
Alternative: Running Quantized Inference with llama.cpp
If you don't need fine-tuning and just want to run Kimi K2.5 locally for inference, you can use Unsloth's GGUF quantizations with llama.cpp:
# Build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j \
--target llama-cli llama-server
# Run the 1.58-bit quant (~240GB)
LLAMA_SET_ROWS=1 ./llama.cpp/build/bin/llama-cli \
-hf unsloth/Kimi-K2.5-GGUF:UD-TQ1_0 \
--temp 1.0 \
--min-p 0.01 \
--top-p 0.95 \
--ctx-size 16384 \
--fit on \
--jinja
Available quantizations from Unsloth:
| Quant | Size | Best For |
|---|---|---|
| UD-TQ1_0 (1.58-bit) | ~240GB | Minimum viable, experimentation |
| UD-Q2_K_XL (2-bit) | ~375GB | Good quality/size balance |
| UD-Q4_K_XL (4-bit) | ~630GB | Near full precision |
Kimi K2.5 vs DeepSeek OCR 2: Which Should You Fine-Tune?
If you're deciding between the two (we covered DeepSeek OCR 2 fine-tuning in our previous post):
| | DeepSeek OCR 2 | Kimi K2.5 |
|---|---|---|
| Parameters | 3B | 1T (32B active) |
| GPU needed | Single 8GB GPU | 2-4× RTX 4090 minimum |
| Best for | Document OCR, text extraction | Everything — coding, vision, reasoning, agents |
| Fine-tune time | Hours | Hours to days |
| Use case | "Extract text from 10K invoices" | "Build an AI that understands our entire business" |
Rule of thumb: If your task is primarily document reading/extraction, use DeepSeek OCR 2. If you need a general-purpose AI that can reason, code, use tools, AND understand documents, fine-tune Kimi K2.5.
Tips and Gotchas
- Sampling matters. Kimi K2.5 performs best with temperature=1.0, top_p=0.95, min_p=0.01. These are unusually high settings but are officially recommended by Moonshot AI (see the snippet after this list).
- Don't skip the BF16 conversion. The model ships in INT4. You must convert to BF16 before LoRA SFT. Trying to fine-tune on INT4 weights directly will fail.
- LoRA adapters are portable. Your fine-tuned adapter is only a few hundred MB. You can share it, version it, and swap adapters without redownloading the 600GB base model.
- Vision fine-tuning note. As of early 2026, llama.cpp GGUF doesn't support Kimi K2.5's MoonViT vision encoder yet. For vision tasks, use vLLM or SGLang with the full model.
- Memory management. If you hit OOM during training, reduce chunk_size in the YAML config, or increase gradient_accumulation_steps while decreasing per_device_train_batch_size.
- Thinking mode. Kimi K2.5 has thinking mode enabled by default. For fine-tuning, decide upfront whether you want your model to think (use temperature=1.0) or respond instantly (use temperature=0.6 and disable thinking in the template).
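The sampling tip above recommends min_p=0.01, which the OpenAI client doesn't expose as a named argument. The OpenAI Python SDK's extra_body passes additional fields through to the server; whether your SGLang build accepts a min_p field this way is an assumption to verify against your server version, so treat this as a sketch rather than a guaranteed recipe:
# Recommended sampling settings, with min_p forwarded via extra_body.
# Assumption: the SGLang OpenAI-compatible endpoint honors a "min_p" field.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Your domain-specific prompt here"}],
    temperature=1.0,
    top_p=0.95,
    extra_body={"min_p": 0.01},  # not part of the standard OpenAI schema
)
print(response.choices[0].message.content)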
What's Next?
Once fine-tuned, Kimi K2.5 becomes a powerful foundation for:
- Domain-specific AI assistants — legal, medical, financial
- Agentic workflows — using Agent Swarm to parallelize complex tasks
- Visual coding — generating code from screenshots and designs
- Document processing pipelines — combined with OCR for end-to-end automation
The model's MIT license means you can deploy commercially with attribution. No API costs, no data leaving your servers, full control.
Kimi K2.5 is available on Hugging Face and through the Moonshot AI API. Fine-tuning guide based on the official KTransformers SFT documentation.