How to Fine-Tune Kimi K2.5 on Your Local Machine — A Practical Guide

Fine-tuning a 1-trillion-parameter model on consumer GPUs? It's possible — here's how.


What is Kimi K2.5?

Released January 27, 2026 by Moonshot AI, Kimi K2.5 is the most powerful open-source multimodal model available. It's a 1-trillion-parameter Mixture-of-Experts (MoE) model — but thanks to its architecture, only 32 billion parameters activate per token.

Why fine-tune it?

The base model already scores 92.3% on OCRBench (beating GPT-5.2), leads in agentic search benchmarks, and matches frontier models in coding. But fine-tuning lets you specialize it for your domain — legal documents, medical records, customer support, internal tooling — and get dramatically better results on your specific use case.

Key specs:

| Spec | Value |
| --- | --- |
| Total Parameters | 1 Trillion (MoE) |
| Active Parameters | 32B per token |
| Context Length | 256K tokens |
| Architecture | Modified DeepSeek V3 MoE |
| Vision Encoder | MoonViT (400M params) |
| License | MIT (with attribution) |
| Native Quantization | INT4 |

The Reality Check: Hardware Requirements

Let's be upfront — this is a big model. But MoE architecture + quantization + LoRA make fine-tuning feasible on surprisingly accessible hardware.

For Fine-Tuning (LoRA SFT)

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | 2× RTX 4090 (48GB total) | 4× RTX 4090 (96GB total) |
| CPU | x86 with AMX support | Intel Sapphire Rapids |
| RAM | 1TB+ (swap OK) | 2TB system memory |
| Storage | 600GB+ NVMe SSD | 1TB+ NVMe SSD |

For Inference After Fine-Tuning

| Component | Minimum |
| --- | --- |
| GPU | 2× RTX 4090 (48GB total) |
| CPU | x86 with AVX512F support |
| RAM | 600GB+ |

The trick: KTransformers offloads MoE expert layers to CPU/RAM while keeping attention layers on GPU. This is what makes consumer-grade fine-tuning possible. The official benchmark shows 44.55 tokens/s throughput for LoRA SFT on 2× RTX 4090 + Intel 8488C.
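To see why the split works, here's some back-of-envelope arithmetic (my own rough sketch, counting weight bytes only and ignoring activations, KV cache, and framework overhead):

# Rough memory math for the CPU+GPU split (weight bytes only).
TOTAL_PARAMS = 1_000_000_000_000    # 1T total parameters (MoE)
ACTIVE_PARAMS = 32_000_000_000      # 32B active per token

BYTES_INT4 = 0.5    # native INT4 checkpoint (inference)
BYTES_BF16 = 2.0    # BF16 weights required for LoRA SFT

def gb(num_bytes: float) -> float:
    return num_bytes / 1024**3

# Expert weights live in system RAM; attention/dense layers, LoRA adapters,
# and activations share the GPUs' 48GB+ of VRAM.
print(f"INT4 weights (inference): ~{gb(TOTAL_PARAMS * BYTES_INT4):,.0f} GB of RAM")
print(f"BF16 weights (LoRA SFT):  ~{gb(TOTAL_PARAMS * BYTES_BF16):,.0f} GB of RAM")
print(f"Active path per token:    ~{gb(ACTIVE_PARAMS * BYTES_BF16):,.0f} GB if it all sat on GPU")

The figures line up with the tables above: a bit under 500GB of RAM for the INT4 inference weights, close to 2TB for the BF16 training copy, and more than 48GB even for the active path alone in BF16, which is exactly why the expert layers get offloaded to RAM.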

Don't have this hardware? You can also:

  • Run quantized inference (1.58-bit) with just 240GB combined RAM+VRAM via llama.cpp
  • Use the Kimi K2.5 API ($0.60/M input tokens, $3/M output tokens) for inference (rough cost math in the sketch below)
  • Fine-tune on cloud GPUs (AWS p5.48xlarge ~$40-60/hour)
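If you're weighing the API route against buying hardware, the per-request math is simple. This just applies the listed prices to hypothetical traffic:

# Quick cost estimate for the API route ($0.60/M input, $3.00/M output tokens).
PRICE_IN_PER_M = 0.60
PRICE_OUT_PER_M = 3.00

def api_cost_usd(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * PRICE_IN_PER_M + (output_tokens / 1e6) * PRICE_OUT_PER_M

# Example: 10,000 requests averaging ~2,000 input and ~500 output tokens each
print(f"${api_cost_usd(10_000 * 2_000, 10_000 * 500):,.2f}")   # -> $27.00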

Step 1: Set Up Your Environment

You'll need two separate conda environments — one for training and one for inference. This avoids dependency conflicts.

Training Environment

# Create training environment
conda create -n kt-sft python=3.11
conda activate kt-sft

# Install LLaMA-Factory (training framework)
git clone https://github.com/hiyouga/LlamaFactory.git
cd LlamaFactory
pip install -e .

# Install CUDA and compiler dependencies
conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime

# Install KTransformers (CPU+GPU hybrid backend)
# Get matching wheels from: https://github.com/kvcache-ai/ktransformers/releases
pip install ktransformers-<matching-version>.whl
pip install flash_attn-<matching-version>.whl

Inference Environment

# Create inference environment
conda create -n kt-kernel python=3.11
conda activate kt-kernel

# Install KTransformers
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git checkout kimi_k2.5
git submodule update --init --recursive
cd kt-kernel && ./install.sh

# Install SGLang (for serving)
cd ../..   # back to the directory you cloned into
git clone https://github.com/kvcache-ai/sglang.git
cd sglang
git checkout kimi_k2.5
pip install -e "python[all]"

Step 2: Download the Model

# Download Kimi-K2.5 (INT4 format — ~600GB)
huggingface-cli download moonshotai/Kimi-K2.5 \
  --local-dir /path/to/kimi-k2.5

Storage note: The full model is ~600GB. Make sure you have enough space on a fast NVMe drive.
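If you'd rather script the download (for retries, or to kick it off from Python), the huggingface_hub library does the same job as the CLI; the repo id is the one used above:

# Programmatic equivalent of the huggingface-cli command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="moonshotai/Kimi-K2.5",
    local_dir="/path/to/kimi-k2.5",
    max_workers=8,   # parallel file downloads; tune for your connection
)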

Convert INT4 → BF16 for Training

KTransformers requires BF16 weights for LoRA SFT. The model ships in INT4, so you need to convert:

# Convert to BF16 (required for fine-tuning)
# Follow KTransformers documentation for the conversion script
python convert_weights.py \
  --input /path/to/kimi-k2.5 \
  --output /path/to/kimi-k2.5-bf16 \
  --format bf16

Step 3: Prepare Your Training Data

Kimi K2.5 expects conversation-format data in JSONL. Each line of the file is one training example with roles and content (the examples below are pretty-printed across several lines for readability; in the actual file, each example sits on a single line).

Dataset Format

{"conversations": [
  {"role": "system", "content": "You are a legal document analyst specializing in contract review."},
  {"role": "user", "content": "Review this contract clause and identify potential risks: [clause text]"},
  {"role": "assistant", "content": "I've identified three key risks in this clause:\n\n1. **Liability limitation** — The cap of $10,000 is unusually low for...\n2. **Termination clause** — The 30-day notice period only applies to...\n3. **IP assignment** — The broad language 'all work product' could include..."}
]}

For Vision/Document Tasks

If you're fine-tuning for document understanding, include image references:

{"conversations": [
  {"role": "user", "content": [
    {"type": "text", "text": "Extract all line items from this invoice."},
    {"type": "image_url", "image_url": {"url": "file:///path/to/invoice_001.png"}}
  ]},
  {"role": "assistant", "content": "| Item | Quantity | Unit Price | Total |\n|------|----------|------------|-------|\n| Widget A | 100 | $5.00 | $500.00 |\n| Widget B | 50 | $12.00 | $600.00 |\n\n**Subtotal:** $1,100.00\n**Tax (8.5%):** $93.50\n**Total:** $1,193.50"}
]}
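Before registering the dataset, it's worth a quick sanity pass over the JSONL. This small script is my own helper (not part of LLaMA-Factory); it only checks that every line parses and that roles and content match the shapes shown above:

# Minimal sanity check for the conversation-format JSONL above.
import json
import sys

VALID_ROLES = {"system", "user", "assistant"}

def check(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            example = json.loads(line)   # every line must be standalone JSON
            for turn in example["conversations"]:
                assert turn["role"] in VALID_ROLES, f"line {i}: bad role {turn['role']!r}"
                # content is either a plain string or a list of typed parts (vision)
                assert isinstance(turn["content"], (str, list)), f"line {i}: bad content"
    print("dataset looks OK")

if __name__ == "__main__":
    check(sys.argv[1])   # python check_dataset.py my_training_data.jsonl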

How Much Data?

  • Minimum: 100-200 high-quality examples for basic domain adaptation
  • Recommended: 500-2,000 examples for strong specialization
  • Quality over quantity — clean, accurate examples matter more than volume

LLaMA-Factory supports several dataset formats. Register your dataset in data/dataset_info.json:

{
  "my_custom_dataset": {
    "file_name": "my_training_data.jsonl",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations"
    }
  }
}

Step 4: Configure LoRA Fine-Tuning

Create your training YAML configuration. LoRA (Low-Rank Adaptation) only trains a small fraction of the model's parameters while preserving the base model's knowledge.
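To get a feel for how small the trainable slice is: each adapted weight matrix of shape (d_out, d_in) gains two low-rank factors totalling rank × (d_in + d_out) parameters. The hidden size and layer count below are placeholders rather than Kimi K2.5's actual dimensions, and MoE expert MLPs are glossed over, so treat the result as an order-of-magnitude check only:

# LoRA adds A (r x d_in) and B (d_out x r) per adapted matrix W (d_out x d_in),
# i.e. r * (d_in + d_out) trainable parameters each.
def lora_params(rank: int, shapes: list[tuple[int, int]]) -> int:
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)

hidden = 7168        # placeholder hidden size, NOT Kimi K2.5's real value
n_layers = 61        # placeholder layer count
per_layer = ([(hidden, hidden)] * 4            # q/k/v/o projections (simplified)
             + [(4 * hidden, hidden)] * 2      # gate/up projections
             + [(hidden, 4 * hidden)])         # down projection

total = lora_params(rank=16, shapes=per_layer) * n_layers
print(f"~{total / 1e6:.0f}M trainable params, ~{total * 2 / 1e6:.0f} MB in BF16")
# -> ~161M params, ~322 MB with these placeholder dims: a few hundred MB
#    of adapter versus the 600GB base model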

Training Config

Create examples/train_lora/kimik2_lora_sft_kt.yaml:

# Model
model_name_or_path: /path/to/kimi-k2.5-bf16

# Training method
stage: sft
finetuning_type: lora
bf16: true

# LoRA configuration
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.1
lora_target: q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj

# KTransformers backend (CPU+GPU hybrid)
use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192

# Dataset
dataset: my_custom_dataset
template: kimi

# Training parameters
output_dir: ./output/kimi-k2.5-lora
num_train_epochs: 3
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 2e-5
warmup_steps: 100
logging_steps: 10
save_steps: 200

# Memory optimization
gradient_checkpointing: true

What Each LoRA Parameter Does

| Parameter | Value | Why |
| --- | --- | --- |
| lora_rank | 16 | Higher = more capacity, more VRAM. 16 is a good balance. |
| lora_alpha | 32 | Scaling factor. Rule of thumb: 2× the rank. |
| lora_dropout | 0.1 | Prevents overfitting on small datasets. |
| lora_target | attention + MLP | Targets the layers that matter most for adaptation. |
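If you've used Hugging Face PEFT before, the YAML block maps roughly onto a LoraConfig like the one below. This is illustrative only; LLaMA-Factory builds the equivalent configuration internally from the YAML:

# Approximate PEFT equivalent of the YAML's LoRA settings (illustrative only).
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                  # lora_rank
    lora_alpha=32,         # the LoRA update is scaled by lora_alpha / r
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)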

Step 5: Run Fine-Tuning

conda activate kt-sft
cd LlamaFactory

USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml

Your LoRA adapter will be saved to ./output/kimi-k2.5-lora.

What to Expect

  • Training speed: ~44.55 tokens/s on 2× RTX 4090
  • Time: Depends on dataset size and example length. 500 examples × 3 epochs ≈ a few hours (rough math in the sketch after this list)
  • Output size: LoRA adapters are small (typically 100-500MB vs 600GB base model)
  • Watch for: Loss curves in your logs — if loss stops decreasing after epoch 1, you may be overfitting
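You can sanity-check the time estimate from the quoted throughput. The average token count per example below is an assumption; measure your own data:

# Rough wall-clock estimate from the benchmark throughput above.
examples = 500
epochs = 3
avg_tokens_per_example = 500    # assumption: short prompt + response pairs
throughput_tok_s = 44.55        # LoRA SFT on 2x RTX 4090 (figure quoted above)

total_tokens = examples * epochs * avg_tokens_per_example
print(f"~{total_tokens / throughput_tok_s / 3600:.1f} hours")   # -> ~4.7 hours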

Step 6: Verify Your Fine-Tuned Model

Before deploying, run a quick sanity check:

conda activate kt-sft
cd LlamaFactory

llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml

This launches an interactive chat where you can test your fine-tuned model. Try prompts from your domain and compare with the base model's responses.


Step 7: Deploy for Production

Convert LoRA for SGLang Serving

# Convert LoRA adapter for SGLang compatibility
python ktransformers/kt-kernel/scripts/convert_lora.py \
  --base_path /path/to/kimi-k2.5 \
  --lora_path ./output/kimi-k2.5-lora \
  --output_path ./output/lora_converted

Optional: Compress CPU Weights to INT8

This reduces memory usage for inference:

python ktransformers/kt-kernel/scripts/convert_cpu_weights.py \
  --base_path /path/to/kimi-k2.5 \
  --output_dir /path/to/kimi-k2.5-int8

Launch the Server

conda activate kt-kernel

python -m sglang.launch_server \
  --enable-lora \
  --lora-paths my_adapter=/path/to/lora_converted \
  --lora-backend triton \
  --model-path /path/to/kimi-k2.5 \
  --tp 2 \
  --trust-remote-code \
  --context-length 4096 \
  --kt-weight-path /path/to/kimi-k2.5-int8/int8 \
  --mem-fraction-static 0.9

Your fine-tuned model is now served as an OpenAI-compatible API at http://localhost:30000.

Test the API

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "Your domain-specific prompt here"}
    ],
    temperature=1.0,
    top_p=0.95,
)

print(response.choices[0].message.content)

Alternative: Running Quantized Inference with llama.cpp

If you don't need fine-tuning and just want to run Kimi K2.5 locally for inference, you can use Unsloth's GGUF quantizations with llama.cpp:

# Build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j \
  --target llama-cli llama-server

# Run the 1.58-bit quant (~240GB)
LLAMA_SET_ROWS=1 ./llama.cpp/build/bin/llama-cli \
  -hf unsloth/Kimi-K2.5-GGUF:UD-TQ1_0 \
  --temp 1.0 \
  --min-p 0.01 \
  --top-p 0.95 \
  --ctx-size 16384 \
  --fit on \
  --jinja

Available quantizations from Unsloth:

| Quant | Size | Best For |
| --- | --- | --- |
| UD-TQ1_0 (1.58-bit) | ~240GB | Minimum viable, experimentation |
| UD-Q2_K_XL (2-bit) | ~375GB | Good quality/size balance |
| UD-Q4_K_XL (4-bit) | ~630GB | Near full precision |
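A quick way to see which quant your machine can hold, counting combined RAM + VRAM as in the 240GB figure above (weights only; leave headroom for KV cache and the OS):

# Which quant fits in combined RAM + VRAM? (weight sizes from the table above)
QUANT_SIZES_GB = {"UD-TQ1_0": 240, "UD-Q2_K_XL": 375, "UD-Q4_K_XL": 630}

ram_gb, vram_gb = 256, 48    # example machine: 256GB RAM + 2x RTX 4090
budget_gb = ram_gb + vram_gb
for name, size_gb in QUANT_SIZES_GB.items():
    verdict = "fits" if size_gb < budget_gb else "too big"
    print(f"{name}: {verdict} ({size_gb}GB vs {budget_gb}GB available)")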

Kimi K2.5 vs DeepSeek OCR 2: Which Should You Fine-Tune?

If you're deciding between the two (we covered DeepSeek OCR 2 fine-tuning in our previous post):

|  | DeepSeek OCR 2 | Kimi K2.5 |
| --- | --- | --- |
| Parameters | 3B | 1T (32B active) |
| GPU needed | Single 8GB GPU | 2-4× RTX 4090 minimum |
| Best for | Document OCR, text extraction | Everything — coding, vision, reasoning, agents |
| Fine-tune time | Hours | Hours to days |
| Use case | "Extract text from 10K invoices" | "Build an AI that understands our entire business" |

Rule of thumb: If your task is primarily document reading/extraction, use DeepSeek OCR 2. If you need a general-purpose AI that can reason, code, use tools, AND understand documents, fine-tune Kimi K2.5.


Tips and Gotchas

  1. Sampling matters. Kimi K2.5 performs best with temperature=1.0, top_p=0.95, min_p=0.01. These are unusually high settings but are officially recommended by Moonshot AI.

  2. Don't skip the BF16 conversion. The model ships in INT4. You must convert to BF16 before LoRA SFT. Trying to fine-tune on INT4 weights directly will fail.

  3. LoRA adapters are portable. Your fine-tuned adapter is only a few hundred MB. You can share it, version it, and swap adapters without redownloading the 600GB base model.

  4. Vision fine-tuning note. As of early 2026, llama.cpp GGUF doesn't support Kimi K2.5's MoonViT vision encoder yet. For vision tasks, use vLLM or SGLang with the full model.

  5. Memory management. If you hit OOM during training, reduce chunk_size in the YAML config, or increase gradient_accumulation_steps while decreasing per_device_train_batch_size (the sketch after these tips shows the effective batch size staying constant).

  6. Thinking mode. Kimi K2.5 has thinking mode enabled by default. For fine-tuning, decide upfront whether you want your model to think (use temperature=1.0) or respond instantly (use temperature=0.6 and disable thinking in the template).
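On tip 5: trading per-device batch size for gradient accumulation lowers peak activation memory without changing how many examples each optimizer step sees. A trivial sketch of the arithmetic:

# Effective batch size = per-device batch x accumulation steps x number of GPUs.
def effective_batch(per_device: int, grad_accum: int, num_gpus: int = 2) -> int:
    return per_device * grad_accum * num_gpus

print(effective_batch(per_device=1, grad_accum=16))   # 32, matching the YAML above on 2 GPUs
print(effective_batch(per_device=2, grad_accum=8))    # also 32, but higher peak memory per step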


What's Next?

Once fine-tuned, Kimi K2.5 becomes a powerful foundation for:

  • Domain-specific AI assistants — legal, medical, financial
  • Agentic workflows — using Agent Swarm to parallelize complex tasks
  • Visual coding — generating code from screenshots and designs
  • Document processing pipelines — combined with OCR for end-to-end automation

The model's MIT license means you can deploy commercially with attribution. No API costs, no data leaving your servers, full control.


Kimi K2.5 is available on Hugging Face and through the Moonshot AI API. Fine-tuning guide based on the official KTransformers SFT documentation.