How to Fine-Tune DeepSeek OCR V2 on Your Own PDFs — From Install to Inference


A practical, step-by-step guide to running and fine-tuning DeepSeek's 3B-parameter document understanding model on your local machine.


Why DeepSeek OCR V2?

Released January 27, 2026, DeepSeek OCR 2 isn't your typical OCR tool. Traditional OCR scans documents left-to-right, top-to-bottom — like reading a book one pixel row at a time. DeepSeek OCR 2's breakthrough, DeepEncoder V2, reads documents the way humans do: it builds a global understanding of the page layout first, then follows the natural reading order.

The result? Complex tables, multi-column layouts, math equations, and mixed-format documents are handled with state-of-the-art accuracy — all in a model small enough to run on a single GPU.

What makes it worth fine-tuning:

  • Only 3B parameters (runs on 8GB VRAM with quantization)
  • Open-source and fully customizable
  • Fine-tuning has shown 57–86% reduction in Character Error Rate (CER) for domain-specific documents
  • Supports PDF processing out of the box

What You'll Need

Hardware Requirements

Setup              VRAM    Notes
4-bit quantized    8GB+    Good for experimentation
Full precision     16GB+   Best for production fine-tuning

An NVIDIA GPU with CUDA 11.8+ is required. AMD/ROCm support is still in development as of early 2026.

Software Requirements

  • Python 3.12.9+
  • CUDA 11.8+
  • Git
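
Before installing anything, it's worth a quick check that the basics are in place (note that nvcc only appears if the CUDA toolkit itself is installed, not just the driver):

nvidia-smi          # GPU model, driver version, free VRAM
nvcc --version      # CUDA toolkit version (should report 11.8+)
python --version    # 3.12.9+
git --version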

Step 1: Install DeepSeek OCR 2

Clone the Repository

git clone https://github.com/deepseek-ai/DeepSeek-OCR-2.git
cd DeepSeek-OCR-2

Create Your Environment

conda create -n deepseek-ocr2 python=3.12.9 -y
conda activate deepseek-ocr2

Install Dependencies

# PyTorch with CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118

# vLLM (download the 0.8.5 whl from https://github.com/vllm-project/vllm/releases/tag/v0.8.5)
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl

# Project requirements
pip install -r requirements.txt

# Flash Attention (critical for performance)
pip install flash-attn==2.7.3 --no-build-isolation

Note: You may see an installation warning that vllm 0.8.5+cu118 requires transformers>=4.51.1. If you're running both vLLM and Transformers inference in the same environment, this warning is safe to ignore.
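
With the dependencies installed, a quick sanity check that PyTorch can actually see your GPU:

import torch

# Should print your torch/CUDA versions, True, and the name of your NVIDIA GPU
print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))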


Step 2: Run Your First Inference

Before fine-tuning, let's verify everything works by running the base model on a test document.

Option A: Using Transformers (Simplest)

from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR-2'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)

# Convert a document image to structured markdown
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'test_document.jpg'
output_path = './output'

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True
)

Option B: Using vLLM (Faster, Production-Ready)

from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR-2",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

image = Image.open("test_document.png").convert("RGB")
prompt = "<image>\n<|grounding|>Convert the document to markdown."

model_input = [
    {"prompt": prompt, "multi_modal_data": {"image": image}}
]

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td>
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(model_input, sampling_params)
print(outputs[0].outputs[0].text)

Pro tip: The NGramPerReqLogitsProcessor prevents a known repetition issue where the model can loop on the same text (similar to Whisper's failure mode). Always include it.

Supported Prompts

Mode                               Prompt
Structured document → Markdown     <image>\n<|grounding|>Convert the document to markdown.
Free text extraction (no layout)   <image>\nFree OCR.
Parse figures/charts               <image>\nParse the figure.
General image description          <image>\nDescribe this image in detail.
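
For example, to pull raw text with no layout reconstruction, reuse the Option A setup and swap only the prompt:

# Same model and tokenizer as Option A; only the prompt changes
res = model.infer(
    tokenizer,
    prompt="<image>\nFree OCR.",   # raw text, no markdown structure
    image_file='test_document.jpg',
    output_path='./output',
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True
)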

Step 3: Process PDFs

DeepSeek OCR 2 includes a built-in PDF processing pipeline with concurrent page handling.

Using the Built-in PDF Script

cd DeepSeek-OCR2-vllm

Edit config.py to set your paths:

INPUT_PATH = "/path/to/your/pdfs"
OUTPUT_PATH = "/path/to/output"

Then run:

python run_dpsk_ocr2_pdf.py

This handles multi-page PDFs with concurrent processing, running at speeds comparable to the original DeepSeek OCR.

DIY: Convert PDFs to Images First

If you prefer more control, convert PDFs to images and process them individually:

from pdf2image import convert_from_path

# Convert PDF pages to images
pages = convert_from_path('your_document.pdf', dpi=300)

for i, page in enumerate(pages):
    page.save(f'page_{i}.png', 'PNG')
    
    # Then run inference on each page
    res = model.infer(
        tokenizer,
        prompt="<image>\n<|grounding|>Convert the document to markdown.",
        image_file=f'page_{i}.png',
        output_path='./output',
        base_size=1024,
        image_size=768,
        crop_mode=True,
        save_results=True
    )

Step 4: Prepare Your Fine-Tuning Dataset

This is where the real value comes in. Fine-tuning lets you adapt the model to your specific document types — invoices, medical records, legal contracts, non-English documents, or anything with a consistent format.

What You Need

For each training example, you need:

  1. An image of the document (or a page from a PDF)
  2. The expected output (the correct text/markdown you want the model to produce)

Dataset Format

dataset = [
    {
        "image": "path/to/document_001.jpg",
        "text": "# Invoice\n\n| Item | Qty | Price |\n|------|-----|-------|\n| Widget A | 10 | $5.00 |",
        "prompt": "<image>\n<|grounding|>Convert the document to markdown."
    },
    {
        "image": "path/to/document_002.jpg",
        "text": "## Contract Agreement\n\nThis agreement is entered into on...",
        "prompt": "<image>\n<|grounding|>Convert the document to markdown."
    },
    # ... more examples
]

How Many Examples?

Start with 100–500 high-quality examples. Quality matters more than quantity — focus on:

  • Diverse layouts within your domain
  • Accurate ground truth text
  • Edge cases (rotated pages, poor scan quality, mixed languages)
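
However many examples you end up with, hold some back for evaluation so you can measure the improvement later. A minimal sketch, assuming the list-of-dicts format shown under Dataset Format (train_dataset here is what you'll hand to the trainer in Step 5):

import random

# Shuffle and hold out ~10% of examples for evaluation
random.seed(42)
random.shuffle(dataset)

split = int(len(dataset) * 0.9)
train_dataset = dataset[:split]
eval_dataset = dataset[split:]

print(f"train: {len(train_dataset)}, eval: {len(eval_dataset)}")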

Creating Ground Truth from PDFs

If you have digital PDFs (with selectable text), you can bootstrap your dataset:

import os
import fitz  # PyMuPDF
from pdf2image import convert_from_path

def create_training_pair(pdf_path, page_num=0):
    # Extract text as ground truth
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    ground_truth = page.get_text("text")
    
    # Convert page to image
    images = convert_from_path(pdf_path, first_page=page_num+1, last_page=page_num+1, dpi=300)
    os.makedirs("training_data", exist_ok=True)
    image_path = f"training_data/page_{page_num}.png"
    images[0].save(image_path, "PNG")
    
    return {
        "image": image_path,
        "text": ground_truth,
        "prompt": "<image>\n<|grounding|>Convert the document to markdown."
    }

Important: Always review and clean up auto-extracted ground truth. Garbage in = garbage out.
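
One lightweight way to do that review is to dump the pairs to a JSONL file and correct the text fields by hand. A sketch (the export_for_review helper and the three-page loop are purely illustrative):

import json
from pathlib import Path

def export_for_review(pairs, out_file="training_data/review.jsonl"):
    # One JSON object per line so you can eyeball and fix the "text" field per image
    Path(out_file).parent.mkdir(parents=True, exist_ok=True)
    with open(out_file, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")

pairs = [create_training_pair("your_document.pdf", page_num=i) for i in range(3)]
export_for_review(pairs)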


Step 5: Fine-Tune with Unsloth

Unsloth is the recommended approach — it's 1.4x faster than standard fine-tuning, uses 40% less VRAM, and supports 5x longer context windows.

Install Unsloth

pip install --upgrade unsloth
pip install --upgrade --force-reinstall --no-deps --no-cache-dir unsloth unsloth_zoo

Download the Model

from huggingface_hub import snapshot_download
snapshot_download("unsloth/DeepSeek-OCR-2", local_dir="deepseek_ocr")

Configure LoRA and Training

from unsloth import FastVisionModel
from trl import SFTTrainer
from transformers import TrainingArguments, AutoModel
import os

os.environ["UNSLOTH_WARN_UNINITIALIZED"] = '0'

# Load model
model, tokenizer = FastVisionModel.from_pretrained(
    "./deepseek_ocr",
    load_in_4bit=True,  # Set False if you have 16GB+ VRAM
    auto_model=AutoModel,
    trust_remote_code=True,
    unsloth_force_compile=True,
    use_gradient_checkpointing="unsloth",
)

# Apply LoRA adapters
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Training configuration
training_args = TrainingArguments(
    output_dir="./deepseek_ocr_finetuned",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # Your prepared dataset
    args=training_args,
)

trainer.train()

# Save
model.save_pretrained("./final_model")

What to Expect

Based on community results from fine-tuning on non-English documents:

Metric                       Before Fine-Tuning   After Fine-Tuning   Improvement
Character Error Rate (CER)   1.49–4.19            0.60–0.64           57–86% reduction
Language Understanding       Baseline             +86–88%             Significant

Free option: Unsloth provides a Google Colab notebook, so you can fine-tune without local hardware on Colab's free GPU tier.
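
If you want to track the same metric on your own held-out pages, CER is just character-level edit distance divided by the length of the reference text. A minimal pure-Python sketch (libraries such as jiwer provide the same metric if you'd rather add a dependency):

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

print(cer("Invoice #1234", "Invoice #1284"))  # 1 substitution over 13 chars ≈ 0.077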


Step 6: Run Inference with Your Fine-Tuned Model

from unsloth import FastVisionModel
from transformers import AutoModel

model, tokenizer = FastVisionModel.from_pretrained(
    "./final_model",
    load_in_4bit=False,
    auto_model=AutoModel,
    trust_remote_code=True,
)

prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'new_document.jpg'
output_path = './results'

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True
)

print(res)

Gotchas and Tips

1. Repetition bug: Like Whisper, DeepSeek OCR 2 can sometimes loop and repeat text. When using vLLM, always include the NGramPerReqLogitsProcessor. With Transformers, keep temperature=0.0.

2. Rotated documents: The model handles 90°/180°/270° rotations well, but slight tilts or skews can reduce accuracy. Preprocess with deskewing if your scans aren't clean (see the deskew sketch after this list).

3. VRAM management: With 4-bit quantization + gradient checkpointing via Unsloth, you can fine-tune on a single 8GB GPU. Without quantization, budget 16GB+.

4. Ground truth quality: The single biggest factor in fine-tuning success is the quality of your training labels. Spend time cleaning them — it pays off more than adding more examples.

5. Prompt matters: Use <|grounding|>Convert the document to markdown. for structured output. Use Free OCR. when you just need raw text without layout.
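
On point 2, one simple deskew approach is to estimate the median angle of long, roughly horizontal line segments (text baselines, underlines, table rules) and rotate the page by that amount. A rough OpenCV sketch, not a battle-tested preprocessor; sanity-check the estimated angles on a few of your own scans:

import cv2
import numpy as np

def deskew(image_path: str, output_path: str, max_skew: float = 15.0) -> float:
    """Estimate the dominant near-horizontal line angle and rotate the page to correct it."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)

    # Longer segments only: text baselines, underlines, table rules
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=image.shape[1] // 4, maxLineGap=20)
    if lines is None:
        return 0.0

    angles = []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        if abs(angle) <= max_skew:      # keep near-horizontal segments only
            angles.append(angle)
    if not angles:
        return 0.0

    skew = float(np.median(angles))
    h, w = image.shape[:2]
    # getRotationMatrix2D: positive angle = counter-clockwise (origin at top-left)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), skew, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h),
                             flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    cv2.imwrite(output_path, rotated)
    return skew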


What's Next?

Once you have a fine-tuned model, consider building a pipeline:

PDF Input
  → pdf2image (convert pages)
  → DeepSeek OCR 2 (extract structured text)
  → Post-processing (clean markdown)
  → Vector embeddings (for search/RAG)
  → Storage (pgvector, Pinecone, etc.)

This gives you a fully local, private document processing pipeline — no API calls, no data leaving your servers, and tuned to your exact document types.
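
As a starting point, here's a minimal sketch of the first three stages, reusing the model and tokenizer loaded in Step 6 (the cleaning, embedding, and storage stages are left as placeholders since they depend on your stack):

from pathlib import Path
from pdf2image import convert_from_path

def process_pdf(pdf_path: str, workdir: str = "./pipeline_output") -> str:
    """Convert a PDF to page images, OCR each page, and return one markdown string."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=300)

    markdown_pages = []
    for i, page in enumerate(pages):
        image_path = f"{workdir}/page_{i}.png"
        page.save(image_path, "PNG")
        res = model.infer(                 # model and tokenizer from Step 6
            tokenizer,
            prompt="<image>\n<|grounding|>Convert the document to markdown.",
            image_file=image_path,
            output_path=workdir,
            base_size=1024,
            image_size=768,
            crop_mode=True,
            save_results=True
        )
        markdown_pages.append(str(res))

    markdown = "\n\n".join(markdown_pages)
    # Next stages (not shown): clean the markdown, chunk it, embed the chunks,
    # and write them to your vector store of choice (pgvector, Pinecone, etc.)
    return markdown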


DeepSeek OCR 2 is open-source and available on Hugging Face and GitHub.