How to Fine-Tune DeepSeek OCR V2 on Your Own PDFs — From Install to Inference


A practical, step-by-step guide to running and fine-tuning DeepSeek's 3B-parameter document understanding model on your local machine.


Why DeepSeek OCR V2?

Released January 27, 2026, DeepSeek OCR 2 isn't your typical OCR tool. Traditional OCR scans documents left-to-right, top-to-bottom — like reading a book one pixel row at a time. DeepSeek OCR 2's breakthrough, DeepEncoder V2, reads documents the way humans do: it builds a global understanding of the page layout first, then follows the natural reading order.

The result? Complex tables, multi-column layouts, math equations, and mixed-format documents are handled with state-of-the-art accuracy — all in a model small enough to run on a single GPU.

What makes it worth fine-tuning:

  • Only 3B parameters (runs on 8GB VRAM with quantization)
  • Open-source and fully customizable
  • Fine-tuning has shown 57–86% reduction in Character Error Rate (CER) for domain-specific documents
  • Supports PDF processing out of the box

What You'll Need

Hardware Requirements

Setup              VRAM    Notes
4-bit quantized    8GB+    Good for experimentation
Full precision     16GB+   Best for production fine-tuning

An NVIDIA GPU with CUDA 11.8+ is required. AMD/ROCm support is still in development as of early 2026.

Software Requirements

  • Python 3.12.9+
  • CUDA 11.8+
  • Git
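
Before installing anything, it's worth a quick check that the basics are in place (note that nvcc only appears if the CUDA toolkit itself is installed, not just the driver):

nvidia-smi          # GPU model, driver version, free VRAM
nvcc --version      # CUDA toolkit version (should report 11.8+)
python --version    # 3.12.9+
git --version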

Step 1: Install DeepSeek OCR 2

Clone the Repository

git clone https://github.com/deepseek-ai/DeepSeek-OCR-2.git
cd DeepSeek-OCR-2

Create Your Environment

conda create -n deepseek-ocr2 python=3.12.9 -y
conda activate deepseek-ocr2

Install Dependencies

# PyTorch with CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118

# vLLM (download the 0.8.5 whl from https://github.com/vllm-project/vllm/releases/tag/v0.8.5)
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl

# Project requirements
pip install -r requirements.txt

# Flash Attention (critical for performance)
pip install flash-attn==2.7.3 --no-build-isolation

Note: You may see an installation warning that vllm 0.8.5+cu118 requires transformers>=4.51.1. If you're running both vLLM and Transformers inference in the same environment, this warning is safe to ignore.
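
With the dependencies installed, a quick sanity check that PyTorch can actually see your GPU:

import torch

# Should print your torch/CUDA versions, True, and the name of your NVIDIA GPU
print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))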


Step 2: Run Your First Inference

Before fine-tuning, let's verify everything works by running the base model on a test document.

Option A: Using Transformers (Simplest)

from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR-2'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)

# Convert a document image to structured markdown
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'test_document.jpg'
output_path = './output'

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True
)

Option B: Using vLLM (Faster, Production-Ready)

from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR-2",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

image = Image.open("test_document.png").convert("RGB")
prompt = "<image>\n<|grounding|>Convert the document to markdown."

model_input = [
    {"prompt": prompt, "multi_modal_data": {"image": image}}
]

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td>
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(model_input, sampling_params)
print(outputs[0].outputs[0].text)

Pro tip: The NGramPerReqLogitsProcessor prevents a known repetition issue where the model can loop on the same text (similar to Whisper's failure mode). Always include it.

Supported Prompts

Mode                               Prompt
Structured document → Markdown     <image>\n<|grounding|>Convert the document to markdown.
Free text extraction (no layout)   <image>\nFree OCR.
Parse figures/charts               <image>\nParse the figure.
General image description          <image>\nDescribe this image in detail.
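
For example, to pull raw text with no layout reconstruction, reuse the Option A setup and swap only the prompt:

# Same model and tokenizer as Option A; only the prompt changes
res = model.infer(
    tokenizer,
    prompt="<image>\nFree OCR.",   # raw text, no markdown structure
    image_file='test_document.jpg',
    output_path='./output',
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True
)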

Step 3: Process PDFs

DeepSeek OCR 2 includes a built-in PDF processing pipeline with concurrent page handling.

Using the Built-in PDF Script

cd DeepSeek-OCR2-vllm

Edit config.py to set your paths:

INPUT_PATH = "/path/to/your/pdfs"
OUTPUT_PATH = "/path/to/output"

Then run:

python run_dpsk_ocr2_pdf.py

This handles multi-page PDFs with concurrent processing, running at speeds comparable to the original DeepSeek OCR.

DIY: Convert PDFs to Images First

If you prefer more control, convert PDFs to images and process them individually:

from pdf2image import convert_from_path

# Convert PDF pages to images
pages = convert_from_path('your_document.pdf', dpi=300)

for i, page in enumerate(pages):
    page.save(f'page_{i}.png', 'PNG')
    
    # Then run inference on each page
    res = model.infer(
        tokenizer,
        prompt="<image>\n<|grounding|>Convert the document to markdown.",
        image_file=f'page_{i}.png',
        output_path='./output',
        base_size=1024,
        image_size=768,
        crop_mode=True,
        save_results=True
    )

Step 4: Prepare Your Fine-Tuning Dataset

This is where the real value comes in. Fine-tuning lets you adapt the model to your specific document types — invoices, medical records, legal contracts, non-English documents, or anything with a consistent format.

What You Need

For each training example, you need:

  1. An image of the document (or a page from a PDF)
  2. The expected output (the correct text/markdown you want the model to produce)

Dataset Format

dataset = [
    {
        "image": "path/to/document_001.jpg",
        "text": "# Invoice\n\n| Item | Qty | Price |\n|------|-----|-------|\n| Widget A | 10 | $5.00 |",
        "prompt": "<image>\n<|grounding|>Convert the document to markdown."
    },
    {
        "image": "path/to/document_002.jpg",
        "text": "## Contract Agreement\n\nThis agreement is entered into on...",
        "prompt": "<image>\n<|grounding|>Convert the document to markdown."
    },
    # ... more examples
]

How Many Examples?

Start with 100–500 high-quality examples. Quality matters more than quantity — focus on:

  • Diverse layouts within your domain
  • Accurate ground truth text
  • Edge cases (rotated pages, poor scan quality, mixed languages)
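
However many examples you end up with, hold some back for evaluation so you can measure the improvement later. A minimal sketch, assuming the list-of-dicts format shown under Dataset Format (train_dataset here is what you'll hand to the trainer in Step 5):

import random

# Shuffle and hold out ~10% of examples for evaluation
random.seed(42)
random.shuffle(dataset)

split = int(len(dataset) * 0.9)
train_dataset = dataset[:split]
eval_dataset = dataset[split:]

print(f"train: {len(train_dataset)}, eval: {len(eval_dataset)}")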

Creating Ground Truth from PDFs

If you have digital PDFs (with selectable text), you can bootstrap your dataset:

import os
import fitz  # PyMuPDF
from pdf2image import convert_from_path

def create_training_pair(pdf_path, page_num=0):
    # Extract text as ground truth
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    ground_truth = page.get_text("text")
    
    # Convert page to image
    images = convert_from_path(pdf_path, first_page=page_num+1, last_page=page_num+1, dpi=300)
    os.makedirs("training_data", exist_ok=True)
    image_path = f"training_data/page_{page_num}.png"
    images[0].save(image_path, "PNG")
    
    return {
        "image": image_path,
        "text": ground_truth,
        "prompt": "<image>\n<|grounding|>Convert the document to markdown."
    }

Important: Always review and clean up auto-extracted ground truth. Garbage in = garbage out.
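
One lightweight way to do that review is to dump the pairs to a JSONL file and correct the text fields by hand. A sketch (the export_for_review helper and the three-page loop are purely illustrative):

import json
from pathlib import Path

def export_for_review(pairs, out_file="training_data/review.jsonl"):
    # One JSON object per line so you can eyeball and fix the "text" field per image
    Path(out_file).parent.mkdir(parents=True, exist_ok=True)
    with open(out_file, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")

pairs = [create_training_pair("your_document.pdf", page_num=i) for i in range(3)]
export_for_review(pairs)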


Step 5: Fine-Tune with Unsloth

Unsloth is the recommended approach — it's 1.4x faster than standard fine-tuning, uses 40% less VRAM, and supports 5x longer context windows.

Install Unsloth

pip install --upgrade unsloth
pip install --upgrade --force-reinstall --no-deps --no-cache-dir unsloth unsloth_zoo

Download the Model

from huggingface_hub import snapshot_download
snapshot_download("unsloth/DeepSeek-OCR-2", local_dir="deepseek_ocr")

Configure LoRA and Training

from unsloth import FastVisionModel
from trl import SFTTrainer
from transformers import TrainingArguments, AutoModel
import os

os.environ["UNSLOTH_WARN_UNINITIALIZED"] = '0'

# Load model
model, tokenizer = FastVisionModel.from_pretrained(
    "./deepseek_ocr",
    load_in_4bit=True,  # Set False if you have 16GB+ VRAM
    auto_model=AutoModel,
    trust_remote_code=True,
    unsloth_force_compile=True,
    use_gradient_checkpointing="unsloth",
)

# Apply LoRA adapters
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Training configuration
training_args = TrainingArguments(
    output_dir="./deepseek_ocr_finetuned",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # Your prepared dataset
    args=training_args,
)

trainer.train()

# Save
model.save_pretrained("./final_model")

What to Expect

Based on community results from fine-tuning on non-English documents:

Metric                       Before Fine-Tuning   After Fine-Tuning   Improvement
Character Error Rate (CER)   1.49–4.19            0.60–0.64           57–86% reduction
Language Understanding       Baseline             +86–88%             Significant

Free option: Unsloth provides a Google Colab notebook, so you can fine-tune without local hardware on Colab's free GPU tier.
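
If you want to track the same metric on your own held-out pages, CER is just character-level edit distance divided by the length of the reference text. A minimal pure-Python sketch (libraries such as jiwer provide the same metric if you'd rather add a dependency):

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

print(cer("Invoice #1234", "Invoice #1284"))  # 1 substitution over 13 chars ≈ 0.077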


Step 6: Run Inference with Your Fine-Tuned Model

from unsloth import FastVisionModel
from transformers import AutoModel

model, tokenizer = FastVisionModel.from_pretrained(
    "./final_model",
    load_in_4bit=False,
    auto_model=AutoModel,
    trust_remote_code=True,
)

prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'new_document.jpg'
output_path = './results'

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True
)

print(res)

Gotchas and Tips

1. Repetition bug: Like Whisper, DeepSeek OCR 2 can sometimes loop and repeat text. When using vLLM, always include the NGramPerReqLogitsProcessor. With Transformers, keep temperature=0.0.

2. Rotated documents: The model handles 90°/180°/270° rotations well, but slight tilts or skews can reduce accuracy. Preprocess with deskewing if your scans aren't clean (see the deskew sketch after this list).

3. VRAM management: With 4-bit quantization + gradient checkpointing via Unsloth, you can fine-tune on a single 8GB GPU. Without quantization, budget 16GB+.

4. Ground truth quality: The single biggest factor in fine-tuning success is the quality of your training labels. Spend time cleaning them — it pays off more than adding more examples.

5. Prompt matters: Use <|grounding|>Convert the document to markdown. for structured output. Use Free OCR. when you just need raw text without layout.
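
On point 2, one simple deskew approach is to estimate the median angle of long, roughly horizontal line segments (text baselines, underlines, table rules) and rotate the page by that amount. A rough OpenCV sketch, not a battle-tested preprocessor; sanity-check the estimated angles on a few of your own scans:

import cv2
import numpy as np

def deskew(image_path: str, output_path: str, max_skew: float = 15.0) -> float:
    """Estimate the dominant near-horizontal line angle and rotate the page to correct it."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)

    # Longer segments only: text baselines, underlines, table rules
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=image.shape[1] // 4, maxLineGap=20)
    if lines is None:
        return 0.0

    angles = []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        if abs(angle) <= max_skew:      # keep near-horizontal segments only
            angles.append(angle)
    if not angles:
        return 0.0

    skew = float(np.median(angles))
    h, w = image.shape[:2]
    # getRotationMatrix2D: positive angle = counter-clockwise (origin at top-left)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), skew, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h),
                             flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    cv2.imwrite(output_path, rotated)
    return skew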


What's Next?

Once you have a fine-tuned model, consider building a pipeline:

PDF Input
  → pdf2image (convert pages)
  → DeepSeek OCR 2 (extract structured text)
  → Post-processing (clean markdown)
  → Vector embeddings (for search/RAG)
  → Storage (pgvector, Pinecone, etc.)

This gives you a fully local, private document processing pipeline — no API calls, no data leaving your servers, and tuned to your exact document types.
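
As a starting point, here's a minimal sketch of the first three stages, reusing the model and tokenizer loaded in Step 6 (the cleaning, embedding, and storage stages are left as placeholders since they depend on your stack):

from pathlib import Path
from pdf2image import convert_from_path

def process_pdf(pdf_path: str, workdir: str = "./pipeline_output") -> str:
    """Convert a PDF to page images, OCR each page, and return one markdown string."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=300)

    markdown_pages = []
    for i, page in enumerate(pages):
        image_path = f"{workdir}/page_{i}.png"
        page.save(image_path, "PNG")
        res = model.infer(                 # model and tokenizer from Step 6
            tokenizer,
            prompt="<image>\n<|grounding|>Convert the document to markdown.",
            image_file=image_path,
            output_path=workdir,
            base_size=1024,
            image_size=768,
            crop_mode=True,
            save_results=True
        )
        markdown_pages.append(str(res))

    markdown = "\n\n".join(markdown_pages)
    # Next stages (not shown): clean the markdown, chunk it, embed the chunks,
    # and write them to your vector store of choice (pgvector, Pinecone, etc.)
    return markdown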


DeepSeek OCR 2 is open-source and available on Hugging Face and GitHub.