How to Fine-Tune DeepSeek OCR 2 on Your Own PDFs — From Install to Inference
A practical, step-by-step guide to running and fine-tuning DeepSeek's 3B-parameter document understanding model on your local machine.
Why DeepSeek OCR 2?
Released January 27, 2026, DeepSeek OCR 2 isn't your typical OCR tool. Traditional OCR scans documents left-to-right, top-to-bottom — like reading a book one pixel row at a time. DeepSeek OCR 2's breakthrough, DeepEncoder V2, reads documents the way humans do: it builds a global understanding of the page layout first, then follows the natural reading order.
The result? Complex tables, multi-column layouts, math equations, and mixed-format documents are handled with state-of-the-art accuracy — all in a model small enough to run on a single GPU.
What makes it worth fine-tuning:
- Only 3B parameters (runs on 8GB VRAM with quantization)
- Open-source and fully customizable
- Fine-tuning has shown 57–86% reduction in Character Error Rate (CER) for domain-specific documents
- Supports PDF processing out of the box
What You'll Need
Hardware Requirements
| Setup | VRAM | Notes |
|---|---|---|
| 4-bit quantized | 8GB+ | Good for experimentation |
| Full precision | 16GB+ | Best for production fine-tuning |
An NVIDIA GPU with CUDA 11.8+ is required. AMD/ROCm support is still in development as of early 2026.
Software Requirements
- Python 3.12.9+
- CUDA 11.8+
- Git
Step 1: Install DeepSeek OCR 2
Clone the Repository
git clone https://github.com/deepseek-ai/DeepSeek-OCR-2.git
cd DeepSeek-OCR-2
Create Your Environment
conda create -n deepseek-ocr2 python=3.12.9 -y
conda activate deepseek-ocr2
Install Dependencies
# PyTorch with CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
# vLLM (download the 0.8.5 whl from https://github.com/vllm-project/vllm/releases/tag/v0.8.5)
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
# Project requirements
pip install -r requirements.txt
# Flash Attention (critical for performance)
pip install flash-attn==2.7.3 --no-build-isolation
Note: You may see an installation warning along the lines of vllm 0.8.5+cu118 requires transformers>=4.51.1. If you're running both vLLM and Transformers inference in the same environment, this warning is safe to ignore.
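Before moving on, it's worth verifying that PyTorch actually sees your GPU and that Flash Attention imports cleanly. A minimal sanity-check sketch, run inside the activated environment:
import torch

# Confirm the CUDA build, the visible GPU, and how much VRAM it has
print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))
print(f"{torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GiB VRAM")

# Flash Attention should import without errors
import flash_attn
print(flash_attn.__version__)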
Step 2: Run Your First Inference
Before fine-tuning, let's verify everything works by running the base model on a test document.
Option A: Using Transformers (Simplest)
from transformers import AutoModel, AutoTokenizer
import torch
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR-2'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
model_name,
_attn_implementation='flash_attention_2',
trust_remote_code=True,
use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)
# Convert a document image to structured markdown
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'test_document.jpg'
output_path = './output'
res = model.infer(
tokenizer,
prompt=prompt,
image_file=image_file,
output_path=output_path,
base_size=1024,
image_size=768,
crop_mode=True,
save_results=True
)
Option B: Using vLLM (Faster, Production-Ready)
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image
llm = LLM(
model="deepseek-ai/DeepSeek-OCR-2",
enable_prefix_caching=False,
mm_processor_cache_gb=0,
logits_processors=[NGramPerReqLogitsProcessor]
)
image = Image.open("test_document.png").convert("RGB")
prompt = "<image>\n<|grounding|>Convert the document to markdown."
model_input = [
{"prompt": prompt, "multi_modal_data": {"image": image}}
]
sampling_params = SamplingParams(
temperature=0.0,
max_tokens=8192,
extra_args=dict(
ngram_size=30,
window_size=90,
whitelist_token_ids={128821, 128822}, # <td>, </td>
),
skip_special_tokens=False,
)
outputs = llm.generate(model_input, sampling_params)
print(outputs[0].outputs[0].text)
Pro tip: The NGramPerReqLogitsProcessor prevents a known repetition issue where the model can loop on the same text (similar to Whisper's failure mode). Always include it.
Supported Prompts
| Mode | Prompt |
|---|---|
| Structured document → Markdown | <image>\n<|grounding|>Convert the document to markdown. |
| Free text extraction (no layout) | <image>\nFree OCR. |
| Parse figures/charts | <image>\nParse the figure. |
| General image description | <image>\nDescribe this image in detail. |
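For example, to pull raw text without any layout structure, reuse the Option A setup and swap in the Free OCR prompt (a minimal sketch, assuming model and tokenizer are already loaded as above):
# Free-text extraction: same infer call as Option A, different prompt
res = model.infer(
    tokenizer,
    prompt="<image>\nFree OCR.",
    image_file='test_document.jpg',
    output_path='./output',
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True
)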
Step 3: Process PDFs
DeepSeek OCR 2 includes a built-in PDF processing pipeline with concurrent page handling.
Using the Built-in PDF Script
cd DeepSeek-OCR2-vllm
Edit config.py to set your paths:
INPUT_PATH = "/path/to/your/pdfs"
OUTPUT_PATH = "/path/to/output"
Then run:
python run_dpsk_ocr2_pdf.py
This handles multi-page PDFs with concurrent processing, running at speeds comparable to the original DeepSeek OCR.
DIY: Convert PDFs to Images First
If you prefer more control, convert PDFs to images and process them individually:
from pdf2image import convert_from_path

# Convert PDF pages to images, then run inference on each page
pages = convert_from_path('your_document.pdf', dpi=300)
for i, page in enumerate(pages):
    image_path = f'page_{i}.png'
    page.save(image_path, 'PNG')
    res = model.infer(
        tokenizer,
        prompt="<image>\n<|grounding|>Convert the document to markdown.",
        image_file=image_path,
        output_path='./output',
        base_size=1024,
        image_size=768,
        crop_mode=True,
        save_results=True
    )
Step 4: Prepare Your Fine-Tuning Dataset
This is where the real value comes in. Fine-tuning lets you adapt the model to your specific document types — invoices, medical records, legal contracts, non-English documents, or anything with a consistent format.
What You Need
For each training example, you need:
- An image of the document (or a page from a PDF)
- The expected output (the correct text/markdown you want the model to produce)
Dataset Format
dataset = [
{
"image": "path/to/document_001.jpg",
"text": "# Invoice\n\n| Item | Qty | Price |\n|------|-----|-------|\n| Widget A | 10 | $5.00 |",
"prompt": "<image>\n<|grounding|>Convert the document to markdown."
},
{
"image": "path/to/document_002.jpg",
"text": "## Contract Agreement\n\nThis agreement is entered into on...",
"prompt": "<image>\n<|grounding|>Convert the document to markdown."
},
# ... more examples
]
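Depending on your Unsloth version, the trainer in Step 5 may expect conversation-style records rather than the flat dicts above; the reshaped list is what you'd pass as train_dataset. A minimal conversion sketch modeled on Unsloth's vision-model notebooks — the exact field layout is an assumption, so check the notebook that matches your install:
from PIL import Image

def to_conversation(sample):
    # One user turn (prompt + page image) and one assistant turn (expected output).
    # Note: some chat templates insert the image token themselves, in which case
    # you should drop the leading "<image>\n" from the prompt text here.
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": Image.open(sample["image"]).convert("RGB")},
                {"type": "text", "text": sample["prompt"]},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": sample["text"]},
            ]},
        ]
    }

train_dataset = [to_conversation(s) for s in dataset]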
How Many Examples?
Start with 100–500 high-quality examples. Quality matters more than quantity — focus on:
- Diverse layouts within your domain
- Accurate ground truth text
- Edge cases (rotated pages, poor scan quality, mixed languages)
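If you don't have real examples of tilted or low-quality pages, you can synthesize a few from clean scans; the ground-truth text stays the same, only the image is degraded. A minimal augmentation sketch using Pillow (the angle and blur radius are arbitrary starting points):
from PIL import Image, ImageEnhance, ImageFilter

def augment_page(image_path, out_prefix):
    page = Image.open(image_path).convert("RGB")
    # Slight tilt, as seen in hand-fed scans
    page.rotate(2, expand=True, fillcolor="white").save(f"{out_prefix}_tilted.png")
    # Mild blur plus reduced contrast to mimic a poor-quality scan
    degraded = ImageEnhance.Contrast(page.filter(ImageFilter.GaussianBlur(1))).enhance(0.7)
    degraded.save(f"{out_prefix}_degraded.png")

augment_page("path/to/document_001.jpg", "training_data/document_001")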
Creating Ground Truth from PDFs
If you have digital PDFs (with selectable text), you can bootstrap your dataset:
import os
import fitz  # PyMuPDF
from pdf2image import convert_from_path

def create_training_pair(pdf_path, page_num=0):
    # Extract the embedded text layer as ground truth
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    ground_truth = page.get_text("text")
    doc.close()
    # Render the same page to an image (pdf2image pages are 1-indexed)
    os.makedirs("training_data", exist_ok=True)
    images = convert_from_path(pdf_path, first_page=page_num + 1, last_page=page_num + 1, dpi=300)
    image_path = f"training_data/page_{page_num}.png"
    images[0].save(image_path, "PNG")
    return {
        "image": image_path,
        "text": ground_truth,
        "prompt": "<image>\n<|grounding|>Convert the document to markdown."
    }
Important: Always review and clean up auto-extracted ground truth. Garbage in = garbage out.
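A quick automated pass catches the worst offenders before you review by hand. A minimal sketch that drops pages where extraction obviously failed (the thresholds are arbitrary and worth tuning for your documents):
def looks_suspicious(example, min_chars=50):
    # Flag empty or near-empty extractions and pages dominated by non-printable junk
    text = example["text"].strip()
    if len(text) < min_chars:
        return True
    printable = sum(ch.isprintable() or ch in "\n\t" for ch in text)
    return printable / len(text) < 0.95

dataset = [ex for ex in dataset if not looks_suspicious(ex)]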
Step 5: Fine-Tune with Unsloth
Unsloth is the recommended approach — it's 1.4x faster than standard fine-tuning, uses 40% less VRAM, and supports 5x longer context windows.
Install Unsloth
pip install --upgrade unsloth
pip install --upgrade --force-reinstall --no-deps --no-cache-dir unsloth unsloth_zoo
Download the Model
from huggingface_hub import snapshot_download
snapshot_download("unsloth/DeepSeek-OCR-2", local_dir="deepseek_ocr")
Configure LoRA and Training
from unsloth import FastVisionModel
from trl import SFTTrainer
from transformers import TrainingArguments, AutoModel
import os
os.environ["UNSLOTH_WARN_UNINITIALIZED"] = '0'
# Load model
model, tokenizer = FastVisionModel.from_pretrained(
"./deepseek_ocr",
load_in_4bit=True, # Set False if you have 16GB+ VRAM
auto_model=AutoModel,
trust_remote_code=True,
unsloth_force_compile=True,
use_gradient_checkpointing="unsloth",
)
# Apply LoRA adapters
model = FastVisionModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# Training configuration
training_args = TrainingArguments(
output_dir="./deepseek_ocr_finetuned",
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=3,
learning_rate=2e-4,
fp16=True,
logging_steps=10,
save_strategy="epoch",
)
# Train
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=train_dataset, # Your prepared dataset
args=training_args,
)
trainer.train()
# Save
model.save_pretrained("./final_model")
What to Expect
Based on community results from fine-tuning on non-English documents:
| Metric | Before Fine-Tuning | After Fine-Tuning | Improvement |
|---|---|---|---|
| Character Error Rate (CER) | 1.49–4.19 | 0.60–0.64 | 57–86% reduction |
| Language Understanding | Baseline | +86–88% | Significant |
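To check whether fine-tuning actually helped on your documents, measure CER yourself on a held-out set you didn't train on. A minimal sketch using plain character-level edit distance (the file names are placeholders; libraries such as jiwer compute the same metric):
def cer(reference, hypothesis):
    # Character Error Rate = character-level edit distance / reference length
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1] / max(len(reference), 1)

print(cer(open("ground_truth.md").read(), open("model_output.md").read()))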
Free option: Unsloth provides a Google Colab notebook, so you can experiment with fine-tuning at no cost on Colab's free GPU tier.
Step 6: Run Inference with Your Fine-Tuned Model
from unsloth import FastVisionModel
from transformers import AutoModel
model, tokenizer = FastVisionModel.from_pretrained(
"./final_model",
load_in_4bit=False,
auto_model=AutoModel,
trust_remote_code=True,
)
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'new_document.jpg'
output_path = './results'
res = model.infer(
tokenizer,
prompt=prompt,
image_file=image_file,
output_path=output_path,
base_size=1024,
image_size=768,
crop_mode=True,
save_results=True
)
print(res)
Gotchas and Tips
1. Repetition bug: Like Whisper, DeepSeek OCR 2 can sometimes loop and repeat text. When using vLLM, always include the NGramPerReqLogitsProcessor. With Transformers, keep temperature=0.0.
2. Rotated documents: The model handles 90°/180°/270° rotations well, but slight tilts or skews can reduce accuracy. Preprocess with deskewing if your scans aren't clean (a minimal deskew sketch follows after this list).
3. VRAM management: With 4-bit quantization + gradient checkpointing via Unsloth, you can fine-tune on a single 8GB GPU. Without quantization, budget 16GB+.
4. Ground truth quality: The single biggest factor in fine-tuning success is the quality of your training labels. Spend time cleaning them — it pays off more than adding more examples.
5. Prompt matters: Use <|grounding|>Convert the document to markdown. for structured output. Use Free OCR. when you just need raw text without layout.
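A minimal deskew sketch using only Pillow and NumPy, via a brute-force projection-profile search (OpenCV-based deskewing works too). Treat it as an illustration rather than a tuned preprocessor:
import numpy as np
from PIL import Image

def deskew(image_path, output_path, max_angle=5.0, step=0.5):
    # Projection-profile deskew: the rotation that makes text lines horizontal
    # maximizes the variance of row-wise ink counts.
    gray = Image.open(image_path).convert("L")
    small = gray.resize((gray.width // 4, gray.height // 4))  # downscale to speed up the search
    binary = (np.array(small) < 128).astype(np.uint8) * 255   # 255 wherever there is ink
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = Image.fromarray(binary).rotate(angle, fillcolor=0)
        score = np.asarray(rotated, dtype=np.float64).sum(axis=1).var()
        if score > best_score:
            best_angle, best_score = angle, score
    Image.open(image_path).convert("RGB").rotate(best_angle, expand=True, fillcolor="white").save(output_path)

deskew("scan_raw.png", "scan_deskewed.png")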
What's Next?
Once you have a fine-tuned model, consider building a pipeline:
PDF Input
→ pdf2image (convert pages)
→ DeepSeek OCR 2 (extract structured text)
→ Post-processing (clean markdown)
→ Vector embeddings (for search/RAG)
→ Storage (pgvector, Pinecone, etc.)
This gives you a fully local, private document processing pipeline — no API calls, no data leaving your servers, and tuned to your exact document types.
DeepSeek OCR 2 is open-source and available on Hugging Face and GitHub.