How to tune your own LLM with GRPO, Common Crawl and Unsloth

If you’ve been following the AI space at all, you know that fine-tuning your own LLM is no longer a luxury reserved for teams with massive GPU clusters and seven-figure compute budgets. The tooling has matured to a point where you can fine-tune a capable model on your own data, using free resources, in an afternoon.

In this post, I’m going to walk you through how to do exactly that — using Common Crawl as a data source, Unsloth as the training framework, GRPO as the reinforcement learning method, and Google Colab as the compute environment. We’ve been building and deploying fine-tuned models in production since 2019, and this workflow is close to what we actually use when prototyping new specialist models for clients.

Let’s get into it.

The Stack: What We’re Working With

Before we dive in, here’s a quick overview of the tools and why each one matters.

Common Crawl — Your Data Source

Common Crawl is a nonprofit that maintains an open repository of web crawl data — over 300 billion pages spanning nearly two decades. It’s the same data source that was used to build many of the training datasets behind today’s biggest language models, including parts of The Pile, C4, and RefinedWeb. If you need large-scale, diverse text data for training or fine-tuning, this is one of the best places to start.

For our purposes, Common Crawl gives us access to raw web data that we can filter, clean, and shape into a domain-specific training corpus. Want to build a model that understands medical literature? Legal filings? Financial reports? You can extract that from Common Crawl. The data is stored in WARC format on AWS S3, and it’s free to access.
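
To make that concrete, here is a rough sketch of finding and fetching a single record through the public CDX index API. The crawl ID is a placeholder (pick a current one at index.commoncrawl.org), and this is a minimal illustration rather than a production pipeline:

```python
import gzip
import json
import urllib.parse
import urllib.request

# Placeholder crawl ID -- choose a recent crawl from index.commoncrawl.org
CRAWL_ID = "CC-MAIN-2024-10"
INDEX_API = f"https://index.commoncrawl.org/{CRAWL_ID}-index"

def lookup(url_pattern):
    """Query the CDX index for capture records matching a URL pattern."""
    query = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    with urllib.request.urlopen(f"{INDEX_API}?{query}") as resp:
        return [json.loads(line) for line in resp.read().splitlines()]

def record_range(offset, length):
    """Build the HTTP Range header covering one gzipped WARC record."""
    start, end = int(offset), int(offset) + int(length) - 1
    return f"bytes={start}-{end}"

def fetch_record(rec):
    """Download and decompress a single WARC record from the data bucket."""
    req = urllib.request.Request(
        "https://data.commoncrawl.org/" + rec["filename"],
        headers={"Range": record_range(rec["offset"], rec["length"])},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read()).decode("utf-8", errors="replace")
```

Each index hit carries a segment filename plus a byte offset and length, so you download only the slice you need instead of an entire multi-gigabyte WARC file.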

Unsloth — The Fine-Tuning Framework

Unsloth is an open-source framework specifically built for LLM fine-tuning and reinforcement learning. The headline number is that it makes training roughly 2x faster while using about 70% less VRAM compared to standard approaches. That matters a lot when you’re working on a free Colab T4 GPU with 16GB of VRAM — every byte counts.

Unsloth supports a wide range of models including Llama, Gemma, Qwen, DeepSeek, and more. It handles LoRA/QLoRA fine-tuning out of the box, and recently added solid support for reinforcement learning via GRPO. Their notebook ecosystem is also excellent — well-documented, actively maintained, and designed to run on Colab without modifications.

Google Colab — Free Compute

Google Colab gives you access to a T4 GPU for free, which is enough to fine-tune small to mid-sized models (up to around 7B parameters with quantization). It’s not going to replace a proper training cluster, but for prototyping, learning, and building proof-of-concept models, it’s hard to beat. We use Colab regularly for early experimentation before moving to dedicated infrastructure for production runs.
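
The 7B figure isn't arbitrary. Weight memory is roughly parameters times bits per parameter, and a quick back-of-envelope (ignoring activations, KV cache, and optimizer state) shows why quantization is what makes this fit:

```python
def weight_memory_gb(params_billions, bits):
    """Approximate VRAM for model weights alone: params x bits / 8.
    Ignores activations, KV cache, and optimizer state."""
    return params_billions * 1e9 * bits / 8 / 1e9

fp16_gb = weight_memory_gb(7, 16)  # 14.0 GB: too tight on a 16GB T4 once overhead is added
int4_gb = weight_memory_gb(7, 4)   # 3.5 GB: leaves headroom for training state
```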

What Is GRPO and Why Should You Care?

This is where things get interesting. GRPO — Group Relative Policy Optimization — is a reinforcement learning technique developed by DeepSeek for training their R1 reasoning models. If you’ve used or heard of DeepSeek-R1, GRPO is a big part of what makes it tick.

To understand why GRPO matters, it helps to know what came before it. The standard approach to RL for language models has been RLHF (Reinforcement Learning from Human Feedback), which uses PPO (Proximal Policy Optimization). PPO works, but it’s computationally expensive because it requires three separate models running simultaneously: the model you’re training (the policy), a reference model (to prevent drift), and a value model (to estimate how good each action is). On top of that, you need a trained reward model. That’s a lot of moving parts and a lot of VRAM.

GRPO simplifies this significantly. It removes the value model entirely — instead of learning a separate model to estimate value, it generates multiple completions for each prompt and uses the relative quality within that group to compute advantages. Give the model a math problem and it generates, say, 8 different answers. The ones that score higher get reinforced; the ones that score lower get penalized. The “group relative” part means the model is always comparing against its own outputs within each batch, not against some separate learned estimate.
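
The group-relative computation is simple enough to sketch in a few lines. This is an illustrative simplification, not Unsloth's or TRL's actual implementation: each reward is normalized against the mean and standard deviation of its own group.

```python
def group_advantages(rewards):
    """Score each completion relative to its own group: subtract the
    group mean, then scale by the group standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    std = std or 1.0  # if every completion tied, all advantages are zero
    return [(r - mean) / std for r in rewards]

# 8 completions for one prompt: two scored 3.0 (correct), six scored -1.0
rewards = [3.0, -1.0, -1.0, 3.0, -1.0, -1.0, -1.0, -1.0]
advantages = group_advantages(rewards)
# Correct completions get positive advantages and are reinforced;
# the rest get negative advantages and are pushed down.
```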

GRPO also removes the need for a separate reward model. Instead, you define simple, verifiable reward functions — did the model get the math problem right? Did the code compile? Did the output follow the specified format? This is sometimes called RLVR (Reinforcement Learning with Verifiable Rewards). It’s powerful because writing a reward function is much easier than collecting thousands of human preference labels.
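
Here is what “did the code compile?” looks like as a verifiable reward. This is a minimal sketch, not from the notebook, written in the completion format that TRL's GRPO trainer passes to reward functions:

```python
def code_compiles_reward(completions, **kwargs):
    """Verifiable reward: +1.0 if the completion parses as valid Python,
    -1.0 otherwise. No reward model, no preference labels, just a check."""
    scores = []
    for completion in completions:
        source = completion[0]["content"]
        try:
            compile(source, "<completion>", "exec")
            scores.append(1.0)
        except SyntaxError:
            scores.append(-1.0)
    return scores

good = [[{"role": "assistant", "content": "x = 1 + 1"}]]
bad  = [[{"role": "assistant", "content": "def broken(:"}]]
# code_compiles_reward(good) -> [1.0], code_compiles_reward(bad) -> [-1.0]
```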

The practical upside: GRPO uses dramatically less memory and compute than PPO while achieving comparable or better results for tasks where you can define clear reward signals. That’s what makes it feasible to run on a single Colab GPU.

https://arxiv.org/pdf/2402.03300

Putting It All Together: Step by Step

Enough theory. Let’s walk through the actual code. I’m pulling from Unsloth’s Gemma 3 GRPO notebook — you can open it directly in Colab and follow along.

Step 1: Install Unsloth

The first cell in the notebook handles installation:

%%capture
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    pass  # on Colab, use the notebook's own install cell, which pins compatible versions

Step 2: Load the Model

Next, load the base model and configure LoRA adapters. We’re using Gemma 3 1B Instruct here — small enough to fit on a free T4 with room to spare:

from unsloth import FastModel
import torch
max_seq_length = 1024

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-1b-it",
    max_seq_length = max_seq_length,
    load_in_4bit = False,
    load_in_8bit = False,
    full_finetuning = False,
)

Then add LoRA adapters. This is what makes fine-tuning efficient — instead of updating all the model’s parameters, you’re only training a small set of adapter weights:

model = FastModel.get_peft_model(
    model,
    finetune_vision_layers = False,
    finetune_language_layers = True,
    finetune_attention_modules = True,
    finetune_mlp_modules = True,
    r = 8,           # LoRA rank - larger = more capacity
    lora_alpha = 8,  # Recommended: alpha >= r
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)
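
A quick back-of-envelope shows why adapters are so cheap. A rank-r LoRA update replaces a full d_out x d_in weight update with two skinny matrices, B (d_out x r) and A (r x d_in):

```python
def lora_param_counts(d_in, d_out, r):
    """Trainable parameters: full update vs. a rank-r adapter W + B @ A."""
    full = d_in * d_out          # every weight in the original matrix
    lora = r * d_in + d_out * r  # A is (r x d_in), B is (d_out x r)
    return full, lora

# A 2048 x 2048 projection (typical for a small model) at rank 8:
full, lora = lora_param_counts(2048, 2048, 8)
# full = 4,194,304 vs. lora = 32,768 -- under 1% of the weights to train
```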

Step 3: Prepare Your Data

The notebook uses OpenAI’s GSM8K math dataset as an example. For your own project, you’d swap this out with your domain-specific data from Common Crawl or wherever your data lives. The key thing is structuring it as prompts the model can generate completions for:

from datasets import load_dataset
dataset = load_dataset("openai/gsm8k", "main", split="train")

You’ll also want to set up a system prompt that tells the model what format to use. The notebook defines special tokens for reasoning and solution sections:

reasoning_start = "<start_working_out>"
reasoning_end   = "<end_working_out>"
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

system_prompt = f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""

Then map the dataset into the chat format the model expects:

def extract_hash_answer(text):
    # GSM8K solutions end with "#### <answer>"; keep only the final answer
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

dataset = dataset.map(lambda x: {
    "prompt": [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": x["question"]},
    ],
    "answer": extract_hash_answer(x["answer"]),
})

Step 4: Define Reward Functions

This is the heart of GRPO. You need to tell the training loop what “good” looks like. The notebook defines several reward functions that stack together — the model gets points for following the format, and more points for getting the right answer. Two of them are shown here:

import re

# Match a full response: a working-out section followed by a solution,
# capturing the text between the solution tags
match_format = re.compile(
    rf"{reasoning_start}.+?{reasoning_end}.*?"
    rf"{solution_start}(.+?){solution_end}",
    flags = re.DOTALL,
)

def match_format_exactly(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        if match_format.search(response) is not None:
            score += 3.0
        scores.append(score)
    return scores

def check_answer(prompts, completions, answer, **kwargs):
    responses = [c[0]["content"] for c in completions]
    extracted = [
        guess.group(1) if (guess := match_format.search(r)) else None
        for r in responses
    ]
    scores = []
    for guess, true_answer in zip(extracted, answer):
        if guess is None:
            scores.append(0)
            continue
        if guess == true_answer:
            scores.append(3.0)    # Correct answer: 3 points
        elif guess.strip() == true_answer.strip():
            scores.append(1.5)    # Close match: 1.5 points
        else:
            scores.append(-1.0)   # Wrong: penalize
    return scores

Notice what’s happening here: the reward functions are just plain Python. No trained reward model, no human annotators. The model gets 3 points for a correct answer, 1.5 for close, and loses a point for wrong. You can adapt this pattern to any domain — check JSON schema compliance, verify code output, validate extracted fields, whatever your use case needs.

Step 5: Configure and Run Training

Set up the GRPO training configuration and let it run:

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_torch_fused",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1,
    num_generations = 4,     # How many completions per prompt
    max_prompt_length = 256,
    max_completion_length = max_seq_length - 256,
    max_steps = 50,          # Increase for more training
    save_steps = 50,
    max_grad_norm = 0.1,
    report_to = "none",
    output_dir = "outputs",
)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        match_format_exactly,
        match_format_approximately,  # defined in the full notebook
        check_answer,
        check_numbers,               # defined in the full notebook
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

A key parameter here is num_generations — this is how many completions GRPO generates per prompt for the group comparison. 4 is a good starting point on a T4. If you have more VRAM, bump it up to 8 or 16 for better gradient estimates. Watch the reward column in the training log — it should trend upward over time. Don’t panic if it hovers near 0 early on; that’s normal. At 50 steps this config is only a smoke test; raise max_steps for a real run.

Step 6: Test and Save

After training, test the model with a quick inference call:

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "What is the sqrt of 101?"},
]
text = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False,
)
from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens = 64,
    temperature = 1.0,
    top_p = 0.95,
    top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt=True),
)

Save the LoRA adapters locally or push to Hugging Face:

# Save locally
model.save_pretrained("gemma_3_lora")
tokenizer.save_pretrained("gemma_3_lora")

# Or push to Hugging Face
# model.push_to_hub("YOUR_HF_ACCOUNT/gemma_3_lora", token="YOUR_HF_TOKEN")

If you want to deploy with Ollama or llama.cpp, you can export to GGUF format:

# Export to GGUF for use with Ollama / llama.cpp
model.save_pretrained_gguf(
    "gemma_3_finetune",
    tokenizer,
    quantization_method = "Q8_0",  # Q8_0, BF16, F16 supported
)

Adapting This to Your Own Data

The notebook above uses GSM8K as an example, but the real value is when you bring your own data. If you’re pulling from Common Crawl, the workflow looks something like:

1. Use the Common Crawl index to identify pages in your target domain.
2. Download the relevant WARC segments from S3.
3. Extract and clean the text using tools like trafilatura or resiliparse.
4. Structure it into prompt-completion pairs suitable for your task.
5. Define reward functions that measure what matters for your domain.

The reward function design is where you’ll spend most of your iteration time. For a structured extraction task, your reward might check whether the output is valid JSON matching your schema. For a classification task, it might compare against known labels. For a formatting task, it might verify the output follows your specified template. Start simple and add complexity as you see where the model struggles.
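
For instance, here is a sketch of the structured-extraction case, using the same completion format as the GSM8K reward functions above. The required fields are hypothetical stand-ins for your schema:

```python
import json

REQUIRED_FIELDS = {"title", "date", "amount"}  # hypothetical schema

def json_schema_reward(completions, **kwargs):
    """Stacked reward: -1.0 for invalid JSON, +1.0 for valid JSON,
    and +2.0 more when the keys match the schema exactly."""
    scores = []
    for completion in completions:
        text = completion[0]["content"]
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            scores.append(-1.0)   # not even parseable
            continue
        score = 1.0               # parses as JSON
        if isinstance(obj, dict) and set(obj) == REQUIRED_FIELDS:
            score += 2.0          # matches the schema exactly
        scores.append(score)
    return scores
```

Start with coarse checks like these, then tighten them (field types, value ranges) as you discover where the model games the reward.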

Why This Matters

We’ve been doing production AI work since 2019, and the single biggest shift we’ve seen isn’t better models — it’s better accessibility. Three years ago, fine-tuning a language model required serious infrastructure and deep expertise. Today, the combination of tools like Unsloth, free compute from Colab, and efficient training methods like GRPO means that a single engineer can go from idea to working fine-tuned model in a day.

That doesn’t mean it’s trivial. The hard parts — data curation, reward function design, evaluation, deployment — still require real expertise and iteration. But the barrier to entry has dropped to the point where you can genuinely learn by doing. Build a small model. See what it gets wrong. Improve your data or reward function. Train again. That cycle is now measured in hours, not weeks.

Like, Share, and Subscribe ^ _ ^

If you’re a technical team looking to build specialist AI capabilities on your own data, this is a solid starting point. And if you want help going from prototype to production, that’s what we do.