Chinese AI company DeepSeek made headlines again this year when it adopted GRPO, a reinforcement learning (RL) technique, to lessen human involvement in training, slash bottlenecks, and still build powerful models. But what is RL?
Think Pavlov’s dogs, but applied to LLMs, with the humans taken out of the loop. The Russian physiologist conditioned dogs to salivate by pairing the sound of a bell with the presentation of food. Similarly, AI agents learn to make decisions by interacting with an environment to maximise rewards, the reward being a feedback signal that indicates success.
This blog breaks down different RL approaches, distinguishes between pre-training and post-training, then explains Reinforcement Learning from Human Feedback (RLHF), referencing PPO, GAE, the critic (the value function) and GRPO.
FLock’s research team is exploring trends in a series of educational blogs to help our community stay ahead. Our last explainer was on model distillation and data synthesis – stay tuned for more!
Key terms in RL
Here’s a glossary of key terms to get your head around first:
- Agent: The decision-maker (in this case, the LLM).
- Environment: The world the agent interacts with and gets feedback from.
- Actions: Choices the agent makes.
- Rewards: Feedback signals indicating success.
- Policy: The strategy the agent follows to decide its actions.
- PPO: Proximal Policy Optimisation, a widely used RL algorithm covered in detail below.
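To see how these pieces fit together before any learning algorithm is involved, here is a toy sketch of the agent–environment loop in Python (the environment, actions, and reward below are invented for illustration):

```python
import random

# A made-up environment: action "B" pays off more often than "A".
def environment_step(action):
    return 1 if (action == "B" and random.random() < 0.8) else 0  # reward signal

# A made-up policy: the agent's current strategy (here, just random choice).
def policy():
    return random.choice(["A", "B"])

total_reward = 0
for _ in range(100):                   # the agent repeatedly interacts with the environment
    action = policy()                  # the agent picks an action
    reward = environment_step(action)  # the environment returns a reward
    total_reward += reward             # an RL algorithm would use this feedback to improve the policy

print("Total reward over 100 steps:", total_reward)
```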
Different RL approaches
There are a few RL algorithms that help AI learn effectively:
1. Q-Learning
Q-Learning is like maintaining a scoreboard for each action: it learns the expected reward for taking a specific action in a given state. The update rule is straightforward but powerful:

Q(s, a) ← Q(s, a) + α * [r + γ * max_a' Q(s', a') - Q(s, a)]

(α: learning rate, γ: discount factor that weights future rewards)
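As a concrete illustration, here is a toy tabular Q-Learning update in Python (the states, actions, and reward below are invented for the example):

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9      # learning rate and future-reward weight
Q = defaultdict(float)       # the "scoreboard": Q[(state, action)] -> expected reward
actions = ["left", "right"]

def q_update(state, action, reward, next_state):
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

q_update(state="s0", action="right", reward=1.0, next_state="s1")
print(Q[("s0", "right")])    # 0.1 after a single update
```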
2. Deep Q Networks (DQNs)
DQNs extend Q-Learning with neural networks. Instead of a simple lookup table, a neural network approximates the Q-values, which lets RL scale to enormous state spaces and makes it possible to apply RL to tasks like mastering Atari games.
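A minimal sketch of the idea in PyTorch: the lookup table is replaced by a small neural network that maps a state to one Q-value per action (the state size and layer widths here are arbitrary choices):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action, replacing the Q-table."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.randn(1, 4)               # a dummy state
q_values = q_net(state)                 # estimated reward for each action
action = q_values.argmax(dim=1).item()  # pick the highest-scoring action
```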
3. Policy Gradients and PPO
Policy gradients optimize the policy directly, rather than the value function. PPO (Proximal Policy Optimization) is a more stable, efficient version that adjusts the policy while avoiding drastic changes.
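As a rough illustration of the policy-gradient idea (a plain REINFORCE-style step rather than full PPO), the loss below increases the probability of actions that earned high rewards; all numbers are dummies:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(3, 5, requires_grad=True)  # policy outputs for 3 sampled actions over 5 choices
actions = torch.tensor([1, 0, 4])               # the actions that were actually taken
rewards = torch.tensor([1.0, 0.2, -0.5])        # rewards observed for those actions

log_probs = F.log_softmax(logits, dim=-1)       # log-probabilities under the current policy
chosen = log_probs[torch.arange(3), actions]    # log-prob of each taken action
loss = -(chosen * rewards).mean()               # policy-gradient loss: reward-weighted log-likelihood
loss.backward()                                 # gradients push up high-reward actions
```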
Two phases of LLMs learning: pre- and post-training
LLMs undergo two key phases in their development: pre-training and post-training.
Phase 1: Learning the basics (pre-training)
Pre-training is the first phase in developing a large language model (LLM), where it learns from trillions of tokens to build a broad understanding of language and knowledge. This process is hugely computationally expensive, often costing millions of dollars, as it requires vast datasets and specialised hardware.
While just the starting point, pre-training is essential for shaping an AI’s capabilities before fine-tuning and reinforcement learning refine its performance.
Phase 2: Refinement through feedback (post-training)
Once the model has absorbed language patterns, it needs to be refined. Here’s where two key techniques come in:
Supervised fine-tuning (SFT):
Think of this as private tutoring. Experts provide high-quality responses, and the model is trained to mimic them.
This is the first stage of post-training. The model is still trained to predict the next token, but now on a smaller, carefully curated set of expert demonstrations rather than raw web text. By the end of this stage, the model has learned to replicate expert-like responses.
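Under the hood, SFT is ordinary next-token prediction with a cross-entropy loss on curated responses. A minimal sketch using Hugging Face Transformers, with a placeholder model and a single made-up example standing in for an expert dataset:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; in practice this would be the pre-trained base model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One curated prompt/response pair standing in for an expert dataset.
example = "Q: What is reinforcement learning?\nA: Learning from rewards by trial and error."
batch = tokenizer(example, return_tensors="pt")

# Standard next-token prediction: labels are the input tokens, shifted internally by the model.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
```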
This would be the ideal way to learn if we had unlimited high-quality expert data. But usually we don’t. Enter RLHF.
Reinforcement Learning from Human Feedback (RLHF):
Since expert data is limited, the model is further trained using human ratings to improve its responses.
Reinforcement Learning from Human Feedback (RLHF)
Limited expert data? This is where Reinforcement Learning from Human Feedback (RLHF) steps in. Here’s how it works:
- The model generates multiple responses for a prompt.
- Humans rank the responses based on quality.
- A reward model learns to predict these rankings.
- Reinforcement Learning (e.g., PPO) fine-tunes the model to improve future outputs.
By integrating RLHF, LLMs become more aligned with human preferences, producing nuanced, high-quality responses that make them more reliable for complex tasks.
The reward model teaches good judgment
Just like a student needs feedback, LLMs rely on a reward model to evaluate their answers. However, instead of grading every single response manually (which would take forever), a smaller dataset of human ratings is used to train a model that can predict human preferences at scale.
The reward for a partial response is always 0; only complete responses receive a non-zero score from the reward model.
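Reward models of this kind are typically trained on pairs of responses where humans preferred one over the other, using a pairwise (Bradley–Terry style) loss that pushes the preferred response’s score above the rejected one’s. A minimal sketch with dummy scores standing in for the reward model’s outputs:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_chosen, score_rejected):
    """Push the chosen response's score above the rejected one's."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy scalar scores standing in for reward_model(prompt, response) outputs.
score_chosen = torch.tensor([1.2, 0.3], requires_grad=True)
score_rejected = torch.tensor([0.4, 0.9], requires_grad=True)

loss = pairwise_reward_loss(score_chosen, score_rejected)
loss.backward()   # gradients widen the gap between preferred and rejected responses
```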
PPO refines LLMs through trial, error and feedback
After the LLM has been pre-trained and fine-tuned on curated data, it still requires further refinement.
Proximal Policy Optimisation (PPO) is an RL technique that helps LLMs improve through trial, error, and feedback.
Unlike traditional reinforcement learning methods, PPO is designed to be both stable and efficient. Instead of making drastic changes to the model’s behaviour after each update, PPO applies small, controlled adjustments to avoid overcorrection and maintain performance.
How PPO works
PPO operates in a loop designed to iteratively enhance the model’s responses:
- Generating responses
The LLM produces multiple answers to a given prompt.
- Scoring with the reward model
A separate reward model evaluates these responses, ranking them based on their quality and alignment with human preferences.
- Adjusting based on advantage estimation
Using Generalised Advantage Estimation (GAE), the model determines how much better or worse each response is compared to the average.
- Policy optimisation
The LLM updates itself by reinforcing high-quality responses and discouraging weaker ones.
- Critic updates
A secondary model, known as the critic, improves its ability to predict future rewards, ensuring more stable learning.
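Put together, one iteration of this loop looks roughly like the toy sketch below; the generator, reward model, and critic here are invented stand-ins, not a real API:

```python
import random

# Toy stand-ins for the real components (all invented for illustration).
def generate(prompt):        return f"{prompt} -> answer {random.random():.2f}"
def reward_model(response):  return random.uniform(0, 1)   # scores response quality
def critic(response):        return 0.5                    # predicted future reward

def ppo_iteration(prompt, n=4):
    responses = [generate(prompt) for _ in range(n)]        # 1. generate responses
    rewards = [reward_model(r) for r in responses]          # 2. score with the reward model
    advantages = [r - critic(resp)                          # 3. advantage = reward - predicted value
                  for r, resp in zip(rewards, responses)]   #    (full PPO uses GAE here)
    # 4. the policy update reinforces responses with positive advantage
    # 5. the critic is updated to better predict the observed rewards
    return list(zip(responses, advantages))

print(ppo_iteration("What is RL?"))
```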
Generalised Advantage Estimation (GAE) balances bias and variance
It’s not enough to know whether a response is good or bad; we need to measure how much better or worse it is compared to other possible responses. This is where Generalised Advantage Estimation (GAE) comes in.
There are two main ways of estimating this advantage, each with its own trade-offs.
Monte Carlo estimation (MC) – Looks at the full response and evaluates its overall reward. This is accurate (low bias) but high-variance, computationally expensive, and slow to adapt, since nothing can be learned until the whole response is finished.
Temporal Difference (TD) estimation – Evaluates the response token by token. This allows quicker, lower-variance updates but introduces bias, since it predicts rewards before seeing the full response.
GAE acts as a middle ground, balancing the accuracy of Monte Carlo (MC) with the efficiency of Temporal Difference (TD) by using a mix of short- and long-term reward predictions.
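A minimal sketch of the GAE computation, assuming per-token rewards and the critic’s value estimates are already available; λ interpolates between TD-like behaviour (λ = 0) and MC-like behaviour (λ = 1):

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over a sequence of steps (tokens)."""
    advantages, gae = [], 0.0
    values = values + [0.0]                                       # bootstrap value after the last step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]    # TD error at step t
        gae = delta + gamma * lam * gae                           # blend short- and long-term signals
        advantages.insert(0, gae)
    return advantages

# Toy example: reward arrives only at the end of the response, flat value estimates.
print(compute_gae(rewards=[0.0, 0.0, 1.0], values=[0.4, 0.5, 0.6]))
```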
The critic guides AI with value estimation
In RL, the critic (also known as the value function) helps the AI determine not just whether a response is good or bad, but how much reward it is likely to earn in the long run. It is essential for improving learning stability in PPO.
How it works:
- Predicting future rewards
The critic estimates how good a response will be before the final reward is assigned.
- Smoothing learning
Instead of relying on delayed feedback, the critic provides a continuous signal, helping the model adjust more efficiently.
- Reducing variance
By offering a more stable measure of expected reward, the critic prevents erratic updates and overcorrections.
In PPO, the critic:
- Enables Generalised Advantage Estimation (GAE) by providing a baseline for comparing different responses.
- Guides the model to make better predictions rather than blindly chasing reward signals.
- Helps PPO apply small, controlled updates, ensuring the model doesn’t change too drastically in response to individual training examples.
PPO - Putting it all together
Combining the components above with the critic’s value-function MSE loss (which is optimised alongside the LLM), the PPO objective is defined as follows:
L_PPO(θ, γ) = L_clip(θ) + w1 * H(θ) - w2 * KL(θ) - w3 * L(γ)

where:
- L_clip(θ): maximises reward for high-advantage actions (clipped to avoid instability).
- H(θ): maximises entropy to encourage exploration.
- KL(θ): penalises deviation from the reference policy (stability).
- L(γ): minimises error in the critic’s value predictions (L2 loss).
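As an illustration, here is a simplified version of the clipped policy term L_clip in PyTorch (the entropy, KL, and critic terms would be added with their weights w1, w2, w3 as in the objective above); the log-probabilities and advantages are dummy values:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped PPO objective: limit how far the new policy moves in one update."""
    ratio = torch.exp(log_probs_new - log_probs_old)                  # probability ratio new/old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                      # negated because we minimise

# Dummy per-token log-probabilities and advantages.
log_probs_new = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)
log_probs_old = torch.tensor([-1.1, -0.7, -1.5])
advantages = torch.tensor([0.8, 0.2, -0.3])

loss = ppo_clip_loss(log_probs_new, log_probs_old, advantages)
loss.backward()
```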
GRPO
GRPO (Group Relative Policy Optimization) is DeepSeek AI’s optimized take on PPO, built to be more efficient—especially when tackling complex reasoning problems.
What Makes GRPO Different?
Think of GRPO as PPO’s lightweight sibling. It retains the PPO framework but eliminates the need for a separate value function (critic), making training faster and more streamlined.
The GRPO Innovation — Group-Based Advantage Estimation (GRAE):
Instead of relying on a critic to estimate the “advantage” of a response, GRPO compares responses within a group. For each prompt, it generates multiple LLM outputs, then evaluates how each performs relative to the others using reward scores.
Simplified GRPO Training Process:
- Generate a Response Group: For each prompt, produce a batch of responses from the model.
- Score Responses: Use a reward model to score each one.
- Compute Group-Relative Advantages (GRAE): Determine how each response compares to the group average, normalizing scores to get relative advantages.
- Policy Optimization: Update the model using a PPO-style loss, guided by these group-relative advantages instead of a traditional value estimate.
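A minimal sketch of the group-relative advantage step: score a group of responses with the reward model, then normalise each score against the group’s mean and standard deviation, which replaces the critic’s value estimate:

```python
import statistics

def group_relative_advantages(rewards):
    """Advantage of each response relative to its own group (no critic needed)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0      # avoid division by zero when all scores are equal
    return [(r - mean) / std for r in rewards]

# Reward-model scores for one prompt's group of responses (dummy values).
rewards = [0.9, 0.4, 0.7, 0.2]
print(group_relative_advantages(rewards))        # above-average responses get positive advantages
```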
About FLock
FLock.io is a community-driven platform facilitating the creation of private, on-chain AI models. By combining federated learning with blockchain technology, FLock offers a secure and collaborative environment for model training, ensuring data privacy and transparency. FLock’s ecosystem supports a diverse range of participants, including data providers, task creators, and AI developers, incentivising engagement through its native FLOCK token.