Policy Optimization Algorithms for Alignment
Steering language models toward safer outputs…
Note: This post is an addendum to my previous blog, “Gaming the System: Reward Hacking in Language-Model training”; refer to it for more context.

As we were discussing, large language models (LLMs) are incredibly powerful, mastering broad knowledge and reasoning skills through unsupervised training. However, achieving precise control over their behavior so that it aligns with desired characteristics, like helpfulness or safety, remains a significant challenge.
To align these models so that they generate responses matching human preferences, reinforcement learning from human feedback (RLHF) has become a common approach. Within the RL step of RLHF, methods from the family of policy gradient algorithms are frequently used to optimize the LLM’s policy.
Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) is a prominent family of policy gradient methods developed for reinforcement learning.
After training a reward model \({r_\phi(x,y)}\) on human preference data, we treat the language model as a stochastic policy \(\pi_\theta(y \mid x)\) and seek to maximize its expected cumulative reward.
\[J(\theta) \;=\; \mathbb{E}_{x,y\sim\pi_\theta}\Big[\sum_{t=0}^T \gamma^t\,r_\phi(x_t,y_t)\Big]\]PPO is typically run on batches of prompts, generating outputs and rewards, and updating the model gradually. At each iteration, PPO first samples a batch of rollouts \((x_t, y_t)\) under the current policy and computes the advantage estimates \(\hat{A}_t\)
\[\hat{A}_t \;=\; \sum_{t'=t}^T \gamma^{\,t'-t}r_\phi(x_{t'},y_{t'}) \;-\; V_w(x_t)\]where \(V_w\) is a learned value function.
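To make this concrete, here is a minimal PyTorch sketch of the return-to-go advantage above (the function name and tensor shapes are illustrative assumptions, not taken from any particular RLHF library):

```python
import torch

def monte_carlo_advantages(rewards: torch.Tensor,
                           values: torch.Tensor,
                           gamma: float = 0.99) -> torch.Tensor:
    """A_t = sum_{t'>=t} gamma^(t'-t) * r_t' - V_w(x_t), for a single rollout.

    rewards: shape (T,), per-step scores from the reward model r_phi
    values:  shape (T,), baseline predictions from the learned value function V_w
    """
    T = rewards.shape[0]
    returns = torch.zeros(T)
    running = 0.0
    # Accumulate the discounted return-to-go by sweeping the rollout backwards.
    for t in reversed(range(T)):
        running = rewards[t].item() + gamma * running
        returns[t] = running
    return returns - values  # advantage = return-to-go minus value baseline
```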
PPO then forms the probability ratio \(r_\theta(x, y)\)
\[r_{\theta}(x,y) \;=\; \frac{\pi_\theta(y_t\!\mid\!x_t)}{\pi_{\theta_{\rm old}}(y_t\!\mid\!x_t)}\]and optimizes the clipped objective,
\[L^{\text{CLIP}}(\theta) \;=\; -\mathbb{E}_{x,y \sim \pi_{\theta_{\text{old}}}}\Big[\min\big(r_{\theta}(x,y)\,\hat{A}_t,\;\mathrm{clip}(r_{\theta}(x,y),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big]\]This clipped objective enforces an implicit “trust region” of size \(\epsilon\) around the old policy, preventing overly large updates. The clipping ensures that the policy (the LLM) doesn’t suddenly shift its output distribution in a way that would ruin language coherence or overshoot the reward maximum.
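A minimal sketch of the clipped surrogate, assuming per-sample log-probabilities and advantage estimates are already computed (the names below are illustrative):

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate objective, averaged over a batch of samples."""
    ratio = torch.exp(logp_new - logp_old)  # r_theta = pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the elementwise min means the policy gets no extra credit for
    # pushing the ratio outside the [1 - eps, 1 + eps] trust region.
    return -torch.min(unclipped, clipped).mean()
```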
PPO’s clipped update makes it easier to fine-tune large models with fewer stability issues than vanilla policy gradient methods, while remaining far simpler than earlier trust-region approaches like TRPO. In alignment terms, PPO’s conservatism helps prevent extreme behaviors that would clearly indicate reward hacking.
However, PPO alone does not guarantee the absence of reward hacking – it just limits how fast the model can shift.
Therefore, implementations pair PPO with a KL-divergence penalty between the new policy and a reference model \(\pi_{\text{ref}}\) (the original pre-trained or SFT model) to keep outputs from deviating too far from human-like text. Intuitively, this KL penalty is a regularizer that “pulls back” the policy if it starts generating very unusual text just to boost the reward score (e.g., gibberish that happens to score highly).
In sum, the RL objective for the LLM often becomes maximizing,
\[J(\theta) = \mathbb{E}_{x,y\sim \pi_\theta}[\,r_\phi(x,y)\,] - \beta\,\textrm{KL}\big(\pi_\theta(\cdot|x)\;\|\;\pi_{\text{ref}}(\cdot|x)\big)\]where \(\beta\) is a chosen penalty coefficient, and the KL term acts as an anchor to keep the policy’s behavior within the realm of the reference model. This approach was used in OpenAI’s InstructGPT and related work to successfully align models with human preferences.
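In many implementations this penalty is folded directly into the reward the policy is trained on. A rough sketch, assuming per-token log-probabilities under the policy and the frozen reference model are available (the per-sequence KL here is a simple sampled-token estimate; names are illustrative):

```python
import torch

def kl_penalized_reward(reward: torch.Tensor,
                        logp_policy: torch.Tensor,
                        logp_ref: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Reward-model score minus beta * KL(pi_theta || pi_ref), per response.

    reward:      shape (B,),   reward-model score for each sampled response
    logp_policy: shape (B, T), per-token log-probs under the current policy
    logp_ref:    shape (B, T), per-token log-probs under the reference model
    """
    # Sampled-token Monte-Carlo estimate of the KL divergence from the reference.
    kl_estimate = (logp_policy - logp_ref).sum(dim=-1)
    return reward - beta * kl_estimate
```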
Group Relative Policy Optimization (GRPO)
Popularized by the cost-efficient training of DeepSeek-R1-Zero, Group Relative Policy Optimization (GRPO) extends the RLHF paradigm by forgoing the separate value (critic) model used in standard PPO and instead using a group of \(N\) sampled outputs \(\{o_1, o_2, \dots, o_N\}\) for each prompt \(x\) to compute an on-the-fly baseline. The GRPO breakdown looks like this:
We first compute the multi-component reward \(r_i\),
\[r_i = \lambda\,r_{\rm outcome}(o_i, x) + (1 - \lambda)\,r_{\rm process}(\tau_i, x)\]where \(\tau_i\) is the model’s chain-of-thought trace for output \(o_i\) and \(\lambda\) balances final-answer correctness against reasoning quality. Then, across the group, compute,
\[\mu = \frac{1}{N}\sum_{i=1}^N r_i, \quad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N (r_i - \mu)^2}\]Then form the normalized advantage for each output,
\[A_i = \frac{r_i - \mu}{\sigma}\]The GRPO policy update then mirrors PPO’s clipped objective: for each sample, form the probability ratio
\[\rho_i = \frac{\pi_\theta(o_i \mid x)}{\pi_{\theta_{\rm old}}(o_i \mid x)}\]and optimize the GRPO objective
\[L^{\rm GRPO}(\theta) = \mathbb{E}_i\Big[\min\big(\rho_i\,A_i,\;\mathrm{clip}(\rho_i,\,1 - \epsilon,\,1 + \epsilon)\,A_i\big)\Big]\]By rewarding the thought process alongside the outcome, GRPO operationalizes process supervision that can flag solution hacks even when the final answer superficially satisfies correctness metrics.
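Putting the group statistics and the clipped update together, a minimal sketch for a single prompt’s group of \(N\) outputs might look like this (assuming the mixed rewards \(r_i\) and per-output log-probabilities are already computed; names are illustrative):

```python
import torch

def grpo_loss(rewards: torch.Tensor,
              logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              eps: float = 0.2) -> torch.Tensor:
    """Clipped GRPO surrogate for one prompt's group of N sampled outputs.

    rewards:  shape (N,), mixed outcome/process rewards r_i for the group
    logp_new: shape (N,), log pi_theta(o_i | x)
    logp_old: shape (N,), log pi_theta_old(o_i | x)
    """
    # Group-relative baseline: normalize rewards by the group mean and std.
    advantages = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Negative sign: minimizing this loss maximizes the clipped surrogate above.
    return -torch.min(unclipped, clipped).mean()
```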
GRPO’s flexible reward modeling was exemplified in DeepSeek-R1-Zero, where accuracy and format rewards guided chain-of-thought structure despite the absence of a separate critic network.
In practical text-to-SQL demonstrations, community tutorials have shown how GRPO can encourage models to explain table and join selections, rewarding schema linking and join reasoning in addition to SQL correctness, to catch silent errors early.
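A format- or process-style reward of the kind mentioned above can be as simple as a pattern check. The sketch below is purely hypothetical (the tag names and scoring are illustrative, not DeepSeek’s or any tutorial’s actual code):

```python
import re

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion wraps its reasoning in <think>...</think> tags
    and then gives a final answer, else 0.0 (tag names are hypothetical)."""
    pattern = r"<think>.*?</think>\s*.+"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0
```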
Direct Preference Optimization (DPO)
Introduced by Rafailov et al. (2023), Direct Preference Optimization is a newer, reinforcement-learning-free method for alignment that emerged as a response to PPO’s complexity. The authors show that DPO achieves comparable or better results than PPO-based RLHF on tasks like controlling sentiment and summarization quality.
DPO treats the preference-labeled data as a direct training signal for the final policy, rather than training a separate reward model for RL. It leverages the insight that a language model’s logits can themselves represent the reward function implicitly, resulting in a closed-form loss that encourages preferred outputs to outrank unpreferred ones.
The DPO algorithm reparameterizes the RLHF objective in closed form: given a preferred output \(y^+\) vs unpreferred output \(y^-\) for the same prompt, one can derive a simple loss on the LLM’s logits that will make \(y^+\) more likely than \(y^-\) by an amount corresponding to the “optimal” policy.
Concretely, given a batch of preference pairs \(\{y^+, y^-\}\) for the same prompt \(x\), where \(y^+\) is the human-preferred output, you define the log-probability difference relative to a fixed reference policy \(\pi_{\text{ref}}\):
\[\Delta_\theta(x,y^+,y^-) = \bigl[\log\pi_\theta(y^+\!\mid\!x)-\log\pi_{\mathrm{ref}}(y^+\!\mid\!x)\bigr] -\bigl[\log\pi_\theta(y^-\!\mid\!x)-\log\pi_{\mathrm{ref}}(y^-\!\mid\!x)\bigr]\]The DPO objective is then a simple logistic (sigmoid) loss on this difference:
\[L_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x,y^+,y^-)}\!\Big[\log \sigma\!\bigl(\tfrac{1}{\tau}\,\Delta_\theta(x,y^+,y^-)\bigr)\Big]\]where,
\[\sigma(z)=1/(1+e^{-z})\]is the logistic (sigmoid) function and \(\tau>0\) is a temperature that calibrates how strongly preferences are enforced.
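Written out as code, the DPO loss is just a logistic loss on that log-ratio difference. A minimal sketch, assuming each response’s log-probability has already been summed over its tokens under both the policy and the frozen reference model (names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos: torch.Tensor, logp_neg: torch.Tensor,
             ref_logp_pos: torch.Tensor, ref_logp_neg: torch.Tensor,
             tau: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (preferred, unpreferred) pairs.

    Each tensor has shape (B,) and holds log pi(y | x) summed over response tokens.
    """
    # Delta_theta from the text: policy log-ratio for y+ minus policy log-ratio for y-.
    delta = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
    # -log sigmoid(delta / tau); logsigmoid is the numerically stable form.
    return -F.logsigmoid(delta / tau).mean()
```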
In effect, DPO reduces to a weighted log-likelihood update on comparison data—no trust-region clipping, no actor-critic separation, no iterative RL loop with PPO hyperparameters. This makes DPO much simpler and more stable to implement.
By eliminating the explicit reward optimization loop, DPO can inherently avoid some reward hacking issues that arise during RL optimization. However, DPO still relies on the quality of the human preference dataset – if those preferences have biases (say, annotators unknowingly prefer longer answers), the LLM will still learn that bias. In other words, it sidesteps the dynamic gaming of a learned reward during training, but not the static biases in the preference data or model class. Nonetheless, DPO is a promising avenue for alignment.
With this extra context about the policy optimization algorithms, you can head back into the previous blog — “Gaming the System: Reward Hacking in Language-Model training.”