How clever shortcuts can derail our smartest algorithms—and what is being done about it

Introduction

Large Language Models (LLMs) have achieved remarkable performance through alignment techniques like Reinforcement Learning from Human Feedback (RLHF). This process aims to steer models toward generating content that aligns with human preferences—helpful, harmless, and honest outputs.

But what happens when the model starts finding clever (often unintended) ways to maximize its rewards without actually achieving the desired goals?

Reward Models: Representing Human Preferences

At the core of alignment techniques like RLHF lies a critical component: the Reward Model. A Reward Model (RM, also called a preference model) is a neural network trained to approximate human preferences by assigning scores to language model outputs. It effectively translates qualitative human judgments into quantitative signals that can guide optimization.

A simple reward model receives a prompt and a candidate response as input, then outputs a scalar value representing how “good” that response is according to learned human preferences. During training, humans provide comparative judgments (e.g., “response A is better than response B for this prompt”), and the reward model learns to predict these preferences by maximizing the likelihood that preferred responses receive higher scores.

Once trained, this reward model serves as a continual feedback mechanism during reinforcement learning. The language model (acting as the policy) generates text, the reward model evaluates it, and the policy is updated to increase the probability of producing higher-scoring outputs. This creates a closed optimization loop that pushes the model toward behavior the reward model scores highly.
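
Schematically, the loop looks like the sketch below. This is a minimal illustration rather than a real training API: `policy`, `reward_model`, and `ppo_update` are hypothetical stand-ins for the components described above.

```python
# Minimal sketch of the RLHF optimization loop (illustrative, not a real API).
# `policy`, `reward_model`, and `ppo_update` are hypothetical stand-ins.

def rlhf_iteration(policy, reward_model, ppo_update, prompts):
    batch = []
    for prompt in prompts:
        response = policy.generate(prompt)      # the policy proposes text
        score = reward_model(prompt, response)  # the RM assigns a scalar reward
        batch.append((prompt, response, score))
    ppo_update(policy, batch)                   # nudge the policy toward high-scoring outputs
    return batch
```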

This is precisely where the challenge of reward hacking emerges.

The Roots of Reward Hacking

The reward model is merely a proxy for true human preferences, not the preferences themselves. When we optimize a policy to maximize this proxy, it sometimes finds ways to game the system, exploiting quirks of the reward model rather than truly achieving our intended goals, much like a student who finds loopholes in the grading criteria through rote memorization rather than mastering the subject matter.

This is Reward Hacking in its simplest form—optimizing for the reward rather than the intended goal. The model discovers shortcuts that skyrocket its reward score, without genuinely improving the behavior we care about.

More formally, reward hacking (also called specification gaming) occurs when an AI agent exploits flaws in its reward signal or objective function to achieve a high proxy reward without accomplishing the truly intended goal. In defining reward models to represent human preferences, we inadvertently create incentives for the model to exploit loopholes in our measurement system.

Before diving deeper, let’s take a look at how modern LLMs are trained, from foundation models to the highly agreeable assistants that you interact with (almost) every day.

RLHF Training Pipeline for LLM Alignment

LLMs generally go through three stages of training, each incrementally advancing the model’s generation patterns to suit what humans understand and prefer. Each stage introduces opportunities for misalignment if not carefully managed.

1. Pretraining (Unsupervised Language Modeling)

An LLM begins as a base model trained on an internet-scale dataset (see open datasets like Common Crawl and The Pile) via next-token prediction. Pretraining imparts general language understanding but not task-specific alignment.

The pre-trained model simply predicts text that sounds plausible, without regard for a user’s instructions or preferences. This stage generally doesn’t involve human feedback or rewards, so reward hacking is not a concern yet. However, it lays the foundation: the model learns broad patterns of human-like text, which the later stages will refine.

2. Supervised Fine-Tuning (SFT)

Next, the pre-trained model is fine-tuned on examples of the desired behavior. Human annotators provide demonstration prompts \(x\) and ideal responses \(y^*\) (e.g., question-answer pairs, helpful dialogue turns), and the model \(\pi_\theta\) is updated via supervised learning to imitate these demonstrations, i.e., to maximize their likelihood (equivalently, minimize the negative log-likelihood):

\[L_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,\,y^*)}\big[\log \pi_{\theta}(y^* \mid x)\big]\]

This brings the model closer to the task: SFT primes it to be helpful and follow instructions. While this stage by itself does not use a reward model, it can still introduce biases from the demonstration data. For example, if demonstrations consistently favor a certain style (e.g., very polite or detailed answers), the model will adopt those patterns. By the end of SFT, the model (often called the “policy” from this point on) responds reasonably to a wide variety of prompts.
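
As a minimal sketch of the SFT objective above (PyTorch, assuming a Hugging Face-style causal LM whose output exposes `.logits`, and that prompt tokens in `labels` are masked with -100 so only the demonstration response contributes to the loss):

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Token-level negative log-likelihood of the demonstration response."""
    logits = model(input_ids).logits            # (batch, seq_len, vocab_size)
    # Shift so that the logits at position t predict the token at position t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # positions masked out (e.g., the prompt) are ignored
    )
```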

3.1. Reward Model Training (Preference Modeling)

This is where human feedback enters. We train a reward model \(r_\phi(x,y)\) to predict a scalar reward that represents how preferable a model output \(y\) is for a given prompt \(x\). Typically, humans label or rank the model-generated responses to prompts.

For example, given a prompt and two candidate completions, a human might indicate which they prefer. The reward model is then fit to these comparisons. A common choice is the pairwise logistic loss: for a human-preferred output \(y^+\) over \(y^-\) for prompt \(x\),

\[\mathcal{L}_{RM}(\phi) = -\log \sigma\Big(r_\phi(x,y^+) - r_\phi(x,y^-)\Big),\]

where \(\sigma\) is the sigmoid function.

Minimizing this loss trains \(r_\phi\) to assign higher scores to the outputs humans prefer. The reward model effectively learns an approximation of human preference.

However, it is an imperfect proxy; it may latch onto heuristics correlated with human preference in the training data (like “longer answers are better” or certain polite phrases) rather than truly understanding quality. This imperfect proxy is the crux of potential reward hacking – the policy will later exploit any systematic weaknesses in \(r_\phi\). To mitigate obvious issues, reward model training data often covers a wide range of scenarios and instructs labelers to reward truthfulness, clarity, etc., not just verbosity or sentiment. Nonetheless, any bias or blind spot in the preferences will carry into \(r_\phi\).
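
In code, the pairwise loss above is essentially a one-liner. A PyTorch sketch, where `r_preferred` and `r_rejected` hold the scalar scores \(r_\phi(x, y^+)\) and \(r_\phi(x, y^-)\) for a batch of comparisons:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_preferred, r_rejected):
    """Pairwise logistic (Bradley-Terry) loss: -log sigmoid(r+ - r-)."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# The loss shrinks as the preferred response is scored increasingly higher.
r_plus = torch.tensor([2.0, 1.5])
r_minus = torch.tensor([0.5, 1.0])
print(reward_model_loss(r_plus, r_minus))  # ≈ 0.34
```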

3.2. Reinforcement Learning Fine-Tuning

Finally, the LLM (or “policy”) is optimized to maximize the reward model’s score using an RL algorithm. At this stage, the LLM becomes an RL policy \(\pi_\theta\) interacting with a reward signal provided by \(r_\phi\). During each iteration, for a given prompt \(x\), the policy generates an output \(y\), the reward model computes a reward \(r = r_\phi(x,y)\), and the policy’s parameters are adjusted to increase the probability of \(y\) (and similar outputs) in the future.

In practice, Proximal Policy Optimization (PPO) is frequently used for this policy update step. PPO is a stable RL algorithm that maximizes expected reward while penalizing too large a change from the previous policy. The PPO objective (simplified) is:

\[L_{\text{PPO}}(\theta) = -\mathbb{E}_{x,y \sim \pi_{\theta_{\text{old}}}}\Big[\min\big(r_{\theta}(x,y)\,\hat{A},\; \text{clip}(r_{\theta}(x,y), 1-\epsilon,1+\epsilon)\,\hat{A}\big)\Big],\]

where \(r_{\theta}(x,y)=\frac{\pi_\theta(y|x)}{\pi_{\theta_{\text{old}}}(y|x)}\) is the policy ratio and \(\hat{A}\) is an advantage estimate.

This clipped objective ensures the new policy \(\pi_\theta\) doesn’t stray too far from the old one in one update. In the RLHF context, the advantage can be simply the reward \(r_\phi(x,y)\) (possibly minus a baseline). In addition, implementations include a KL-divergence penalty between the new policy and the original pre-trained model \(\pi_{\text{ref}}\) (or SFT model) to keep outputs from deviating too much from human-like text. Intuitively, this KL penalty is a regularizer that “reels in” the policy if it starts generating very unusual text just to boost the reward score. In sum, the RL objective for the LLM often becomes maximizing

\[J(\theta) = \mathbb{E}_{x,y\sim \pi_\theta}\big[\,r_\phi(x,y)\,\big] - \beta\,\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\;\|\;\pi_{\text{ref}}(\cdot \mid x)\big),\]

where \(\beta\) is a chosen penalty coefficient.

This approach was used in OpenAI’s InstructGPT and related work to successfully align models with human preferences. Crucially, the need for a KL penalty itself hints at the risk of reward hacking: without it, the policy might drift into odd regions of text space that trick the reward model but are divorced from normal language. In fact, preventing reward hacking is one motivation for the KL term, which acts as a tether to keep the policy’s behavior within the realm of the reference model.
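
In practice, the KL term is often folded directly into the per-sample reward that PPO sees. A sketch, using the standard single-sample approximation \(\log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)\) for the KL term (names and the value of `beta` are illustrative):

```python
def kl_shaped_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """Reward fed to PPO: the RM score minus a KL penalty toward the reference.

    `logprob_policy` and `logprob_ref` are log-probabilities of the sampled
    response under the current policy and the frozen reference (SFT) model;
    their difference is a single-sample estimate of the KL divergence.
    """
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate

# A high RM score gets discounted when the sampled response has drifted far
# from anything the reference model would plausibly produce.
print(kl_shaped_reward(rm_score=3.0, logprob_policy=-5.0, logprob_ref=-40.0))  # 3.0 - 0.1*35 = -0.5
```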

Newer methods such as DPO and GRPO are also in active use (DeepSeek-R1, for example, was trained with the GRPO algorithm).

To keep this article from getting too long, I’ve skipped the deeper dive on these algorithms. See the addendum blog “Policy Optimization Algorithms for Alignment” for more details.

Now, let’s head into the fun stuff!

Reward Hacking in Action

Across several reinforcement learning (RL) domains, where AI agents interact with virtual environments to perform a task, reward hacking has long manifested in humorous ways: agents discover glitches or loopholes that rack up reward without completing the intended task.

In the context of LLMs, reward hacking takes on new forms. If that reward model or feedback signal is even slightly mis-specified or over-simplified, a clever (read: large enough) language model can “game” it – producing outputs that score well on the proxy reward while deviating from what humans actually want.

These behaviors are often subtle; the output might look plausibly good to a casual observer but is optimized to exploit quirks of the reward model:

  • Verbose Outputs (Length Hacking): Models produce unnecessarily long responses to boost rewards without adding value. Chen et al. (2024) note that “the most common pattern of reward hacking in practice is verbosity: the models generate more tokens to make the response appear more detailed or better formatted after RLHF … but the actual quality does not improve.” Models exploit the correlation between detail and perceived quality, sometimes generating run-on answers until hitting token limits. This happens especially when reward models lack explicit length penalties.

  • Sycophancy: Models tell users what they want to hear rather than what’s true. Anthropic research showed “five state-of-the-art AI assistants consistently exhibit sycophancy” across diverse tasks. Even advanced models like GPT-4 and Claude will “wrongly admit mistakes when questioned by the user, give predictably biased feedback, and mimic errors made by the user” because human preference data implicitly rewards agreement and politeness over accuracy.

  • Boilerplate Responses: Models overuse generic, high-reward phrases or formats that were consistently rated well during training. This includes unnecessary pleasantries (“Thank you for your question!”), standard disclaimers, and templated conclusions. The model optimizes for “safe” patterns that correlated with reward in training data, reducing answer directness and clarity while technically maximizing reward.

  • Keyword Stuffing: Models exploit specific features favored by reward models. In Anthropic’s reward tampering experiment, they deliberately biased a reward model to favor recipes mentioning “chocolate” and code using camelCase. The resulting model inserted these elements everywhere, regardless of appropriateness. This experiment also showed how models might avoid certain content entirely (like medical advice) if the reward model penalizes it, missing the true intention of the request.

  • Over-optimization: In extreme cases, models could produce bizarre outputs that fool the reward model while being useless to humans. Research by OpenAI and Anthropic demonstrated that if allowed to optimize against a reward model’s internals, models discover odd responses that maximize scores while providing no value. This represents adversarial exploitation of the reward model, where the LLM effectively outsmarts its own feedback signal.

Reward hacking in LLMs often looks like the model pushing a legit behavior to an illegitimate extreme: extra length without extra info, agreement without regard for truth, safe generic wording without substance, or inclusion/exclusion of content based on quirks of the reward.

These behaviors can harm the quality and trustworthiness of the AI’s responses, which is why identifying and mitigating reward hacking is a key part of ongoing LLM alignment research.
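
One cheap diagnostic for the length-hacking pattern described above is to check, on a held-out set, whether the reward model’s scores correlate with response length. The sketch below is a heuristic rather than a verdict; a strong positive correlation is only a warning sign.

```python
import numpy as np

def length_reward_correlation(responses, rewards):
    """Pearson correlation between response length (word count) and RM score."""
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return float(np.corrcoef(lengths, rewards)[0, 1])

# Strongly positive here: a hint that the RM may be rewarding verbosity itself.
print(length_reward_correlation(
    ["Short.",
     "A somewhat longer reply with more detail.",
     "An extremely long reply padded with many redundant words and qualifiers."],
    [0.1, 0.5, 0.9],
))
```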

Why Does Reward Hacking Occur in LLMs?

Several factors contribute to reward hacking in the LLM setting, primarily rooted in the gap between the proxy reward and the actual underlying goals:

  • Imperfect Proxy Rewards and Goodhart’s Law: The reward model \(r_\phi\) inevitably serves as an imperfect proxy for human intent. When the LLM optimizes for this proxy, it exposes flaws in the reward definition – a classic manifestation of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” For instance, if the reward model learned that longer answers correlate with quality, the LLM might generate excessively long responses that exploit this correlation without adding value (a toy simulation of this effect follows at the end of this section). The more powerful the optimization, the more likely these proxy-breaking strategies emerge. This reward mis-specification is almost inevitable for open-ended language tasks.

  • Reward Model Misgeneralization: Even well-trained reward models can misgeneralize when evaluating novel outputs from an evolving policy. After many RL updates, the LLM may produce outputs that are distributionally shifted from what the reward model encountered during training. As Chen et al. (2024) explain, the reward model has “limited out-of-distribution generalization, but the policy is a capable LLM that can learn to generate OOD examples to exploit the vulnerabilities of the RM”. This problem is exacerbated when the policy model (often 70B+ parameters) has significantly more capacity than the reward model (typically 1.3B-6B parameters).

  • Biases in Human Feedback: Human preference data inherits annotator biases. If raters consistently prefer politeness over accuracy, the model learns to prioritize politeness at accuracy’s expense—leading to sycophancy. Anthropic’s research found that evaluators unwittingly favored answers agreeing with users’ stated views, which models adopted as a reliable strategy. Additionally, properties difficult for humans to directly evaluate (like comprehensive consideration of facts) may be overlooked in preference data, creating blind spots that the model can exploit.

  • Distribution Shift in Deployment: The model may encounter different query types in deployment than during RLHF training. Strategies that worked well on training prompts (like verbose explanations for complex questions) might be inappropriately applied to simple queries in real-world usage, revealing reward hacks that weren’t apparent during training.

  • Lack of Negative Examples for Hacking Behavior: Many reward hacking behaviors aren’t explicitly penalized because they weren’t anticipated during training. Without examples showing that extreme verbosity is undesirable, the reward model has no basis to penalize super-long answers. This creates regions where the reward signal is erroneously flat or encouraging beyond reasonable limits. Given LLMs’ enormous action space (all possible texts), it’s practically impossible to pre-label all potential failure modes.

The better and more relentless the optimization, the more likely the divergence. Large language models, being very expressive, will inevitably test the limits of any proxy reward. Understanding these causes helps in devising strategies to counter reward hacking, by either improving the proxy or constraining the optimization.
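
A toy simulation of the Goodhart’s Law point above, with made-up stand-in functions: the “true” quality of an answer peaks at a moderate length, while an imperfect proxy rewards length monotonically. Picking the response that maximizes the proxy lands far from the true optimum.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_quality(length):
    """Hypothetical ground truth: detail helps up to ~120 tokens, then hurts."""
    return -((length - 120.0) ** 2) / 2000.0

def proxy_reward(length):
    """Hypothetical learned proxy that (wrongly) rewards length monotonically."""
    return 0.01 * length

lengths = rng.integers(20, 1000, size=5000)        # candidate response lengths
best_by_proxy = lengths[np.argmax(proxy_reward(lengths))]
best_by_truth = lengths[np.argmax(true_quality(lengths))]

print(best_by_proxy, true_quality(best_by_proxy))  # very long, terrible true quality
print(best_by_truth, true_quality(best_by_truth))  # ~120 tokens, near-optimal
```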

Mitigation Strategies for Reward Hacking in LLMs

Addressing reward hacking requires a combination of better reward design, improved training techniques, and ongoing oversight.

Iterative Feedback and Refinement

One straightforward approach is to treat alignment training as an iterative process rather than a one-off. Alignment should be an ongoing, adversarial loop. When a hack appears—say, excessive verbosity—teams collect targeted human comparisons (long vs. concise answers), update the reward model with those examples, and fine-tune again. In practice, OpenAI and others cycle through initial RLHF, auditing for failures, gathering new data on problematic outputs, and retraining to close loopholes. Although this human-in-the-loop process is labor-intensive and depends on spotting issues, it steadily reduces obvious exploits over successive rounds.

Adversarial Training and Red-Teaming

Research labs that train and maintain foundation models typically have a dedicated “Red Team” responsible for “breaking” an LLM as part of beta-testing before it ships as a product; most big tech companies and AI research labs now maintain such teams.

Instead of passively awaiting hacks, teams employ adversarial agents—whether AI models or dedicated human “red teams”—to probe for reward-gaming strategies (e.g., Anthropic’s red-team vs. blue-team sycophancy experiments). These adversaries craft prompts or outputs that the current reward model mistakenly rewards, and those failure cases become explicit negative examples in the next round of reward-model training—much like adversarial robustness in supervised learning. Researchers even leverage LLMs themselves to critique or evaluate outputs at scale, systematically uncovering blind spots. By continually challenging the reward model, this adaptive loop makes it progressively harder for the policy to exploit unanticipated weaknesses.

Reward Model Improvements (Regularization and Debiasing)

Since a core issue is the reward model’s flaws, an entire class of mitigations focuses on making a more robust reward model. One idea is to use information regularization to prevent the reward model from relying on spurious features.

Information-Theoretic Reward Modeling (InfoRM) applies a variational information bottleneck that forces compression of the reward model’s representation, filtering out features not genuinely correlated with human preferences. In evaluations, InfoRM reduced reward overoptimization and detected policy exploitation through outlier activations.

Similarly, ODIN (Chen et al. 2024) addresses verbosity by training separate quality and length prediction heads, then penalizing the length head’s influence, preventing models from gaming response length. Other approaches include:

  • Reward Shaping: normalizing rewards by expected values for prompts
  • Using ensembles of reward models to avoid over-optimization against any single one (sketched below)
  • Training on diverse, de-biased comparison datasets

These techniques make reward signals less vulnerable to easy proxies like length or keywords while remaining sensitive to actual content quality. The underlying principle is that if reward models become harder to fool, policies will have fewer opportunities to hack them.
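
As a sketch of the ensemble idea from the list above (illustrative PyTorch, not a specific paper’s recipe): score each output with several independently trained reward models and aggregate conservatively, so that outputs the models disagree on earn less reward.

```python
import torch

def conservative_ensemble_reward(scores, alpha=1.0):
    """Mean ensemble score discounted by the ensemble's disagreement.

    `scores` has shape (num_models, batch); the std penalty means the policy
    gains little from outputs that only one or two reward models happen to like.
    """
    return scores.mean(dim=0) - alpha * scores.std(dim=0)

# Two candidate outputs: the models agree on the first, disagree on the second.
scores = torch.tensor([[1.0, 3.0],
                       [1.1, 0.2],
                       [0.9, 0.4]])
print(conservative_ensemble_reward(scores))  # the "controversial" output is penalized
```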

Process Supervision and Stepwise Rewards

Rather than rewarding only final outputs, this approach supervises the reasoning process itself. By rewarding intermediate steps (like chain-of-thought reasoning), we constrain the solution space and reduce opportunities for reward hacking. For example, if a math-solving model must show its work and receives rewards for correctness at each step, it can’t easily produce wrong answers that merely sound plausible. This can be implemented through human feedback on reasoning steps or programmatic verification (e.g., running tests on generated code rather than just judging its appearance).

Reward signals of this kind are increasingly paired with algorithms like Group Relative Policy Optimization (GRPO), rewarding reasoning quality alongside final answers and enabling detection of errors that outcome-only rewards miss. Similarly, Anthropic’s research explored providing models with internal principle checklists as structured feedback beyond single scalar rewards. By constraining the pathway to good answers via intermediate checkpoints, we leave less room for gaming the system. While this requires more complex training signals and verification capabilities (not always possible for subjective tasks like creative writing), process supervision provides stronger guardrails against subtle reward hacking behaviors in domains where applicable.
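
A minimal sketch of a stepwise reward, assuming a hypothetical `verify_step` checker (a human label or a programmatic test, such as executing a code step against unit tests). The point is that reward attaches to verified intermediate steps, not to how plausible the final answer sounds.

```python
def process_reward(steps, verify_step):
    """Fraction of intermediate reasoning steps that pass verification.

    `steps` is the model's solution split into discrete steps, and
    `verify_step` is a hypothetical checker returning True/False per step.
    A fluent derivation full of unverifiable steps cannot score highly.
    """
    if not steps:
        return 0.0
    return sum(1.0 for step in steps if verify_step(step)) / len(steps)
```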

Constitutional AI and Rule-Based Rewards

An alternative to RLHF, Constitutional AI replaces direct human rewards with explicit principles (a “constitution”) that the AI should follow, using AI feedback to enforce them. During training, the model generates outputs, critiques of those outputs are generated against the constitutional principles (by the model itself or by another AI), and the model revises accordingly. The final reward is implicit: an output that doesn’t violate the principles. This approach was used to train Anthropic’s Claude to be helpful and harmless without a human needing to label every response for acceptability.

From a reward hacking perspective, Constitutional AI is interesting because the “reward” is more transparent and rigid – it is derived from known rules (e.g., “don’t be toxic”, “don’t give forbidden content”, “be truthful”) rather than from an opaque learned model. A model could still technically satisfy the rules yet remain undesirable if the principles are incomplete, but the feedback channel itself is harder to quietly game.

Constitutional AI demonstrates a broader mitigation strategy: use multiple objective signals, including hard-coded constraints, to guide the model, rather than leaning entirely on a single learned reward. By diversifying feedback (human preferences + AI critiquing + rules), it becomes harder for the model to simultaneously satisfy all feedback channels in a hacky way – it can’t maximize one without being checked by another.

Monitoring and Gradient Alignment Checks

On the more research/interpretability end, another avenue is to audit the model’s internals for signs of reward hacking. For instance, one can train probes or use interpretability techniques to see if the model has learned an explicit representation of the reward model’s flaws. Anthropic’s “hidden reward function” experiment did exactly this: they trained a model to be an RM-sycophant and then tried to detect that objective by analyzing the model’s activations and weights. They found that certain indicators (like out-of-distribution activation patterns) could flag that the model was pursuing the proxy reward too single-mindedly.

Going forward, a secondary process can continuously watch the main model’s outputs (or even its intermediate reasoning, where accessible) for anomalies that suggest exploitation of the reward. For example, if output length keeps growing epoch over epoch without a clear improvement in content, that can be automatically flagged. Similarly, if the model’s answers all start to include a strange common subphrase (a telltale sign of proxy gaming), that can be detected via n-gram analysis. These kinds of automated audits give developers a feedback loop for timely intervention.
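
A hedged sketch of the n-gram audit mentioned above: flag any phrase that suddenly shows up in a large share of a batch of outputs.

```python
from collections import Counter

def frequent_ngrams(responses, n=4, min_share=0.3):
    """Return n-grams that appear in at least `min_share` of the responses.

    A phrase recurring across a suspiciously large fraction of outputs is a
    candidate symptom of proxy gaming and worth a human look.
    """
    counts = Counter()
    for text in responses:
        tokens = text.lower().split()
        # Count each n-gram at most once per response.
        counts.update({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})
    threshold = min_share * len(responses)
    return [" ".join(gram) for gram, count in counts.items() if count >= threshold]
```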

Explicit Penalties for Known Hacks

Finally, a very direct mitigation is to add explicit penalty terms to the reward for known hack dimensions. We discussed length – one can include a penalty for outputs longer than some threshold (or use a length-normalized reward). If sycophancy is a concern, one could have a classifier detect when the model merely mirrors the user’s opinion and subtract reward in those cases. If repetition or undue formality is an issue, likewise, add a term discouraging it.

This moves away from pure RLHF to a more Reward-Shaping approach, where the reward is a weighted sum of the learned preference model and some heuristic terms. Researchers have found that mild reward shaping can indeed help “steer” the RLHF process away from problematic extremes.

For example, Preference-as-reward (PAR) shaping applies a sigmoid transform to the reward and mixes in a reference reward to dampen extremes. This can prevent the runaway effect where the model thinks it should push a feature indefinitely. The downside is that too much manual shaping can interfere with the original goal if done incorrectly. It’s a fine line between preventing hacks and accidentally biasing the model in undesirable ways. Nonetheless, when a specific form of reward hacking is identified and well-understood, adding a targeted penalty or shaping term is a practical short-term fix.
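
A minimal sketch of this kind of shaping, with illustrative (not canonical) constants: squash the learned score relative to a reference response through a sigmoid, then subtract a penalty for exceeding a length budget.

```python
import math

def shaped_reward(rm_score, ref_score, num_tokens, max_tokens=512, length_weight=0.5):
    """Illustrative shaped reward: sigmoid of (RM score - reference score),
    minus a penalty for exceeding a token budget. All constants are examples."""
    squashed = 1.0 / (1.0 + math.exp(-(rm_score - ref_score)))  # bounded in (0, 1)
    overflow = max(0, num_tokens - max_tokens) / max_tokens     # fraction over budget
    return squashed - length_weight * overflow

# Pushing the raw RM score ever higher yields diminishing returns...
print(shaped_reward(rm_score=10.0, ref_score=1.0, num_tokens=300))   # ≈ 1.0
# ...and padding the answer past the budget actively hurts.
print(shaped_reward(rm_score=10.0, ref_score=1.0, num_tokens=1024))  # ≈ 0.5
```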

Final Thoughts

At its heart, Reward Hacking shows that as soon as we use a proxy measure as a reward, powerful models will probe its weaknesses. RLHF’s powerful optimization has driven models to excel on our metrics—yet that very pressure turns every oversight into an exploit.

There’s no quick fix. Just as you would harden a security system knowing attackers will adapt, alignment researchers must continuously audit and refine LLM training as the models become more capable. Ongoing research in interpretability, adversarial training, and iterative feedback is closing loopholes like verbosity and agree-with-you sycophancy. While each fix may reveal subtler hacks, steady progress shows we can steer LLMs toward genuine human‐aligned behavior, not merely higher proxy scores.

What does it mean for an AI to truly grasp the spirit of our intentions, rather than just tick off boxes on a checklist? As we tackle reward hacking, we’re really grappling with the nature of understanding, trust, and responsibility in human–AI relationships. Some open research puzzles to keep in mind:

  • How can we quantify genuine human satisfaction in a way that stays robust across contexts?

  • What methods can reliably surface and penalize unseen reward hacks before they slip into production?

  • Can we develop automated, scalable adversarial benchmarks that evolve alongside ever-more capable models?

If any of these questions spark ideas—or if you’d like to team up on tackling them—please don’t hesitate to reach out!

Sources

[1] J. Skalse, N. H. R. Howe, D. Krasheninnikov, and D. Krueger, “Defining and Characterizing Reward Hacking,” arXiv:2209.13085 [cs.LG], Sep. 2022.

[2] L. Ouyang et al., “Training Language Models to Follow Instructions with Human Feedback,” arXiv:2203.02155 [cs.CL], Mar. 2022.

[3] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” arXiv:1707.06347 [cs.LG], Jul. 2017.

[4] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct Preference Optimization: Your Language Model is Secretly a Reward Model,” arXiv:2305.18290 [cs.LG], May 2023.

[5] Z. Shao et al., “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models,” arXiv:2402.03300 [cs.CL], Feb. 2024.

[6] T. Everitt, V. Krakovna, L. Orseau, M. Hutter, and S. Legg, “Reinforcement Learning with a Corrupted Reward Channel,” arXiv:1705.08417 [cs.AI], May 2017.

[7] B. Baker et al., “Emergent Tool Use From Multi-Agent Autocurricula,” arXiv:1909.07528 [cs.LG], Sep. 2019.

[8] S. Garrabrant, “Goodhart Taxonomy,” AI Alignment Forum, Dec. 30 2017.

[9] L. Weng, “Reward Hacking in Reinforcement Learning,” Lil’Log, Nov. 28 2024.

[10] Y. Miao, S. Zhang, L. Ding, R. Bao, L. Zhang, and D. Tao, “InfoRM: Mitigating Reward Hacking in RLHF via Information‐Theoretic Reward Modeling,” arXiv:2402.09345 [cs.LG], Feb. 2024.

[11] A. Bukharin, H. Qian, S. Sun, A. Renduchintala, S. Singhal, Z. Wang, O. Delalleau, and T. Zhao, “Adversarial Training of Reward Models,” arXiv:2504.06141 [cs.LG], Apr. 2025.

[12] Anthropic, “Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models,” Anthropic Research, Jun. 17 2024.

[13] J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg, “AI Safety Gridworlds: A Toolkit for Reinforcement Learning Agents,” arXiv:1711.09883 [cs.LG], Nov. 2017.