The Reward Horizon Problem

May 31, 2026

When we train models, we use rewards observed at short horizons: Did the code run? Did the user click on the product? Did the email get a reply?

When we use models, we care about objectives realized at longer horizons: Was the code valuable? Did the user like the product? Did the email produce a collaboration?

As we start using models for more ambitious work, this horizon mismatch between rewards and economic objectives will matter more.

Diagram of reward horizon problem. Models are trained on short-term rewards, but used to satisfy long-term objectives.

Two challenges with incorporating long-horizon objectives into reward models are:

Unobserved objectives: When the objective is observed with a delay, a reward model tuned to short-horizon outcomes may miss information about the ultimate goal.
Underdetermined attribution: When the objective cannot be cleanly attributed to a specific decision, a reward model may assign credit to the wrong action.

Challenge 1: Objective is Observed With a Delay

Suppose we would like to reward a model for producing stable code. Many stability issues are not observable at code generation time.

Here’s how long it can take a merged pull request to be reverted or corrected:

Time from merged pytest pull requests to later reversions or corrections

The data come from pytest-dev/pytest, where PR descriptions often say which prior commit they correct. I run a simple classifier on mentions of previous PRs and commits to find examples of reversions and corrections. To avoid right censoring, the denominator is all PRs merged on or before May 29, 2025.
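As a rough sketch of the calculation behind a plot like this, the snippet below computes the share of merged PRs corrected within a given number of days. It counts only PRs merged on or before a cutoff, so that every PR in the denominator has had time to be corrected. The records in `MERGED_PRS`, the function name, and the horizons are made up for illustration; they are not the pytest data.

```python
from datetime import date

# Hypothetical records: (merge date, date of first later correction or None).
MERGED_PRS = [
    (date(2024, 1, 10), date(2024, 1, 24)),  # corrected after 14 days
    (date(2024, 3, 5), None),                # never corrected
    (date(2024, 6, 1), date(2024, 9, 9)),    # corrected after 100 days
    (date(2025, 2, 1), None),                # never corrected
    (date(2026, 3, 1), date(2026, 3, 8)),    # merged after the cutoff, so excluded
]

# Denominator: PRs merged on or before this date.
CUTOFF = date(2025, 5, 29)


def share_corrected_within(prs, horizon_days, cutoff):
    """Share of PRs merged on or before `cutoff` that were corrected within `horizon_days`."""
    eligible = [(merged, fixed) for merged, fixed in prs if merged <= cutoff]
    if not eligible:
        return 0.0
    hits = sum(
        1
        for merged, fixed in eligible
        if fixed is not None and (fixed - merged).days <= horizon_days
    )
    return hits / len(eligible)


for horizon in (30, 365):
    print(horizon, share_corrected_within(MERGED_PRS, horizon, CUTOFF))
```

Without the cutoff, recently merged PRs would enter the denominator before they had a chance to be corrected, which would bias the long-horizon shares downward.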

Here’s an example correction:

bluetech commented on Mar 14, 2022 (#9768)

Regressed in fac8f28, didn’t notice since we don’t run tests in CI with -v.

The original bug took weeks to surface, at least through human review. A reward model that targets code passing tests at merge time may have treated the commit as stable.
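A comment like the one quoted above names the commit it corrects, which is what makes this kind of delayed signal recoverable from text. Here is a minimal sketch of a pattern-based detector for such mentions. The keyword list and the regular expression are assumptions chosen for illustration; they are not the classifier used for the plot and would need tuning against real PR descriptions.

```python
import re

# Hypothetical pattern: a correction keyword, a short gap, then a PR number
# (#1234) or an abbreviated/full commit hash (7 to 40 hex characters).
CORRECTION_RE = re.compile(
    r"\b(?:regress(?:ed|ion)|reverts?|reverted|introduced|broken)\b"
    r"[^.\n]{0,40}?"
    r"(?P<ref>#\d+|\b[0-9a-f]{7,40}\b)",
    re.IGNORECASE,
)


def find_corrected_ref(text):
    """Return the first PR number or commit hash that `text` appears to correct, else None."""
    match = CORRECTION_RE.search(text)
    return match.group("ref") if match else None


print(find_corrected_ref("Regressed in fac8f28, didn't notice since we don't run tests in CI with -v."))
print(find_corrected_ref("Add support for a new marker syntax."))
```

A rule like this trades recall for simplicity: it misses corrections that never name the change they fix, which is one reason delayed objectives can stay unobserved.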

We may be able to avoid some of these issues by pushing models to write different tests and consider more dependencies during code generation. In a loose sense, some of these mistakes may live “in the box” at generation time, and may not require subsequent usage to diagnose. In this case, long-horizon objectives could be more readily distilled to update the distribution of short-horizon rewards.

Some issues, however, may not be surfaced through introspection. They require delayed feedback from the world.

Challenge 2: Attribution is Underdetermined

Suppose we would like to reward a model for producing software that receives more downloads. Such an objective may not be easily traceable to a prior code change.

Consider the following plot of pytest package downloads and major/minor releases.

Pytest package downloads with major and minor releases over time

I don’t mean to suggest that downloads are the only post-training objective, or even the right objective. But even for this naive example, some issues come to light:

Confounding: A naive read of download outcomes in this plot would attribute the large, end-of-sample increase in pytest downloads to the late 2025 package release.

More plausibly, the increase in pytest downloads was a product of other events that happened during this period—including the growth of coding agent usage (Sarkar and Melas-Kyriazi, 2026) and a rapid increase in software output (Daigle, 2026).
Post-training exploration strategies produce exogenous rollout variation that guides policy updates. But it’s expensive and often infeasible to have many rollouts over the same prompt running over months to measure longer-horizon outcomes. When we rely on natural variation in rollouts, there is less certainty over whether a given choice contributed to these outcomes.
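To make the confounding point concrete, here is a toy sketch in which downloads grow at a steady rate for reasons unrelated to any release. By construction the release has zero effect, yet a before/after comparison reports a large one. All numbers and function names are hypothetical.

```python
# Hypothetical monthly downloads: steady 10% growth driven by an outside trend.
# By construction, the release at month 6 has no effect on downloads.
BASE, GROWTH, RELEASE_MONTH = 100.0, 1.10, 6
downloads = [BASE * GROWTH ** t for t in range(12)]


def mean(xs):
    return sum(xs) / len(xs)


def naive_effect(series, release):
    """Before/after comparison: mean level after the release minus mean level before."""
    return mean(series[release:]) - mean(series[:release])


def growth_rates(series):
    """Month-over-month growth ratios."""
    return [later / earlier for earlier, later in zip(series, series[1:])]


def growth_change(series, release):
    """Mean month-over-month growth after the release minus the same quantity before."""
    return mean(growth_rates(series[release:])) - mean(growth_rates(series[:release]))


print(naive_effect(downloads, RELEASE_MONTH))   # large and positive
print(growth_change(downloads, RELEASE_MONTH))  # approximately zero
```

Comparing growth rates removes the confound here only because the outside trend is assumed to be constant. With real data, that assumption is itself untestable without some source of exogenous variation.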

Credit assignment and temporal attribution: The change in user demand from a given software update depends on a complex combination of decisions.

Within each release, there are many PRs, coding choices, and upstream decisions that are hard to trace to more downloads. How do we consider the relative value of a given new feature, bugfix, or documentation change?
Temporal attribution requires assumptions about dependence (how do we consider a small fix in an earlier release that prevented a large bug in a later release?) and delays (when do end users realize that the new software has more value?).
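One way to see why attribution is underdetermined: two reasonable-looking credit rules can divide the same observed lift very differently. The sketch below uses a made-up release with three PRs. The PR names, sizes, lift, and both rules are assumptions for illustration.

```python
# Hypothetical release: lines changed per PR, and the download lift observed afterward.
RELEASE_PRS = {"new-feature": 400, "bugfix": 50, "docs": 150}
OBSERVED_LIFT = 900.0


def equal_split(prs, lift):
    """Give every PR in the release the same share of the lift."""
    return {name: lift / len(prs) for name in prs}


def size_weighted(prs, lift):
    """Give each PR a share of the lift proportional to its lines changed."""
    total = sum(prs.values())
    return {name: lift * size / total for name, size in prs.items()}


print(equal_split(RELEASE_PRS, OBSERVED_LIFT))
print(size_weighted(RELEASE_PRS, OBSERVED_LIFT))
```

Both rules account for the full lift, but they disagree by a factor of four on the bugfix. The download data alone cannot say which rule, if either, is right.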

Reward Horizons Versus Trajectory Horizons

There is a rich body of work on the trajectory horizon problem, where it is difficult to propagate information and credit through a long trajectory (e.g., Sutton, 1984; Arumugam et al., 2021; Park et al., 2025).

The challenges raised in this post relate to the reward horizon problem, where we do not observe the objective we care about at generation time (e.g., Joulani et al., 2013; Vernade et al., 2017; Pike-Burke et al., 2018). This problem has been studied in work on sycophancy (OpenAI, 2025), advertising (Chapelle, 2014), recommender systems (Mann et al., 2019; Kleinberg et al., 2022; Pan et al., 2024; Zhang et al., 2025), and social media (Cunningham, 2023). Even for short rollouts, the objective may be observed with a delay.

I don’t mean to imply these are completely separate issues. One could recast the reward horizon problem as a trajectory horizon problem over more abstract episodes, with additional confounding from the environment. But it may be useful to consider specific challenges we encounter when the horizons of post-training rewards and economic objectives differ.

Ways Forward

Here are a few potential strategies to address the reward horizon problem:

Surrogate/Critic Models: One strategy is to distill long-horizon signals into surrogate or critic models (e.g., Ng et al., 1999; Mann et al., 2019; Athey et al., 2025). For outcomes that can be predicted from information available at generation time, like some types of bugs, this may be a fruitful path. We can train classifiers of future outcomes and introduce these classifiers into post-training rewards. In the bug fixing example, such reward modifications may push models to reason more about cross-codebase dependencies that a generated code change may break. Two challenges for this strategy are limited coverage and reward hacking against the proxy (e.g., Karwowski et al., 2023).

For example,

Identify PRs that passed at merge time but were corrected weeks later
Identify a matched sample of semantically related PRs that were not corrected later
Contrastively learn a rubric, or develop another kind of critic model, on this data
Modify the reward to account for this learned model
Measure how generated outputs change
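The steps above can be sketched in miniature. The snippet below stands in a tiny logistic regression for the critic (in place of a contrastively learned rubric) and subtracts its predicted correction risk from a pass/fail test reward. The features, training data, penalty weight, and function names are all hypothetical, and a real critic would need far richer inputs.

```python
import math

# Hypothetical features per PR: (share of repo files touched, 1.0 if tests were added).
# Label 1 = passed at merge time but corrected later; 0 = matched PR never corrected.
TRAIN = [
    ((0.9, 0.0), 1), ((0.8, 0.0), 1), ((0.7, 1.0), 1), ((0.6, 0.0), 1),
    ((0.2, 1.0), 0), ((0.3, 1.0), 0), ((0.4, 0.0), 0), ((0.1, 1.0), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_critic(data, lr=0.5, steps=2000):
    """Fit a logistic regression by full-batch gradient ascent on the log-likelihood."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            err = y - sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            grad_w[0] += err * x[0]
            grad_w[1] += err * x[1]
            grad_b += err
        w = [w[0] + lr * grad_w[0] / n, w[1] + lr * grad_w[1] / n]
        b += lr * grad_b / n
    return w, b


def correction_risk(model, x):
    """Predicted probability that a PR with features `x` is corrected later."""
    w, b = model
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


def shaped_reward(tests_passed, risk, penalty=0.5):
    """Short-horizon test reward minus a penalty for predicted later correction."""
    return (1.0 if tests_passed else 0.0) - penalty * risk


CRITIC = train_critic(TRAIN)
risky = correction_risk(CRITIC, (0.9, 0.0))  # broad change, no tests added
safe = correction_risk(CRITIC, (0.1, 1.0))   # narrow change, tests added
print(shaped_reward(True, risky), shaped_reward(True, safe))
```

Both changes pass their tests, but the shaped reward ranks the narrow, tested change higher. The same mechanism is where proxy hacking can enter: a policy could learn to add superficial tests to lower predicted risk without improving stability.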

Conditioning on Privileged Information: A related strategy is to condition a teacher model on privileged information about realized long-horizon objectives (e.g., Vapnik and Izmailov, 2015; Penaloza et al., 2026). Such distillation may produce richer feedback for models, but may risk pushing toward example-specific model behavior.

Attribution Infrastructure: Another strategy is to change record-keeping practices to make model generations more attributable. By storing past instructions, trajectories, and outputs, future changes may more cleanly refer to past decisions. Such data may provide useful training signal (e.g., Liao et al., 2021).
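As one illustration of what such record-keeping might look like, the sketch below stores each generation with its instruction, trajectory, and output, plus explicit links to the earlier generations it corrects. The schema, field names, and identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationRecord:
    """One stored model generation (hypothetical schema)."""
    record_id: str
    instruction: str
    trajectory: tuple      # intermediate steps, e.g. tool calls
    output: str
    corrects: tuple = ()   # record_ids of earlier generations this one fixes


def delayed_labels(records):
    """Map each record_id to the ids of later records that corrected it."""
    labels = {record.record_id: [] for record in records}
    for record in records:
        for target in record.corrects:
            labels.setdefault(target, []).append(record.record_id)
    return labels


LOG = [
    GenerationRecord("gen-001", "add verbose flag", ("edit cli.py",), "patch A"),
    GenerationRecord("gen-002", "update docs", ("edit README",), "patch B"),
    GenerationRecord("gen-003", "fix verbose regression", ("edit cli.py",), "patch C",
                     corrects=("gen-001",)),
]

print(delayed_labels(LOG))
```

With links like `corrects` stored at write time, a delayed outcome can be joined back to the generation that caused it rather than inferred from free text after the fact.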

Exogeneity: Focusing on exogenous variation may address the confounding issue. Long-running A/B tests on model candidates may help to evaluate how well models satisfy longer-term objectives. Plausibly exogenous variation through natural experiments may guide evaluation and training.
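For a long-running A/B test, the basic estimate is a difference in mean long-horizon outcomes between randomly assigned arms, with a standard error. Here is a minimal sketch, assuming tasks were randomly assigned to a candidate or baseline model and that the outcome is whether the output survived 90 days without correction. The outcome definition and the data are hypothetical.

```python
import math
from statistics import mean, variance

# Hypothetical outcomes per randomly assigned task:
# 1 if the output survived 90 days without a correction, else 0.
OUTCOMES = {
    "candidate": [1, 1, 0, 1, 1, 0, 1, 1],
    "baseline": [1, 0, 0, 1, 0, 1, 0, 1],
}


def diff_in_means(treated, control):
    """Difference in mean outcomes and its standard error (unequal variances)."""
    diff = mean(treated) - mean(control)
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return diff, se


diff, se = diff_in_means(OUTCOMES["candidate"], OUTCOMES["baseline"])
print(diff, se)
```

In this toy data the estimated difference is about the same size as its standard error, so eight tasks per arm cannot distinguish the candidate from the baseline. Long-horizon tests need both patience and sample size.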

Relation to Outcome-Based Pricing

There has been growing interest in outcome-based pricing (or value-based pricing) for AI outputs.

An analogue of the reward horizon problem applies here. For outcomes observed in the short run, like resolved customer support tickets, such pricing strategies may be tenable. But for longer-run objectives, it is less clear how to scope and reward diverse outputs from a general-purpose input like AI.

Another challenge to outcome-based pricing is adverse selection. Suppose the world has both token-based and outcome-based pricing. An adopting firm may know more about the effort required to complete certain pieces of work than an outcome-based model provider does. As long as the firm retains an information edge, it may disproportionately outsource higher-effort work to the provider.
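A toy version of this selection effect: suppose the firm knows the token cost each task requires and the provider charges a flat price per outcome, set at the average cost across all tasks. The numbers are hypothetical, and the sketch assumes the provider's cost of serving a task equals its token cost.

```python
# Hypothetical tasks, each with the token cost the work actually requires.
TASK_COSTS = [2, 4, 6, 8, 10, 12, 14, 16]
OUTCOME_PRICE = 9.0  # flat price per completed task, set at the average cost


def mean(xs):
    return sum(xs) / len(xs)


def route(costs, price):
    """The informed firm sends a task to the outcome-priced provider only when that is cheaper."""
    outsourced = [c for c in costs if c > price]
    kept_on_tokens = [c for c in costs if c <= price]
    return outsourced, kept_on_tokens


outsourced, kept_on_tokens = route(TASK_COSTS, OUTCOME_PRICE)
provider_margin = OUTCOME_PRICE - mean(outsourced)
print(mean(TASK_COSTS), mean(outsourced), provider_margin)
```

The provider priced for the average task but receives only the above-average ones, so it loses money on every task it serves unless it raises the price or learns to estimate effort itself.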

Upshot

It is natural to use post-training rewards observed at short horizons. They facilitate faster training, may be less ambiguous, and can be cheaper to produce. However, these rewards may miss information about objectives that take longer to realize.

As we start to use models for more ambitious work, we implicitly delegate more judgment calls. These delegated judgments may influence the objectives we care about. The horizon mismatch between rewards and objectives calls for changes in how models are evaluated and trained.

I thank Dennis Lee, Annie Liang, and Mirac Suzgun for helpful discussions. Comments are very welcome. Get in touch.