The Reward Horizon Problem

May 31, 2026

When we train models, we use rewards observed at short horizons: Did the code run? Did the user click on the product? Did the email get a response?

When we use models, we care about objectives realized at longer horizons: Was the code valuable? Did the user like the product? Did the email produce a collaboration?

As we start using models for more ambitious work, this horizon mismatch between rewards and objectives will matter more. To make models more economically useful, we will need to do more to address the reward horizon problem.

Two Examples of Horizon Mismatch

Observability: Application to Code Stability

Here’s an example of how long it can take a merged PR to be reverted or corrected.

Time from merged pytest pull requests to later reversions or corrections

The data come from pytest-dev/pytest, which often attributes corrections in PR descriptions. I run a simple classifier on mentions of previous PRs and commits to find reversions and corrections of earlier PRs. To avoid right censoring, the denominator is all PRs merged on or before May 29, 2025.
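To make the measurement concrete, here is a minimal sketch of the two steps described above: flagging descriptions that mention an earlier PR or commit, and computing the share of PRs corrected within a horizon over a cutoff-restricted denominator. The regular expression, the function names, and the input layout are all assumptions for illustration; the post does not specify the classifier actually used, and a crude pattern like this one will produce false positives.

```python
import re
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical pattern (an assumption, not the post's classifier): a correction
# keyword followed, within the same sentence, by a PR reference such as #1234
# or by an abbreviated commit hash.
CORRECTION_RE = re.compile(
    r"\b(?:regress(?:ed|ion)|revert(?:s|ed)?)\b[^.\n]*?"
    r"(?:#(?P<pr>\d+)|\b(?P<sha>[0-9a-f]{7,40})\b)",
    re.IGNORECASE,
)


def correction_targets(description: str) -> List[str]:
    """Return PR numbers or commit hashes that a description appears to correct."""
    return [m.group("pr") or m.group("sha") for m in CORRECTION_RE.finditer(description)]


def share_corrected_within(
    prs: List[Tuple[date, Optional[date]]],
    horizon_days: int,
    cutoff: date,
) -> float:
    """Share of PRs merged on or before `cutoff` that were corrected within the horizon.

    Each element of `prs` is (merge date, correction date or None). Dropping PRs
    merged after the cutoff means every PR in the denominator has had time to be
    corrected, provided the horizon is no longer than the follow-up window.
    """
    eligible = [(merged, fixed) for merged, fixed in prs if merged <= cutoff]
    if not eligible:
        return 0.0
    hits = sum(
        1
        for merged, fixed in eligible
        if fixed is not None and (fixed - merged).days <= horizon_days
    )
    return hits / len(eligible)
```

Applied to the example correction quoted in this section, `correction_targets` returns `["fac8f28"]`. The cutoff matters because a recently merged PR with no correction yet is not the same as a PR that never needed one.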

Here’s an example correction:

bluetech commented on Mar 14, 2022 (#9768)

Regressed in fac8f28, didn’t notice since we don’t run tests in CI with -v.

This regression took weeks to observe and fix, at least through human review. A reward model that targets code passing tests at merge time may have treated the commit as stable.

We may be able to avoid some of these issues by pushing models to write different tests and consider more dependencies during code generation. In a loose sense, some of these mistakes may live “in the box” at generation time, and may not require subsequent usage to diagnose. In this case, long-horizon objectives could be more readily distilled to update the distribution of short-horizon rewards.

Some issues, however, may not be surfaced through introspection. They require feedback from the world.

Attribution: Application to Product Demand

Consider the following plot of pytest package downloads and major/minor releases.

Pytest package downloads with major and minor releases over time

I don’t mean to suggest that downloads are the only post-training objective, or even the right objective. But even for this naive example, some issues come to light:

Confounding: A naive read of download outcomes in this plot would attribute the large, end-of-sample increase in pytest downloads to the late 2025 package release, even though other factors that changed over the same period could also explain it.

Credit assignment and temporal attribution: Suppose we want models to produce software that generates more downloads.

Ways Forward

There is a rich body of work on the trajectory horizon problem, where it is difficult to propagate information and credit through a long trajectory (e.g., Sutton, 1984; Arumugam et al., 2021; Park et al., 2025).

The challenges raised in this post relate to the reward horizon problem, where we do not observe the objective we care about at generation time (e.g., Joulani et al., 2013). This problem has been studied in work on sycophancy (OpenAI, 2025), advertising (Chapelle, 2014), recommender systems (Mann et al., 2019; Kleinberg et al., 2022; Pan et al., 2024; Zhang et al., 2025), and social media (Cunningham, 2023). Even for short rollouts, the objective may be observed with a delay.

Comparison of trajectory horizon and reward horizon problems

I don’t mean to imply these are completely separate issues. One could recast the reward horizon problem as a trajectory horizon problem over more abstract episodes, with more confounding from the environment. But it may be useful to think more about specific challenges we encounter when reward and objective horizons differ.

Here are a few potential strategies to address the reward horizon problem:

Surrogate/Critic Models: One strategy is to distill long-horizon signals into surrogate or critic models (e.g., Ng et al., 1999; Mann et al., 2019; Athey et al., 2025). For outcomes that can be determined from information available at generation time, like some types of bugs, this may be a fruitful path. We can train classifiers of future outcomes and introduce these classifiers into training rewards. In the code stability example, such reward modifications may push models to explore more of the cross-codebase dependencies that a generated change could break. Two challenges to this strategy are limited coverage and reward hacking against the proxy (e.g., Karwowski et al., 2023).
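As a rough illustration of folding a surrogate into the reward, the sketch below subtracts a predicted probability of later correction from a short-horizon test-pass reward. Everything here is an assumption made for illustration: the logistic form, the feature vector, the weights, and the `penalty` coefficient are hypothetical, and in practice the surrogate would be fit on historical outcomes such as later reversions.

```python
import math
from typing import Sequence


def correction_probability(
    features: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Hypothetical logistic surrogate: probability a change is later corrected."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def blended_reward(
    tests_passed: bool,
    features: Sequence[float],
    weights: Sequence[float],
    bias: float,
    penalty: float = 1.0,
) -> float:
    """Short-horizon reward minus a penalty scaled by the surrogate's prediction."""
    short_horizon = 1.0 if tests_passed else 0.0
    return short_horizon - penalty * correction_probability(features, weights, bias)
```

With this form, two changes that both pass tests can receive different rewards if the surrogate rates one as more likely to be corrected later. The same structure also shows where proxy hacking enters: a policy can raise its reward by moving the features without lowering the true correction risk.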

Conditioning on Privileged Information: A related strategy is to condition a teacher model on privileged information about realized long-horizon objectives (e.g., Vapnik and Izmailov, 2015; Penaloza et al., 2026). Such distillation may produce richer feedback for models, but it risks pushing models toward example-specific behavior.

Attribution Infrastructure: Another strategy is to change record-keeping practices to make model generations more attributable. By storing past instructions, trajectories, and outputs, future changes may more cleanly refer to past decisions. Such data may provide more useful training signal.
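One way such record-keeping could look is sketched below: each generation is stored with its instruction, trajectory, and output under a stable identifier, and later outcome events point back to that identifier. The class names, fields, and in-memory storage are hypothetical choices for illustration, not a description of any existing system.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class GenerationRecord:
    """What the model was asked, what it did, and what it produced."""
    instruction: str
    trajectory: Tuple[str, ...]
    output: str
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass(frozen=True)
class OutcomeEvent:
    """A later observation (for example "reverted" or "corrected") tied to a record."""
    record_id: str
    kind: str
    observed_at: datetime


class AttributionLog:
    """In-memory store linking generations to outcomes observed afterwards."""

    def __init__(self) -> None:
        self._records: Dict[str, GenerationRecord] = {}
        self._outcomes: Dict[str, List[OutcomeEvent]] = {}

    def log_generation(self, record: GenerationRecord) -> None:
        self._records[record.record_id] = record
        self._outcomes.setdefault(record.record_id, [])

    def log_outcome(self, event: OutcomeEvent) -> None:
        if event.record_id not in self._records:
            raise KeyError(f"unknown record_id: {event.record_id}")
        self._outcomes[event.record_id].append(event)

    def outcomes_within(self, record_id: str, horizon: timedelta) -> List[OutcomeEvent]:
        """Outcomes observed no later than `horizon` after the generation."""
        created = self._records[record_id].created_at
        return [e for e in self._outcomes[record_id] if e.observed_at - created <= horizon]
```

Querying the same record at several horizons makes the mismatch visible: a generation with no recorded problems after a week may still carry a correction event after a month.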

Exogeneity: Focusing on exogenous variation may address the confounding issue. One way forward is through experimentation: Long-running A/B tests on model candidates may help to evaluate how well models satisfy longer-term objectives. Another way forward is to find plausibly exogenous variation through natural experiments that may guide evaluation and training.
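For the experimentation route, a minimal sketch of the analysis is a randomized split of units across two model candidates followed by a difference in mean long-horizon outcomes. The function names, the 50/50 split, and the choice of an unequal-variance standard error are assumptions for illustration; a real long-running test would also need to handle attrition, interference between units, and outcomes that are still unrealized when the analysis runs.

```python
import math
import random
from statistics import mean, variance
from typing import Dict, List, Sequence, Tuple


def randomize(unit_ids: Sequence[str], seed: int = 0) -> Dict[str, List[str]]:
    """Randomly split units into two arms, one per model candidate."""
    rng = random.Random(seed)
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "treatment": shuffled[half:]}


def diff_in_means(
    control: Sequence[float], treatment: Sequence[float]
) -> Tuple[float, float]:
    """Difference in mean outcomes and its unequal-variance standard error.

    Each arm needs at least two observations for the sample variance.
    """
    estimate = mean(treatment) - mean(control)
    std_error = math.sqrt(
        variance(treatment) / len(treatment) + variance(control) / len(control)
    )
    return estimate, std_error
```

Because assignment is random, factors such as seasonality or a coincident release affect both arms alike, which is what a before-and-after comparison of a single series cannot guarantee.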

Relation to Outcome-Based Pricing

There has been growing interest in outcome-based pricing (or value-based pricing) for AI outputs.

An analogue of the reward horizon problem applies here. For outcomes observed in the short run, like resolved customer support tickets, such pricing strategies may be tenable. But for longer-run objectives, it is less clear how to scope and reward diverse outputs from a general-purpose input like AI.

Another challenge to outcome-based pricing is adverse selection. Suppose the world has both token-based and outcome-based pricing. An adopting firm may know more about the effort required to complete certain pieces of work than an outcome-based model provider does. As long as the firm retains an information edge, it may disproportionately outsource higher-effort work to the provider.
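A toy calculation can illustrate the selection effect. Assume, purely for illustration, that the firm knows each task's effort, that token-based cost scales linearly with effort, and that the outcome-based provider charges a flat price per completed task. The firm then sends the provider only the tasks for which the flat price is the cheaper option. All names and numbers below are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def outsourced_tasks(
    efforts: Sequence[float], flat_price: float, token_price_per_effort: float = 1.0
) -> List[float]:
    """Tasks the firm sends to the outcome-priced provider.

    Assumes the firm outsources a task only when the flat price is below what
    the task would cost under token-based pricing.
    """
    return [e for e in efforts if e * token_price_per_effort > flat_price]


def provider_margin_per_task(
    efforts: Sequence[float],
    flat_price: float,
    provider_cost_per_effort: float = 1.0,
    token_price_per_effort: float = 1.0,
) -> float:
    """Average provider margin on the tasks it actually receives."""
    sent = outsourced_tasks(efforts, flat_price, token_price_per_effort)
    if not sent:
        return 0.0
    return flat_price - provider_cost_per_effort * mean(sent)


efforts = list(range(1, 11))     # ten tasks with effort 1 through 10
flat_price = mean(efforts)       # provider prices at the average effort, 5.5
sent = outsourced_tasks(efforts, flat_price)
margin = provider_margin_per_task(efforts, flat_price)
```

In this toy setup the provider prices at the population average of 5.5 but receives only the tasks with effort 6 through 10, whose average is 8, so its margin per task is negative. The numbers are arbitrary; the point is only that the received mix differs from the mix the price was set on.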

Upshot

It is natural to use post-training rewards observed at short horizons. They facilitate faster training, may be less ambiguous, and can be cheaper to produce. However, these rewards may miss information about objectives that take longer to realize.

As we start to use models for more ambitious work, we implicitly delegate more judgment calls. These delegated judgments may influence the long-horizon objectives we care about. The horizon mismatch between rewards and objectives calls for changes in how models are evaluated and trained.

Comments are very welcome! Get in touch.