LLM-Guided Future Hypotheses for Horizon-Aware Exploration in Multi-Step Robot Manipulation

Mohammad Khoshnazar · Andrew Melnik
University of Bremen · Institute of Artificial Intelligence
We study whether short-horizon future videos can help robot manipulation policies during execution and RL fine-tuning when the future available at test time is generated, imperfect, or temporally shifted.

Abstract

Multi-step robot manipulation requires acting under uncertainty about how the scene will evolve. Tasks such as opening drawers, moving sliders, turning switches, and pushing objects into containers depend on contact timing, articulated object motion, and delayed task progress. We study whether short-horizon future videos can provide useful conditioning signals for closed-loop robot control and reinforcement-learning fine-tuning. We formulate this setting as Future-Experience Conditioning (FEC), where a policy receives recent observations together with a compact latent representation of a short future video. During training, this future signal can be taken from demonstrations. At inference time, ground-truth futures are not available, so we evaluate generated futures and intentionally mismatched futures to study the effect of imperfect future supervision. Our experiments are conducted in simulation on RoboCasa and CALVIN across multiple manipulation tasks. We compare behavior cloning, behavior cloning with RL fine-tuning, and a future-conditioned policy baseline under no-future, ground-truth future, generated future, and wrong-future conditions. The results show that task-consistent future conditioning can improve policy execution and RL adaptation in our simulated setting, while mismatched futures can strongly hurt performance. We do not claim real-world transfer or end-to-end digital-twin construction from raw sensory input; instead, this work focuses on the role of short-horizon visual future signals as structured priors for manipulation policies under controlled future-source mismatch.

Step-by-step future-conditioning pipeline

Command + Obs: current RGB + task text; input at time t
Task Grounding: target object / part; desired state change
Future Rollout: short-horizon object motion cue
Future Video: generated visual future signal
Future Tokens: encode + temporal bins; project to future latent
RAFC Gate: null + shifted futures; reduce timing sensitivity
Policy: recent observations + reliable future latent; closed-loop action
1. Define the task: The command and current observation are used to identify the intended manipulation target and desired object-state change.
2. Build a short future: A short-horizon future signal describes how the relevant object or articulated part should evolve.
3. Compress the future: The future video is encoded, pooled into temporal bins, and projected into a compact future-conditioning vector.
4. Reliability-aware mixing: RAFC compares the null future with shifted future candidates and gates how much the policy should trust the future signal.
5. Use it in closed loop: The policy uses recent observations and the reliable future latent to generate actions during closed-loop execution.

Approach

We study control under imperfect future supervision, where the future signal used during training is not identical to the future available at inference time. The policy is trained and evaluated with a compact representation of a short future video, while the source of that future can vary. This lets us compare no-future conditioning, ground-truth futures, generated futures, and intentionally wrong futures under the same control interface.
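The four future conditions can be pictured as one selection step placed in front of a shared control interface. The sketch below is only an illustration under stated assumptions: the names `FutureSource` and `select_future` are hypothetical, clips are stand-in lists of frame labels, and the "wrong" condition is assumed here to borrow the demonstration future of a different task, which the text above does not specify.

```python
from enum import Enum
from typing import Dict, List, Optional

Clip = List[str]  # stand-in for a short list of future frames


class FutureSource(Enum):
    """Hypothetical labels for the four evaluation conditions."""
    NONE = "no_future"
    GROUND_TRUTH = "ground_truth"
    GENERATED = "generated"
    WRONG = "wrong"


def select_future(
    source: FutureSource,
    task: str,
    demo_futures: Dict[str, Clip],
    generated_futures: Dict[str, Clip],
) -> Optional[Clip]:
    """Return the future clip the policy is conditioned on, or None."""
    if source is FutureSource.NONE:
        return None
    if source is FutureSource.GROUND_TRUTH:
        return demo_futures[task]
    if source is FutureSource.GENERATED:
        return generated_futures[task]
    # WRONG (assumption): use the demonstration future of another task,
    # chosen deterministically so the sketch is reproducible.
    others = sorted(name for name in demo_futures if name != task)
    if not others:
        raise ValueError("a wrong future needs at least one other task")
    return demo_futures[others[0]]
```

Because every condition returns either a clip or `None`, the downstream encoder and policy can stay unchanged while only the future source varies.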

Task grounding and future construction

Given a task command and current scene information, the system creates a structured description of the manipulation goal, including the target object, relevant interaction part, and desired state transition. This description is used to construct a short-horizon visual future that represents the intended evolution of the task. In the current paper, this is studied in simulation and is not presented as a full real-world scene reconstruction system.

Generated future signal

At inference time, ground-truth future frames from demonstrations are not available. We therefore evaluate generated future clips as imperfect test-time conditioning signals. These clips are intended to provide useful cues about upcoming object motion and interaction timing, but they can also contain errors or temporal mismatch. This makes them useful for studying how policies behave under realistic future-source mismatch.

Future-Experience Conditioning (FEC)

The future clip is encoded into per-frame embeddings, compressed into temporal bins, and projected into a fixed-dimensional future latent. This compact representation provides a common interface between future video signals and policy learning. We also test temporally shifted futures to evaluate how sensitive the policy is to timing mismatch in the future signal.
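The bin-and-project step can be sketched in a few lines. This is a minimal illustration, not the paper's encoder: it assumes per-frame embeddings are already available, uses mean pooling over contiguous bins, and applies a single linear projection. The function names, the pooling choice, and the weights are all assumptions, and plain Python lists are used only to keep the example dependency-free.

```python
from typing import List, Sequence

Vector = List[float]


def pool_into_bins(frame_embeddings: Sequence[Vector], num_bins: int) -> List[Vector]:
    """Mean-pool per-frame embeddings into contiguous temporal bins."""
    num_frames = len(frame_embeddings)
    if num_bins <= 0 or num_frames < num_bins:
        raise ValueError("need at least one frame per bin")
    dim = len(frame_embeddings[0])
    bins: List[Vector] = []
    for b in range(num_bins):
        # Integer arithmetic spreads frames as evenly as possible over bins.
        start = b * num_frames // num_bins
        end = (b + 1) * num_frames // num_bins
        chunk = frame_embeddings[start:end]
        bins.append([sum(frame[d] for frame in chunk) / len(chunk) for d in range(dim)])
    return bins


def project(flat: Vector, weights: Sequence[Vector], bias: Vector) -> Vector:
    """Linear projection: out[i] = dot(weights[i], flat) + bias[i]."""
    return [sum(w * x for w, x in zip(row, flat)) + b for row, b in zip(weights, bias)]


def future_latent(
    frame_embeddings: Sequence[Vector],
    num_bins: int,
    weights: Sequence[Vector],
    bias: Vector,
) -> Vector:
    """Pool into bins, flatten in temporal order, and project to a fixed size."""
    bins = pool_into_bins(frame_embeddings, num_bins)
    flat = [value for bin_vector in bins for value in bin_vector]
    return project(flat, weights, bias)
```

The output length depends only on the number of rows in `weights`, so clips of different lengths map to the same latent size as long as each has at least `num_bins` frames.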

Reliability-Aware Future Conditioning (RAFC)

RAFC adds a lightweight reliability gate over a null future and temporally shifted future candidates. It is designed to reduce sensitivity to partially correct but misaligned generated futures, rather than to solve completely wrong future hypotheses.
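One way to picture such a gate is a softmax mixture over the null future and the shifted candidates. The sketch below is an assumption-laden illustration rather than the RAFC implementation: the per-candidate `scores` are taken as given (for example from a hypothetical learned reliability scorer), and the softmax mixing rule and the name `gate_future` are not from the paper.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_future(
    null_latent: Vector,
    shifted_latents: Sequence[Vector],
    scores: Sequence[float],
) -> Tuple[Vector, List[float]]:
    """Mix the null future with shifted candidates.

    scores[0] belongs to the null future; scores[1:] follow the order of
    shifted_latents. Higher scores mean the candidate is trusted more.
    """
    candidates = [null_latent, *shifted_latents]
    if len(scores) != len(candidates):
        raise ValueError("expected one score per candidate, including the null future")
    weights = softmax(scores)
    dim = len(null_latent)
    mixed = [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]
    return mixed, weights
```

When the null future receives most of the weight, the mixed latent collapses toward the no-future input, which is one way a gate of this kind could limit the damage from a misaligned generated future.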

Future-conditioned policy

The policy receives recent observations together with the compact future latent and outputs continuous actions in closed loop. We evaluate behavior cloning, behavior cloning with reinforcement-learning fine-tuning, and a future-conditioned comparison baseline. The study does not argue for a single universal controller; it focuses on how future conditioning affects execution and adaptation under controlled future conditions.
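The closed-loop interface described above can be sketched as a rollout that keeps a bounded history of observations and queries the policy at every step. Everything in this example is hypothetical: `run_closed_loop`, the one-dimensional slider environment, and the hand-written `toy_policy` stand in for the learned policy and the simulators, and the future latent is reduced to a single target position.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Vector = List[float]
Policy = Callable[[Sequence[Vector], Vector], Vector]


def run_closed_loop(
    policy: Policy,
    step_env: Callable[[Vector], Vector],
    first_obs: Vector,
    future_latent: Vector,
    history_len: int,
    num_steps: int,
) -> List[Vector]:
    """Query the policy each step with recent observations and the future latent."""
    history: Deque[Vector] = deque([first_obs], maxlen=history_len)
    actions: List[Vector] = []
    for _ in range(num_steps):
        action = policy(list(history), future_latent)
        actions.append(action)
        history.append(step_env(action))  # oldest observation drops out when full
    return actions


def make_slider_env(start: float) -> Callable[[Vector], Vector]:
    """Toy 1-D slider: the action is a displacement, the observation is the position."""
    state = {"pos": start}

    def step(action: Vector) -> Vector:
        state["pos"] += action[0]
        return [state["pos"]]

    return step


def toy_policy(history: Sequence[Vector], future_latent: Vector) -> Vector:
    """Move toward the position encoded in the future latent, at most 0.5 per step."""
    error = future_latent[0] - history[-1][0]
    return [max(-0.5, min(0.5, error))]


actions = run_closed_loop(
    toy_policy, make_slider_env(0.0), [0.0], [1.2], history_len=3, num_steps=4
)
```

In this toy run the slider takes two clipped steps, one smaller corrective step, and then stops near the target, which shows how feedback from new observations corrects the motion even though the future latent stays fixed.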

Results

BibTeX

@misc{khoshnazar2026futurehypotheses,
  title   = {LLM-Guided Future Hypotheses for Horizon-Aware Exploration in Multi-Step Robot Manipulation},
  author  = {Khoshnazar, Mohammad and Melnik, Andrew and Beetz, Michael},
  year    = {2026},
  url     = {https://enact2026.github.io/}
}