We evaluate HIQL by varying only the high-level policy while keeping the low-level policy fixed. With the learned high-level policy, performance drops, whereas the oracle high-level policy achieves high success rates, indicating that the high-level policy is the main bottleneck.
As the distance between $s_t$ and $g$ increases, the value estimates become increasingly erroneous, leading to an imprecise evaluation of the high-level advantage.
By using temporally extended actions in planning, we reduce the effective horizon length, i.e., the number of planning steps, to approximately $d^\star(s_t, g)/n$, where $n$ is the option length. Specifically, we modify the reward and the target value to be option-aware, ensuring that the high-level value $V^h$ is suitable for long-horizon planning.
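The option-aware value update can be illustrated with a minimal PyTorch sketch, not the authors' implementation: it assumes a goal-conditioned value network `value_net(s, g)` trained with IQL-style expectile regression as in HIQL, an option length `n`, and illustrative batch fields (`obs`, `goal`, `obs_n`, `reached`). The key change is that the reward and bootstrap target are defined at the option level (one backup per `n` primitive steps), which shrinks the effective horizon to roughly $d^\star(s_t, g)/n$.

```python
# Minimal sketch of an option-aware value loss (assumptions noted above).
import torch

def expectile_loss(diff, tau=0.7):
    # Asymmetric L2 loss used for expectile regression (as in IQL/HIQL).
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff ** 2).mean()

def option_aware_value_loss(value_net, target_value_net, batch,
                            gamma=0.99, tau=0.7):
    """One backup per option: the reward and bootstrap are defined over the
    n-step option transition (s_t -> s_{t+n}), not a single env step."""
    s, g = batch["obs"], batch["goal"]    # current state, commanded goal
    s_next = batch["obs_n"]               # state n primitive steps later
    reached = batch["reached"]            # 1.0 if g was reached within the option

    # Option-level sparse reward: -1 per option until the goal is reached
    # (the exact reward convention here is an assumption of this sketch).
    r_option = -(1.0 - reached)

    with torch.no_grad():
        # Discounting is applied per option step, so the learned value reflects
        # the number of options (not primitive steps) remaining to the goal.
        v_next = target_value_net(s_next, g)
        target = r_option + gamma * (1.0 - reached) * v_next

    v = value_net(s, g)
    return expectile_loss(target - v, tau)
```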
[Qualitative rollout comparisons: HIQL ❌ vs. OTA (Ours) ✅]
@inproceedings{ota2025,
title={Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning},
author={Ahn, Hongjoon and Choi, Heewoong and Han, Jisu and Moon, Taesup},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025},
}