We evaluate HIQL by varying only the high-level policy while keeping the low-level policy fixed. With the learned high-level policy, performance drops, whereas the oracle high-level policy achieves high success rates, indicating that the high-level policy is the main bottleneck.
As the distance between $s_t$ and $g$ increases, the value estimates become increasingly erroneous, leading to an imprecise evaluation of the high-level advantage.
By using temporally extended actions in planning, we reduce the effective horizon length, i.e., the number of planning steps, to approximately $d^\star(s_t, g)/n$, where $n$ is the option length. Specifically, we modify the reward and the target value to be option-aware, ensuring that the high-level value $V^h$ is suitable for long-horizon planning.
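The option-aware value update can be illustrated with a minimal PyTorch sketch, not the authors' implementation: it assumes a goal-conditioned value network `value_net(s, g)` trained with IQL-style expectile regression as in HIQL, an option length `n`, and illustrative batch fields (`obs`, `goal`, `obs_n`, `reached`). The key change is that the reward and bootstrap target are defined at the option level (one backup per `n` primitive steps), which shrinks the effective horizon to roughly $d^\star(s_t, g)/n$.

```python
# Minimal sketch of an option-aware value loss (assumptions noted above).
import torch

def expectile_loss(diff, tau=0.7):
    # Asymmetric L2 loss used for expectile regression (as in IQL/HIQL).
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff ** 2).mean()

def option_aware_value_loss(value_net, target_value_net, batch,
                            gamma=0.99, tau=0.7):
    """One backup per option: the reward and bootstrap are defined over the
    n-step option transition (s_t -> s_{t+n}), not a single env step."""
    s, g = batch["obs"], batch["goal"]    # current state, commanded goal
    s_next = batch["obs_n"]               # state n primitive steps later
    reached = batch["reached"]            # 1.0 if g was reached within the option

    # Option-level sparse reward: -1 per option until the goal is reached
    # (the exact reward convention here is an assumption of this sketch).
    r_option = -(1.0 - reached)

    with torch.no_grad():
        # Discounting is applied per option step, so the learned value reflects
        # the number of options (not primitive steps) remaining to the goal.
        v_next = target_value_net(s_next, g)
        target = r_option + gamma * (1.0 - reached) * v_next

    v = value_net(s, g)
    return expectile_loss(target - v, tau)
```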
[Qualitative rollout comparisons: HIQL ❌ vs. OTA (Ours) ✅]
@inproceedings{ota2025,
title={Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning},
author={Ahn, Hongjoon and Choi, Heewoong and Han, Jisu and Moon, Taesup},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025},
}