Offline Goal Conditioned Reinforcement Learning with Temporal Distance Representations

Vivek Myers 1 Bill Chunyuan Zheng 1 Benjamin Eysenbach 2 Sergey Levine 1
1 UC Berkeley 2 Princeton University

Abstract

Learned successor features provide a powerful framework for learning goal-reaching policies. These representations are constructed such that similarity in the representation space predicts future outcomes, allowing goal-reaching policies to be extracted. However, representations learned for forward inference have practical limitations: stitching of behaviors does not arise naturally from forward objectives like contrastive classification, and additional regularization is required to enable valid policy extraction. In this work, we propose a new representation learning objective that enables extraction of goal-reaching policies. We show that when combined with an existing quasimetric network parameterization and the right invariances, these representations let us learn optimal goal-reaching policies from offline data. On existing offline GCRL benchmarks, our representation learning objective improves performance while using a simpler algorithm and fewer independent networks and parameters than past methods.

Example trajectory in the antmaze environment.

Temporal Metric Distillation (TMD)

Comparison with prior goal-conditioned RL methods. Only TMD is able to use quasimetric architectures to learn optimal goal-reaching policies and distances under arbitrary stochastic dynamics.
Full Implementation

To learn a policy from this distance parameterization, we can simply use an off-the-shelf policy extraction method and minimize the actor loss: $\pi^{*} = \arg\min_{\pi} \mathbb{E}_{\{s_{i},a_{i},s_{i}',g_{i}\}_{i=1}^{N} \sim \pi_{\beta}} \Bigl[ \sum_{i,j=1}^{N} d_{\theta}\bigl((s_{i},\pi(s_{i},g_{j})),g_{j}\bigr) \Bigr]$.
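As a concrete illustration, below is a minimal JAX sketch of this actor objective. The names `d_theta`, `pi`, and the batch layout are hypothetical placeholders for the learned quasimetric distance and policy network; this is a sketch of the loss above, not the exact training code.

```python
# Minimal sketch of the actor loss (assumed names; distance d_theta is held fixed).
import jax
import jax.numpy as jnp


def actor_loss(pi_params, d_theta, pi, batch):
    """Mean quasimetric distance d_theta((s_i, pi(s_i, g_j)), g_j) over all state-goal pairs."""
    s, g = batch["observations"], batch["goals"]    # shapes (N, obs_dim), (N, goal_dim)
    n = s.shape[0]
    s_rep = jnp.repeat(s, n, axis=0)                # (N*N, obs_dim): pair every state ...
    g_rep = jnp.tile(g, (n, 1))                     # (N*N, goal_dim): ... with every goal
    a = pi(pi_params, s_rep, g_rep)                 # policy actions for each (s_i, g_j) pair
    return jnp.mean(d_theta(s_rep, a, g_rep))       # distance the policy should drive down


# Only the policy parameters are updated; gradients do not flow into d_theta here.
actor_grad_fn = jax.grad(actor_loss)
```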

What Do These Invariances Mean?

An ideal distance metric should obey two properties: the triangle inequality and action invariance, which together correspond to Bellman optimality.

The $\mathcal{T}$-invariance represents Bellman consistency in temporal distance learning. Instead of enforcing Bellman consistency directly, we use the following objective to enforce the same constraint on the quasimetric architecture:

$ e^{-d_{\mathrm{MRN}}\bigl(\phi(s,a),\,\psi(g)\bigr)} \;\leftarrow\; \mathbb{E}_{s'\sim P(\,\cdot\mid s,a)}\Bigl[ e^{\log\gamma \;-\; d_{\mathrm{MRN}}\bigl(\psi(s'),\,\psi(g)\bigr)} \Bigr].$
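A single-sample version of this backup can be written as a regression in log space. The sketch below (with assumed names `d_mrn`, `phi`, `psi`, and batch keys) bootstraps from a stop-gradient target; it omits the terminal goal-reached case and any treatment of stochasticity beyond the single sampled $s'$, so it only illustrates the constraint rather than the paper's exact estimator.

```python
# Illustrative single-sample T-invariance backup (assumed names, not the exact objective).
import jax
import jax.numpy as jnp


def t_invariance_loss(params, d_mrn, phi, psi, batch, gamma=0.99):
    """Push -d_mrn(phi(s, a), psi(g)) toward log(gamma) - d_mrn(psi(s'), psi(g)),
    treating the bootstrapped right-hand side as a fixed (stop-gradient) target."""
    s, a = batch["observations"], batch["actions"]
    s_next, g = batch["next_observations"], batch["goals"]
    lhs = -d_mrn(phi(params, s, a), psi(params, g))
    target = jnp.log(gamma) - d_mrn(psi(params, s_next), psi(params, g))
    target = jax.lax.stop_gradient(target)          # do not backprop through the backup target
    return jnp.mean((lhs - target) ** 2)
```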

The $\mathcal{I}$-invariance represents how the optimal value and Q-functions relate: $V^{*}(s) = \max_{a \in A} Q^{*}(s,a)$. To enforce this under a quasimetric architecture, we minimize the MRN distance between the state encoding and the state-action encoding:

$ \mathcal{L}_{\mathcal{I}}\bigl(\phi, \psi;\{s_i,a_i,s'_i,g_i\}_{i=1}^N\bigr) = \sum_{i=1}^N \sum_{j=1}^N d_{\mathrm{MRN}}\bigl(\psi(s_i),\,\phi(s_i,a_j)\bigr) $.
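The sketch below illustrates this loss over all $(i, j)$ pairs in a batch. The `d_mrn` shown is one common MRN-style parameterization (a symmetric Euclidean term on half the latent plus an asymmetric max-ReLU residual on the other half); the exact quasimetric form used by TMD may differ, and `phi`, `psi`, and the batch keys are assumed names.

```python
# Illustrative I-invariance loss with an assumed MRN-style quasimetric.
import jax
import jax.numpy as jnp


def d_mrn(x, y):
    """Assumed MRN-style quasimetric: Euclidean distance on the first half of the
    latent plus a max-ReLU residual on the second half (exact form may differ)."""
    k = x.shape[-1] // 2
    sym = jnp.sqrt(jnp.sum((x[..., :k] - y[..., :k]) ** 2, axis=-1) + 1e-8)
    asym = jnp.max(jax.nn.relu(x[..., k:] - y[..., k:]), axis=-1)
    return sym + asym


def i_invariance_loss(params, phi, psi, batch):
    """Minimize d_mrn(psi(s_i), phi(s_i, a_j)) over all (i, j) pairs in the batch,
    mirroring V*(s) = max_a Q*(s, a) in the quasimetric parameterization."""
    s, a = batch["observations"], batch["actions"]  # (N, obs_dim), (N, act_dim)
    z_s = psi(params, s)                            # (N, D) state encodings
    # phi(s_i, a_j): broadcast each action a_j across every state in the batch.
    z_sa = jax.vmap(
        lambda a_j: phi(params, s, jnp.broadcast_to(a_j, (s.shape[0],) + a_j.shape))
    )(a)                                            # (N, N, D), entry [j, i] = phi(s_i, a_j)
    return jnp.mean(d_mrn(z_s[None], z_sa))         # average over all (i, j) pairs
```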

Empirical Evaluation

We evaluate TMD on OGBench across a collection of 80 locomotion and manipulation tasks. TMD consistently outperforms TD-based methods such as GCIQL as well as distance-learning methods such as QRL and CMD.


We ablate the loss components of TMD in the pointmaze_teleport_stitch environment.

We additionally ablate our method, demonstrating that all three loss components are needed to learn a distance metric strong enough for policy learning.

Ablating the objectives in the $\mathcal{T}$-backup experiment.