
1. Triage

One-Sentence Summary: This paper introduces RL-100, a three-stage (Imitation Learning, iterative offline RL, online RL) real-world training framework that fine-tunes diffusion visuomotor policies to achieve perfect 100% success on seven diverse robotic manipulation tasks[cite: 1, 5, 6, 7, 8, 11].


2. Technical Deep Dive

🤖 Methodology

The core of RL-100 is a three-stage pipeline designed to leverage human priors and then reliably exceed them through reinforcement learning, culminating in a “deployment-ready” policy[cite: 5, 14].

  1. Stage 1: Imitation Learning (IL). A conditional diffusion policy is first trained via behavior cloning (BC) on a dataset of human-teleoperated demonstrations[cite: 6, 178]. This provides a competent, low-variance “base policy” that captures human strategies[cite: 43].

  2. Stage 2: Iterative Offline RL. This stage provides the “bulk of the improvement”[cite: 44]. It’s a virtuous cycle (summarized in Algorithm 1 [cite: 316]) that iteratively:
    • Improves Policy: Fine-tunes the policy using a PPO-style objective with conservative updates[cite: 7]. To ensure stability, this update is “gated” by an Offline Policy Evaluation (OPE) procedure (AM-Q), which only accepts the new policy if it’s demonstrably better[cite: 7, 278].
    • Collects Data: The improved policy is used to roll out and collect new, higher-quality trajectories[cite: 316].
    • Expands Dataset: This new data is merged with the existing dataset[cite: 316].
    • Re-trains IL Policy: Critically, a new IL policy is re-trained from scratch on this expanded dataset. The authors state this is “crucial” for stability and to “distill” the RL improvements into a unified, robust policy[cite: 316, 317, 321].
  3. Stage 3: Online RL. This is a final, brief “last-mile” phase[cite: 45]. It uses standard on-policy online RL (PPO-style) to eliminate rare, residual failure modes and push the success rate from high (e.g., 95%) to near-perfect[cite: 8, 45, 47].

Consistency Distillation (CM): Running parallel to the RL fine-tuning, the framework trains an additional lightweight consistency model[cite: 8, 296]. This model is trained to distill the knowledge of the multi-step (e.g., K=5-10) diffusion policy into a single-step policy[cite: 8, 296, 1100]. This is vital for deployment, as it reduces inference latency by an order of magnitude (e.g., from 100ms to 10ms) while preserving the 100% task performance[cite: 8, 310, 613].

🔄 Algorithm Loop

The main training pipeline is detailed in Algorithm 1[cite: 316]:

  1. Input: Initial human demonstrations $\mathcal{D}_0$.
  2. Initialize: Train an IL diffusion policy $\pi_0^{\Pi}$ on $\mathcal{D}_0$.
  3. For $m=0$ to $M-1$ (Iterative Offline Loop):
    • Train Critics & Model: Learn critics $Q_{\psi_m}$, $V_{\psi_m}$ (IQL-style) and a transition model $T_{\theta_m}$ on the current dataset $\mathcal{D}_m$[cite: 276, 316].
    • Offline RL Improvement: Optimize the policy $\pi_m^{\text{ddim}}$ using the unified PPO-style objective (see Eq. 18 below) with IQL-based advantages $A_t^{\text{off}}$[cite: 275]. This update is gated by an OPE check (AM-Q) to ensure conservative, monotonic improvement[cite: 278, 281].
    • Data Expansion: Deploy the improved policy $\pi_m^{\text{ddim}}$ to collect new rollout data $\mathcal{D}_{\text{new}}$[cite: 316].
    • Merge Data: $\mathcal{D}_{m+1} \leftarrow \mathcal{D}_m \cup \mathcal{D}_{\text{new}}$[cite: 316].
    • IL Re-training (Distillation): Train a new base policy $\pi_{m+1}^{\Pi}$ via imitation learning on the entire expanded dataset $\mathcal{D}_{m+1}$[cite: 316, 317].
  4. Final Online Fine-tuning:
    • Take the final iterated policy $\pi_{M-1}^{\text{ddim}}$ and its critics.
    • Fine-tune it using on-policy online RL with the same PPO objective, but using GAE-based advantages $A_t^{\text{on}}$[cite: 287, 316]. This produces $\pi_{\text{final}}^{\text{ddim}}$.
  5. Output: The final K-step policy $\pi_{\text{final}}^{\text{ddim}}$ and its single-step distilled consistency model (CM) counterpart $\pi_{\text{final}}^{\text{cm}}$[cite: 316].
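
As a reading aid, here is a minimal Python-style sketch of this loop. It paraphrases Algorithm 1 as described above; every helper function (`train_il`, `train_critics`, `offline_ppo_update`, `ope_gate_accepts`, `collect_rollouts`, etc.) is a hypothetical placeholder, not the authors' code.

```python
# Minimal sketch of the RL-100 pipeline (Algorithm 1). Every helper below is a
# hypothetical placeholder for the corresponding component described above.

def rl100_pipeline(D0, M, train_il, train_critics, train_transition_model,
                   offline_ppo_update, ope_gate_accepts, collect_rollouts,
                   online_ppo_finetune, distill_consistency_model):
    D = list(D0)
    pi = train_il(D)                                  # Stage 1: IL diffusion policy pi_0

    for m in range(M):                                # Stage 2: iterative offline RL
        Q, V = train_critics(D)                       # IQL-style Q_psi, V_psi
        T = train_transition_model(D)                 # dynamics model for OPE
        candidate = offline_ppo_update(pi, D, Q, V)   # PPO-style update with A = Q - V
        if ope_gate_accepts(candidate, pi, T, Q):     # AM-Q gate: conservative acceptance
            pi = candidate
        D_new = collect_rollouts(pi)                  # deploy to gather higher-quality data
        D = D + list(D_new)                           # expand the dataset
        pi = train_il(D)                              # re-train the IL base policy

    pi_final = online_ppo_finetune(pi, Q, V)          # Stage 3: on-policy RL (GAE advantages)
    cm_final = distill_consistency_model(pi_final)    # one-step deployment policy
    return pi_final, cm_final
```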

3. RL-Specific Analysis

🧠 Network Architecture

The framework is “agnostic” to representation and embodiment, but specific architectures are used[cite: 9, 71]:

  • Visual Encoder ($\phi$):
    • Supports 2D RGB (using pretrained ResNet/ViT) or 3D Point Clouds (using an adapted DP3 encoder)[cite: 48, 229, 230].
    • The encoder is trained with auxiliary reconstruction ($\mathcal{L}_{\text{recon}}$) and variational information bottleneck ($\mathcal{L}_{\text{KL}}$) losses to ensure stable, “drift-resistant” features during RL fine-tuning [cite: 75, 232-238].
  • Policy ($\pi_{\theta}$):
    • This is a conditional diffusion model[cite: 179].
    • The architecture backbone is shared, but the head and control mode are task-dependent:
      • Action-Chunk Control: Uses a U-Net backbone to predict a chunk of $n_c$ future actions ($[u_t, …, u_{t+n_c-1}]$)[cite: 60, 213, 223]. Used for precision tasks like Unscrewing[cite: 88, 244].
      • Single-Step Control: Uses a Skip-Net backbone to predict a single action $u_t$[cite: 59, 209, 223]. Used for reactive tasks like Agile Bowling[cite: 88, 244].
  • Critics: The framework learns a Q-network ($Q_{\psi}$) and a V-network ($V_{\psi}$), which are used to compute the advantage $A_t$ in both offline and online stages[cite: 275, 276, 287].
  • Consistency Model ($c_w$): A separate network (sharing the encoder) that is trained to output a clean action in a single step[cite: 296].
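
To illustrate the encoder regularization described above, here is a minimal PyTorch-style sketch of a feature encoder with a variational (KL) bottleneck and a reconstruction head trained alongside the RL loss. The layer sizes, module layout, and the fact that it operates on pre-extracted features are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckEncoder(nn.Module):
    """Illustrative encoder with a variational information bottleneck and a decoder head."""
    def __init__(self, feat_dim=512, latent_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, latent_dim)
        self.logvar_head = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, feat_dim))

    def forward(self, obs_features):
        # obs_features: features from the 2D (ResNet/ViT) or 3D (DP3-style) encoder.
        h = self.backbone(obs_features)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon = self.decoder(z)
        # Auxiliary losses that discourage representation drift during RL fine-tuning.
        l_recon = F.mse_loss(recon, obs_features)
        l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, l_recon, l_kl
```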

📉 Objective & Loss Function

The core innovation is a unified PPO-style policy gradient objective applied to the K-step diffusion denoising process[cite: 52, 176, 260].

The total loss during RL fine-tuning is a combination of the RL objective and the consistency distillation loss[cite: 299]:

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{RL} + \lambda_{CD} \cdot \mathcal{L}_{CD}\]
  1. RL Policy Loss ($\mathcal{L}_{RL}$): This is the negative of the PPO surrogate objective (from Eq. 18), which is summed over the $K$ denoising steps[cite: 262, 264]:

    \[\mathcal{L}_{RL} = -J_i(\pi) = -\mathbb{E}_{s_t \sim \rho_{\pi_i}, a_t \sim \pi_i} \left[ \sum_{k=1}^{K} \min \left( r_k(\pi) A_t, \text{clip}(r_k(\pi), 1-\epsilon, 1+\epsilon) A_t \right) \right]\]
    • Key Insight: The same environment-level advantage $A_t$ is applied to every denoising step $k$ (from $k=1$ to $K$)[cite: 269]. This provides a dense learning signal to the entire denoising chain based on a single environment outcome.
    • The advantage $A_t$ is computed differently in each phase:
      • Offline Stage: $A_t^{\text{off}} = Q_{\psi}(s_t, a_t) - V_{\psi}(s_t)$ (IQL-style)[cite: 275].
      • Online Stage: $A_t^{\text{on}} = \text{GAE}(\lambda, \gamma; r_t, V_{\psi})$ (standard GAE)[cite: 287].
    • In the Online Stage, a standard value function loss is added: $\mathcal{L}_V = \lambda_V \mathbb{E}[(V_{\psi}(s_t) - \hat{V}_t)^2]$.
  2. Consistency Distillation Loss ($\mathcal{L}_{CD}$): This is a regression loss (from Eq. 9) that trains the consistency model $C_{\theta}$ to match the one-shot output of the full K-step diffusion teacher policy $\Psi_{\varphi}$[cite: 151, 302]:

    \[\mathcal{L}_{CD}(\theta) = \mathbb{E}_{x_0, \tau, \epsilon} \left[ || C_{\theta}(x^{\tau}, \tau) - \text{sg}[\Psi_{\varphi}(x^{\tau}, \tau \rightarrow 0)] ||_2^2 \right]\]
    • $\text{sg}[\cdot]$ is the stop-gradient, ensuring the teacher policy $\Psi_{\varphi}$ (which is $\pi_{\theta}$) is fixed during this computation.
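
Here is a minimal PyTorch-style sketch of how these two terms can be combined, assuming per-denoising-step log-probabilities of the sampled sub-actions are available for both the current and behavior policies. Shapes and argument names are assumptions for illustration.

```python
import torch

def rl100_total_loss(logp_new, logp_old, advantage, cm_action, teacher_action,
                     eps=0.2, lambda_cd=1.0):
    """
    logp_new, logp_old: (batch, K) log-probs of each denoising sub-action under the
                        current and behavior diffusion policies.
    advantage:          (batch,) environment-level advantage A_t (IQL offline, GAE online),
                        broadcast to every denoising step k.
    cm_action:          consistency-model (student) one-step prediction.
    teacher_action:     K-step DDIM teacher output (treated as a fixed target).
    """
    ratio = torch.exp(logp_new - logp_old)                  # r_k(pi), shape (batch, K)
    A = advantage.unsqueeze(-1)                             # the same A_t for all K steps
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * A
    l_rl = -torch.min(ratio * A, clipped).sum(dim=-1).mean()

    l_cd = ((cm_action - teacher_action.detach()) ** 2).mean()   # stop-gradient on teacher
    return l_rl + lambda_cd * l_cd
```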

🔑 Key Equations

  1. PPO-Style Objective over Denoising Steps (Eq. 18) [cite: 262]

    \[J_{i}(\pi) = \mathbb{E}_{s_t \sim \rho_{\pi_i}, a_t \sim \pi_i} \left[ \sum_{k=1}^{K} \min \left( r_k(\pi) A_t, \text{clip}(r_k(\pi), 1-\epsilon, 1+\epsilon) A_t \right) \right]\]
    • $J_i(\pi)$: The surrogate objective at PPO iteration $i$ for the new policy $\pi$.
    • $\mathbb{E}[\dots]$: Expectation over states $s_t$ from the behavior policy’s state distribution $\rho_{\pi_i}$ and actions $a_t$ sampled from the behavior policy $\pi_i$.
    • $\sum_{k=1}^{K}$: Sum over all $K$ denoising steps within the diffusion sampler.
    • $r_k(\pi)$: The importance sampling ratio for the $k$-th denoising sub-policy, $r_k(\pi) = \frac{\pi(a^{\tau_{k-1}} \mid s^k)}{\pi_i(a^{\tau_{k-1}} \mid s^k)}$[cite: 274].
    • $A_t$: The environment-level advantage (computed via IQL or GAE) for the action $a_t$ taken at environment step $t$[cite: 267].
    • $\text{clip}(\dots)$: The standard PPO clipping function with hyperparameter $\epsilon$[cite: 262].
  2. DDIM Stochastic Denoising Step (Eq. 5a, 5b) [cite: 117-119]

    \[x_m = \mu_{\theta}(x_t, t \rightarrow m) + \sigma_{t \rightarrow m} \epsilon_{t \rightarrow m}\]

    where the mean is:

    \[\mu_{\theta}(x_t, t \rightarrow m) = \sqrt{\overline{\alpha}_m} \hat{x}_0(x_t, t) + \sqrt{1 - \overline{\alpha}_m - \sigma_{t \rightarrow m}^2} \epsilon_{\theta}(x_t, t)\]
    • $x_t, x_m$: The noisy action at diffusion step $t$ and the “cleaner” action at step $m$ (where $m < t$)[cite: 115].
    • $\epsilon_{\theta}(x_t, t)$: The denoiser network’s (i.e., the policy’s) prediction of the noise in $x_t$.
    • $\hat{x}_0(x_t, t)$: The predicted “clean” action, derived from $x_t$ and $\epsilon_{\theta}$[cite: 112].
    • $\overline{\alpha}_t, \overline{\alpha}_m$ : Cumulative noise schedule products[cite: 114].
    • $\sigma_{t \rightarrow m}$ : The standard deviation (variance parameter) that injects stochasticity into the denoising step[cite: 116, 119]. This injected noise is what turns each denoising step into a stochastic sub-policy, whose sampled output $x_m$ serves as the sub-action.
  3. Consistency Distillation Loss (Eq. 9) [cite: 151]

    \[\mathcal{L}_{CD}(\theta) = \mathbb{E}_{x_0, \tau, \epsilon} \left[ || C_{\theta}(x^{\tau}, \tau) - \text{sg}[\Psi_{\varphi}(x^{\tau}, \tau \rightarrow 0)] ||_2^2 \right]\]
    • $C_{\theta}$: The single-step consistency model (the “student”)[cite: 148].
    • $x^{\tau}$: A noisy sample at an arbitrary noise level $\tau$[cite: 148, 151].
    • $\text{sg}[\cdot]$: The stop-gradient operator[cite: 151].
    • $\Psi_{\varphi}(\dots \rightarrow 0)$: The K-step DDIM policy (the “teacher”), which runs its full denoising chain to produce a clean sample from $x^{\tau}$[cite: 151].
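
For concreteness, here is a minimal sketch of one stochastic DDIM sub-step (Eq. 5a/5b), assuming a noise-prediction network `eps_model` and a precomputed cumulative schedule `alpha_bar`; the names and call signature are illustrative.

```python
import torch

def ddim_stochastic_step(x_t, t, m, eps_model, alpha_bar, sigma_tm):
    """One stochastic DDIM denoising step from diffusion time t to m (m < t), per Eq. 5a/5b."""
    eps_hat = eps_model(x_t, t)                                         # predicted noise
    x0_hat = (x_t - torch.sqrt(1 - alpha_bar[t]) * eps_hat) / torch.sqrt(alpha_bar[t])
    mean = (torch.sqrt(alpha_bar[m]) * x0_hat
            + torch.sqrt(1 - alpha_bar[m] - sigma_tm ** 2) * eps_hat)   # Eq. 5b: mean
    return mean + sigma_tm * torch.randn_like(x_t)                      # Eq. 5a: inject noise
```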

4. Experimental Validation (Concise Summary)

  • Setup:
    • Real-World Environments: A diverse suite of 7 real-robot tasks covering rigid-body dynamics (Push-T, Bowling), fluid/granular pouring, deformable objects (Towel Folding), precision dexterity (Unscrewing), and multi-stage manipulation (Orange Juicing)[cite: 10, 86, 88].
    • Simulation Environments: Standard RL benchmarks: Adroit (dexterous manipulation), Meta-World (multi-task), and MuJoCo locomotion[cite: 820].
    • Baselines:
      • Real-World: Imitation-only policies (DP-2D, DP3)[cite: 598].
      • Simulation: SOTA diffusion/flow-based RL methods (DPPO, ReinFlow, DSRL)[cite: 804, 818].
  • Main Quantitative Finding:
    • On the real-robot suite, RL-100 achieved a 100% success rate across all 7 tasks (900/900 total trials), including one run of 250/250 consecutive successes on the difficult dual-arm folding task [cite: 11, 611-613].
    • This massively outperforms the strongest imitation baseline, DP3, which averaged 70.6% success[cite: 606].
    • The policies also proved more efficient than human experts, with RL-100 achieving 20 successful Push-T episodes per unit time versus 17 for the human expert[cite: 786].
    • The policies demonstrated strong zero-shot generalization to new dynamics (92.5% avg. success) [cite: 13, 648] and sample-efficient few-shot adaptation to task variations (86.7% avg. success)[cite: 13, 646].

5. Cross-Context Analysis

Since this is the first paper in our collaboration, there are no prior analyses to cross-reference. We will use this paper as a foundational benchmark for future discussions on real-world RL, diffusion policies, and offline-to-online methods.


6. Critical Analysis (as a Collaborator)

🧐 Stated Limitations

The authors explicitly identify the following in their “Future Work” section[cite: 1116]:

  • Scene Complexity: The current tasks are relatively clean. They state a priority is to “extend evaluation to more complex, cluttered, and partially observable scenes” (e.g., occlusions, multi-object settings, changing illumination) [cite: 1117-1118].
  • Scaling to Foundation Models: They plan to scale this post-training method to larger, multi-task Vision-Language-Action (VLA) models and investigate scaling laws [cite: 1121-1123].
  • Reset Bottleneck: They acknowledge that “reset and recovery remain practical bottlenecks” for real-world training and plan to investigate autonomous reset mechanisms [cite: 1124-1125].

🕵️ Implied Weaknesses

Based on the text, I infer a few practical challenges not framed as limitations:

  • Significant Data Collection Cost: The “Iterative Offline RL” phase is not purely offline; it is an offline-to-online loop that requires substantial real-world data collection to expand the dataset. For example, the Soft-towel Folding task required 5 hours of human demos, followed by 11 more hours of policy rollouts for the iterative stage, and 8.5 more hours for the final online stage[cite: 839, 842, 845]. This is a very high (though effective) data-and-time cost for a single task.
  • Task-Specific Customization: While the framework is presented as “task-agnostic”[cite: 9], the implementation details show significant per-task specialization. The paper uses:
    • Different control modes (single-step vs. action-chunk)[cite: 87].
    • Different network backbones (Skip-Net vs. U-Net)[cite: 209, 213, 215].
    • Different reward structures (sparse for 5 tasks, dense-shaped for 1) [cite: 426-427].
    • Different robot embodiments and end-effectors[cite: 88]. This suggests that applying the RL-100 framework to a new task still requires considerable expert design and tuning, rather than being a single, general-purpose model.
  • Reliance on Human Supervision: The “sparse” reward for 5 of the 7 tasks was provided by a “human supervisor pressing a keyboard”[cite: 426]. This means the training loop is not fully autonomous and requires persistent human oversight, which is a practical bottleneck for long-running experiments.
  • Stability Paradox of IL Re-training: The paper’s offline loop (Algo. 1, line 13) relies on re-training an IL policy on the expanded dataset[cite: 316]. They state this is “crucial” for stability and distillation [cite: 317-321]. This is an interesting paradox: the method aims to “go beyond human performance” [cite: 43] and escape the “imitation ceiling”[cite: 27], yet its stability seemingly depends on repeatedly “washing” the RL-discovered data back through a supervised imitation learning objective. This implies that the PPO-style fine-tuning of the diffusion policy alone may be unstable, and this IL re-training step is a necessary stabilization and distillation mechanism.

Here is a detailed tutorial on the RL-100 framework, incorporating the key elements from our discussion and a comparative analysis with WSRL.


RL-100 Explained: A Deep Dive into Real-World RL

Introduction: What is RL-100?

RL-100 is a real-world reinforcement learning framework designed to create “deployment-ready” robotic manipulation policies. Its primary contribution is a robust, three-stage training pipeline that starts with human data and reliably fine-tunes a policy to achieve 100% success on complex, real-robot tasks.

At its core, RL-100 is a “warm-start” actor-critic method, similar in principle to WSRL. It begins with a pre-trained agent and carefully fine-tunes it using online interaction, but its underlying philosophy and mechanisms are fundamentally different.

The RL-100 Framework: A Three-Stage Pipeline

RL-100 breaks training into three distinct phases to ensure stability and performance.

Stage 1: Imitation Learning (The Foundation)

The process begins by training a conditional diffusion policy via standard behavioral cloning (BC) on a dataset of human-teleoperated demonstrations. This provides a competent, low-variance “base policy” that captures effective human strategies.
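
As background for readers new to diffusion policies, here is a minimal sketch of the behavior-cloning objective for a conditional diffusion policy: noise a demonstrated action, then train the network to predict that noise given the observation. Schedule handling and tensor shapes are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def diffusion_bc_loss(eps_model, obs, action, alpha_bar, num_steps):
    """Standard epsilon-prediction BC loss for a conditional diffusion policy (sketch)."""
    t = torch.randint(0, num_steps, (action.shape[0],))          # random diffusion step
    eps = torch.randn_like(action)                               # target noise
    a_bar = alpha_bar[t].unsqueeze(-1)                           # cumulative schedule term
    noisy_action = torch.sqrt(a_bar) * action + torch.sqrt(1 - a_bar) * eps
    eps_pred = eps_model(noisy_action, t, obs)                   # conditioned on the observation
    return F.mse_loss(eps_pred, eps)
```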

Stage 2: Iterative Offline RL (The “Virtuous Cycle”)

This stage provides the bulk of the performance improvement. It’s an offline-to-online loop that iteratively refines the policy:

  1. Improve Policy (Offline): The policy is fine-tuned using an offline RL objective on the current data buffer.
  2. Collect Data (Online): This improved policy is used to roll out on the real robot and collect new, higher-quality trajectories.
  3. Expand & Re-train: This new data is merged with the old data, and the base IL policy is re-trained on this new, expanded dataset.

Stage 3: Online RL (The “Last Mile”)

This is a final, brief “last-mile” phase. It uses a standard, on-policy online RL objective (PPO) to eliminate any rare, residual failure modes and push the success rate from high (e.g., 95%) to near-perfect.


Detailed Breakdown: The Core Components of RL-100

Here is a deeper look at the key components we discussed.

1. The Actor: A Diffusion Policy

The policy (or “actor”) in RL-100 is a conditional diffusion model. This means that selecting an action is a $K$-step denoising process. This K-step generative process is cleverly framed as a “sub-MDP,” which allows a PPO-style objective to be applied to each of the $K$ denoising steps.
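
Here is a minimal sketch of what "action selection as a sub-MDP" looks like at rollout time, assuming a policy object with a hypothetical `denoise_mean` helper and per-step noise scales `sigmas`: the chain of intermediate samples and their log-probabilities is recorded so each of the $K$ denoising steps can later receive its own PPO ratio.

```python
import torch

def sample_action_with_chain(policy, obs, K, sigmas):
    """Run K stochastic denoising steps and record each sub-step, so the PPO update can
    form an importance ratio r_k at every denoising step. Names are illustrative."""
    x = torch.randn(policy.action_dim)                  # start from Gaussian noise
    chain = []
    for k in range(K, 0, -1):
        mean = policy.denoise_mean(x, k, obs)           # hypothetical helper: Eq. 5b mean
        dist = torch.distributions.Normal(mean, sigmas[k])
        x_next = dist.sample()
        chain.append({"k": k, "x_in": x, "x_out": x_next,
                      "logp": dist.log_prob(x_next).sum()})
        x = x_next
    return x, chain                                     # x is the executed action u_t
```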

Policy Output: Single-Action vs. Action-Chunk

The architecture of the diffusion actor is adapted to the task:

| Control Mode | Single-Step Action | Action-Chunk |
|---|---|---|
| What it’s For | Reactive tasks (e.g., Agile Bowling) that need fast, closed-loop corrections. | Precision tasks (e.g., Unscrewing) that need smooth, coordinated motions. |
| What it Predicts | A single action $u_t$. | A sequence of $n_c$ future actions: $[u_t, …, u_{t+n_c-1}]$. |
| How it Helps | Allows the policy to react at every timestep. | Reduces jitter: executing a pre-planned sequence is smoother than 16 separate, jittery decisions. Limits error: fewer decision points mean fewer chances for errors to accumulate. |
| Network Head | Skip-Net | U-Net |
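
As a small illustration of the action-chunk mode, here is a sketch of receding-horizon execution: the policy predicts $n_c$ actions at once, the robot executes the chunk, then re-plans. The `policy.predict_chunk` helper and the minimal `env.step` interface are assumptions for illustration.

```python
def run_with_action_chunks(policy, env, obs, episode_len=200, n_c=8):
    """Illustrative receding-horizon loop for the action-chunk control mode."""
    steps = 0
    while steps < episode_len:
        chunk = policy.predict_chunk(obs, n_c)    # hypothetical: returns n_c future actions
        for u in chunk:                           # execute the chunk open-loop
            obs, done = env.step(u)               # assumed minimal environment interface
            steps += 1
            if done or steps >= episode_len:
                return obs
    return obs
```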

2. The Critics: Q, V, and the Transition Model

RL-100 is an actor-critic method that uses three key learned functions:

  • $Q_{\psi}$ (Action-Value) & $V_{\psi}$ (State-Value): These are trained in the offline stage using an IQL-style method on the replay buffer. Their primary job is to compute a reliable, 1-step off-policy advantage estimate: $A^{\text{off}} = Q(s,a) - V(s)$.
  • $T_{\theta}$ (Transition Model): This is a learned dynamics model ($s, a \rightarrow s'$) of the real world. Its only function is to act as a safety gate during the Stage 2 offline update.

How the “OPE Gate” Works: Before the algorithm deploys a newly-trained policy, it must ensure it’s actually better.

  1. Simulate: It runs imaginary rollouts of both the new policy and the old policy inside the learned Transition Model $T_{\theta}$.
  2. Score: It “scores” these imaginary rollouts using the learned $Q_{\psi}$ function.
  3. Decide: The new policy is only accepted if its simulated score is estimated to be better than the old policy’s score. This “OPE Gate” prevents divergence and ensures conservative, monotonic improvement.
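
Here is a minimal sketch of this gating logic, following the three steps above; the rollout horizon, scoring rule, and interfaces (`policy(s)`, `Q(s, a)`, `T_model(s, a)`) are illustrative assumptions rather than the paper's exact AM-Q procedure.

```python
def ope_gate_accepts(new_policy, old_policy, T_model, Q, init_states, horizon=20):
    """Accept the candidate policy only if its imagined return, scored by the learned
    Q-function inside the learned dynamics model, beats the old policy's."""
    def imagined_score(policy):
        total = 0.0
        for s0 in init_states:
            s = s0
            for _ in range(horizon):
                a = policy(s)                 # 1. simulate: sample an action
                total += float(Q(s, a))       # 2. score with the learned critic
                s = T_model(s, a)             #    step the learned transition model
        return total / len(init_states)

    return imagined_score(new_policy) > imagined_score(old_policy)  # 3. decide
```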

3. The “Offline” Loop (Stage 2) Explained

This stage is a hybrid “offline-to-online” cycle, not purely offline. Here is the true step-by-step process:

  1. Start with a dataset $\mathcal{D}_m$ and its IL-trained policy $\pi_i$.
  2. (Offline Phase): Train the critics ($Q, V, T$) on all of $\mathcal{D}_m$.
  3. (Offline Phase): Perform an off-policy PPO update to improve the policy. This update uses:
    • Data: Batches from $\mathcal{D}_m$.
    • Advantage: The off-policy advantage $A^{\text{off}} = Q-V$.
    • Constraint: The PPO clip, which forces the new policy $\pi$ to stay close to the “behavior” policy $\pi_i$.
  4. (Offline Phase): The newly updated policy is checked by the OPE Gate.
  5. (Online Phase): The accepted, improved policy is deployed on the real robot to collect a new batch of data, $\mathcal{D}_{\text{new}}$.
  6. (Offline Phase): The dataset is expanded: $\mathcal{D}_{m+1} = \mathcal{D}_m \cup \mathcal{D}_{\text{new}}$. A new base IL policy is trained on $\mathcal{D}_{m+1}$, and the loop repeats.
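
For the critics in step 2, here is a minimal PyTorch-style sketch of IQL-style training targets under the common formulation (expectile regression for $V$, a one-step backup through $V$ for $Q$); the expectile value and batch field names are assumptions, not the paper's settings.

```python
import torch

def iql_critic_losses(Q, V, Q_target, batch, gamma=0.99, expectile=0.7):
    """Illustrative IQL-style critic objectives and the off-policy advantage A = Q - V."""
    s, a, r = batch["s"], batch["a"], batch["r"]
    s_next, done = batch["s_next"], batch["done"]

    # V: expectile regression toward the (frozen) target Q, so V tracks an upper expectile.
    diff = Q_target(s, a).detach() - V(s)
    weight = torch.abs(expectile - (diff < 0).float())
    v_loss = (weight * diff ** 2).mean()

    # Q: one-step TD backup through V at the next state.
    td_target = r + gamma * (1.0 - done) * V(s_next).detach()
    q_loss = ((Q(s, a) - td_target) ** 2).mean()

    # Off-policy advantage consumed by the PPO-style update.
    advantage = (Q(s, a) - V(s)).detach()
    return v_loss, q_loss, advantage
```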

4. The “Fast” Policy: Consistency Distillation (CM)

This is a parallel process to solve a key deployment problem: the DDIM (diffusion) policy is slow, requiring K steps to pick an action.

  • The “Teacher” (DDIM Policy): The main policy trained by the RL-100 loop. It is smart but slow.
  • The “Student” (Consistency Model): A separate, lightweight policy. Its only job is to copy the Teacher.
  • The Joint Loss: For efficiency, the Student is trained at the same time as the Teacher. The total loss is: $\mathcal{L}_{\text{total}} = \mathcal{L}_{RL} + \lambda_{CD}\mathcal{L}_{CD}$
    • $\mathcal{L}_{RL}$ (the PPO loss) trains the Teacher.
    • $\mathcal{L}_{CD}$ (the distillation loss) trains the Student to match the Teacher.

At the end, you deploy the Student (CM), which is 1-step fast but has all the intelligence of the K-step Teacher.
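
A small sketch of what this buys at deployment time, assuming hypothetical `ddim_policy.denoise` and `cm_policy` call interfaces: the Teacher needs $K$ network calls per action, the Student needs one.

```python
import torch

def act_ddim(ddim_policy, obs, K=10):
    """Teacher: K denoising network calls per action (slower)."""
    x = torch.randn(ddim_policy.action_dim)
    for k in range(K, 0, -1):
        x = ddim_policy.denoise(x, k, obs)       # hypothetical per-step denoising call
    return x

def act_cm(cm_policy, obs):
    """Student: a single forward pass per action (the fast deployment policy)."""
    noise = torch.randn(cm_policy.action_dim)
    return cm_policy(noise, obs)                 # hypothetical one-step consistency call
```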

Crucial Clarification: “Single-Step” (CM) vs. “Single-Action” (Skip-Net). These two terms describe different things.

  • Single-Action (Skip-Net): This describes the output of the policy (one action).
  • Single-Step (CM): This describes the inference speed of the policy (one forward pass).

You can have a Single-Step (CM) policy that outputs an Action-Chunk (U-Net). The CM is just the fast version of whatever policy (Skip-Net or U-Net) you trained.


Comparative Analysis: RL-100 vs. WSRL

Both RL-100 and WSRL are “warm-start” frameworks that initialize both their actor and critic from an offline pre-training phase. Both frameworks also identify the exact same core problem: catastrophic forgetting at the moment of transition from offline to online training.

However, their diagnoses and solutions are fundamentally different, stemming from their choice of RL algorithm.

| Feature | WSRL (e.g., SAC-based) | RL-100 (PPO-based) |
|---|---|---|
| Base Algorithm | Value-Based (SAC) | Policy-Based (PPO) |
| Primary Challenge | Critic Instability. The pessimistic $Q$-function sees new online data, queries OOD targets, and enters a “downward spiral” of underestimation. | Actor Divergence. The policy (actor) will diverge to exploit value function errors if it’s not constrained, leading to out-of-distribution actions. |
| Key Network | $Q$-Function (Action-Value). This network is fragile because $Q(s, a)$ is sensitive to any new OOD action $a$. | $V$-Function (State-Value). This network is more stable because $V(s)$ is an average and less sensitive to small policy changes. |
| The “Bridge” | A One-Time Critic Recalibration (The “Warmup Phase”). | A Continuous Actor Constraint (The PPO Objective). |
| How it Works | It collects a small (e.g., 5k) buffer of online data first. This “recalibrates” the fragile $Q$-function on safe, in-distribution data, preventing the downward spiral. | It uses the PPO clipping mechanism to continuously constrain the actor, forcing it to stay close to the “safe” base policy. The actor is never allowed to diverge in the first place. |