
Imitation Learning

Imitation Learning is an approach for learning policies from expert demonstrations, without manually designing reward functions. In embodied intelligence, many tasks (such as folding clothes or pouring water) have reward functions that are extremely difficult to define, while human demonstrations are relatively easy to obtain. This makes imitation learning one of the core paradigms in embodied intelligence.

Embodied Intelligence Perspective: In recent years, imitation learning has gained even more traction in embodied intelligence than traditional RL. Representative works include ACT (Action Chunking with Transformers), Diffusion Policy, RT-2, and others. The pipeline of collecting data via teleoperation + training with imitation learning has become the mainstream workflow for robot skill learning.

Why Imitation Learning is Needed

Traditional RL requires manually designing reward functions (reward engineering), which is extremely difficult for many real-world tasks:

  • Sparse rewards: A reward is only given when the clothes are fully folded; there is no signal during intermediate steps
  • Hard-to-define rewards: What counts as "elegantly pouring water"? The quantitative criteria are highly subjective
  • Safety constraints: Robots may damage themselves or the environment during exploration

Imitation learning directly learns from human (expert) demonstration data, bypassing the reward design challenge.

Behavioral Cloning (BC)

Behavioral Cloning (BC) is the simplest imitation learning method, converting the problem into supervised learning: given expert state-action pairs $(s_i, a_i^*)$, train a policy network $\pi_\theta$ to fit the expert's behavior.

Training Objective

For continuous action spaces, MSE loss is typically used:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,\, a^*) \sim \mathcal{D}}\left[\,\lVert \pi_\theta(s) - a^* \rVert^2\,\right]$$

For discrete action spaces, cross-entropy loss is used:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(s,\, a^*) \sim \mathcal{D}}\left[\log \pi_\theta(a^* \mid s)\right]$$
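As a concrete toy sketch of the continuous-action objective, the following fits a linear policy to a hypothetical 1-D expert by minimizing the MSE loss in closed form; the expert mapping $a^* = 2s + 1$ and the dataset size are illustrative assumptions:

```python
import numpy as np

# Hypothetical 1-D expert: a* = 2s + 1 (an illustrative assumption).
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(200, 1))
actions = 2.0 * states + 1.0

# Behavioral cloning with a linear policy pi(s) = w*s + b:
# minimizing the MSE loss over the demonstrations is exactly
# a least-squares fit, solvable in closed form.
X = np.hstack([states, np.ones_like(states)])   # add a bias column
w, b = np.linalg.lstsq(X, actions, rcond=None)[0].ravel()

print(round(w, 3), round(b, 3))  # recovers the expert's mapping
```

With a neural policy the same objective is minimized by gradient descent instead of a closed-form solve, but the supervised-learning structure is identical.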

Covariate Shift Problem

The biggest problem with BC is covariate shift: the policy is trained on states drawn from the expert's distribution, but at execution time small prediction errors push it into states the expert never visited, where its predictions are even worse, so errors compound over the trajectory.

This means BC performs reasonably well on short-horizon tasks but tends to collapse on long-horizon tasks.
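The compounding effect can be seen in a toy unstable system (every number below is an illustrative assumption): the expert's feedback gain holds the state fixed, while a cloned policy with a slightly wrong gain drifts exponentially.

```python
# Toy unstable system: s_{t+1} = 1.2 * s_t + a_t.
# The expert gain a = -0.2 * s holds the state fixed; a cloned policy
# with a slightly wrong gain (-0.18, an assumed fitting error) leaves
# a residual factor of 1.02 per step, so the deviation compounds.
def rollout(policy, steps=100, s0=1.0):
    s = s0
    for _ in range(steps):
        s = 1.2 * s + policy(s)
    return s

expert = lambda s: -0.20 * s
cloned = lambda s: -0.18 * s

print(round(rollout(expert), 2), round(rollout(cloned), 2))
```

After 100 steps the expert rollout is still at 1.0, while the cloned rollout has drifted by a factor of roughly $1.02^{100} \approx 7.2$, far outside the states seen in training.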

DAgger Algorithm

DAgger (Dataset Aggregation) addresses the covariate shift problem by iteratively collecting new data:

  1. Use the current policy to interact with the environment, collecting state sequences
  2. Ask the expert to label optimal actions for these states
  3. Add new data to the training set
  4. Retrain the policy with the augmented dataset
  5. Repeat until convergence

The key idea of DAgger is to provide expert guidance on "states the policy would actually reach," thereby mitigating covariate shift.
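A minimal sketch of the DAgger loop, assuming a hypothetical 1-D system, a synthetic expert $a^* = -s$, and a linear policy $a = w \cdot s$ refit by least squares on the aggregated dataset:

```python
import numpy as np

# DAgger sketch (all dynamics and the expert are illustrative
# assumptions): states come from rolling out the CURRENT policy,
# the expert labels them, and the policy is refit on everything.
rng = np.random.default_rng(0)

def rollout_states(w, steps=20, s0=1.0):
    s, visited = s0, []
    for _ in range(steps):
        visited.append(s)
        s = s + w * s + rng.normal(0, 0.01)   # noisy dynamics
    return visited

expert = lambda s: -s                          # synthetic expert
D_states, D_actions = [], []
w = 0.0                                        # untrained initial policy

for _ in range(5):                             # DAgger iterations
    states = rollout_states(w)                 # 1. roll out current policy
    D_states += states                         # 2-3. expert labels, aggregate
    D_actions += [expert(s) for s in states]
    X, y = np.array(D_states), np.array(D_actions)
    w = float((X @ y) / (X @ X))               # 4. refit on aggregated data

print(round(w, 2))   # converges to the expert gain of -1
```

The important detail is step 1: the states are generated by the learner itself, so the expert labels exactly the states the policy actually reaches.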

Limitation: Requires online expert annotation, which is costly to obtain in robot scenarios.

Inverse Reinforcement Learning (Inverse RL)

Inverse Reinforcement Learning (IRL) first infers a reward function from expert demonstrations, then trains a policy with standard RL under the learned reward.

Core Assumption

The expert is an optimal policy under some unknown reward function, meaning the expert's behavior maximizes cumulative reward.

Main Methods

  • Maximum Entropy IRL: Assumes the probability of an expert trajectory is proportional to the exponential of its cumulative reward, $p(\tau) \propto \exp(R(\tau))$, so near-optimal behavior is most likely while imperfect demonstrations are still explained (similar to the maximum-entropy idea in SAC)
  • GAIL (Generative Adversarial Imitation Learning): Uses a GAN framework where the discriminator distinguishes between expert and learner behavior, and the generator (policy) tries to make its behavior indistinguishable from the expert

The advantage of IRL is that the learned reward function can generalize to different environment configurations; the downside is high training complexity.
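The GAIL-style reward signal can be sketched with a logistic discriminator on a 1-D feature (the cluster locations, learning rate, and step count below are illustrative assumptions, not a full GAIL implementation):

```python
import numpy as np

# Minimal sketch of the GAIL reward: a logistic discriminator D is
# trained to output 1 on expert samples and 0 on policy samples;
# the policy is then rewarded with r = -log(1 - D), which is high
# wherever its behavior looks expert-like.
rng = np.random.default_rng(0)
expert_sa = rng.normal(1.0, 0.1, size=500)   # expert (s, a) features
policy_sa = rng.normal(0.0, 0.1, size=500)   # current policy features

X = np.concatenate([expert_sa, policy_sa])
y = np.array([1.0] * 500 + [0.0] * 500)
w, b = 0.0, 0.0

for _ in range(500):                          # gradient ascent on log-likelihood
    p = 1.0 / (1.0 + np.exp(-(w * X + b)))
    w += 0.1 * np.mean((y - p) * X)
    b += 0.1 * np.mean(y - p)

D = lambda x: 1.0 / (1.0 + np.exp(-(w * x + b)))
r = lambda x: -np.log(1.0 - D(x))             # GAIL-style reward
print(r(1.0) > r(0.0))                        # expert-like inputs score higher
```

In full GAIL the discriminator takes state-action pairs and is retrained as the policy improves, so the implicit reward keeps tightening around expert behavior.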

Modern Imitation Learning Methods

In recent years, many new imitation learning methods have emerged in the embodied intelligence field:

Action Chunking

Rather than predicting a single action per step, the policy predicts a chunk of future actions at once:

$$\pi_\theta(s_t) = (a_t, a_{t+1}, \dots, a_{t+k-1})$$

The representative work ACT (Action Chunking with Transformers) uses a CVAE + Transformer architecture and has achieved remarkable results on dexterous manipulation tasks. Action chunking effectively mitigates covariate shift because errors do not accumulate at every time step.
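A minimal sketch of chunked execution, assuming a hand-made stand-in policy (a lookup into a reference trajectory) rather than a trained CVAE + Transformer:

```python
import numpy as np

# Action chunking sketch: the policy predicts k future actions at
# once and executes the whole chunk open-loop, replanning only
# every k steps. The "policy" is an assumed lookup of a reference
# trajectory, purely for illustration.
k = 4
reference = np.linspace(0.0, 1.0, 17)         # target positions

def predict_chunk(t):
    # next k actions (position deltas) starting from step t
    return np.diff(reference[t : t + k + 1])

s, t = reference[0], 0
while t < 16:
    for a in predict_chunk(t):                # execute the whole chunk
        s = s + a
    t += k                                    # replan every k steps

print(round(s, 3))   # ends at the final reference position
```

With k-step chunks the policy makes a decision only once every k steps, which is one intuition for why per-step errors have fewer opportunities to compound.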

Diffusion Policy

Diffusion models are introduced into policy learning, modeling action generation as an iterative denoising process: starting from Gaussian noise $a^K \sim \mathcal{N}(0, I)$, repeatedly apply

$$a^{k-1} = \alpha\left(a^k - \gamma\,\varepsilon_\theta(s, a^k, k)\right) + \mathcal{N}(0, \sigma^2 I)$$

until a clean action $a^0$ is obtained.

Diffusion policies can model multimodal action distributions (multiple reasonable actions for the same state) and perform excellently on fine manipulation tasks.
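The denoising loop can be sketched with a hand-made stand-in for the learned denoiser $\varepsilon_\theta$ that simply pulls samples toward two action modes at $-1$ and $+1$ (an illustrative assumption, not a trained model):

```python
import numpy as np

# Conceptual sketch of diffusion-style action generation: start from
# Gaussian noise and iteratively denoise. The "denoiser" is a
# hand-made stand-in that pulls each sample toward its nearest
# action mode, illustrating how one state can yield a multimodal
# action distribution.
rng = np.random.default_rng(0)

def denoise_step(a, strength=0.3):
    nearest_mode = np.where(a >= 0, 1.0, -1.0)
    return a + strength * (nearest_mode - a)   # move toward a mode

actions = rng.normal(0.0, 1.0, size=1000)      # pure noise
for _ in range(30):                            # reverse diffusion loop
    actions = denoise_step(actions)

# samples collapse onto the two modes, not onto their mean
print(round(float(np.mean(np.abs(actions))), 2))
```

An MSE-trained unimodal policy would average the two modes and output an action near 0 that belongs to neither; the denoising process instead commits each sample to one mode.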

Data Collection: Teleoperation

The effectiveness of imitation learning largely depends on the quality of demonstration data. Common data collection methods:

| Method | Advantages | Disadvantages |
| --- | --- | --- |
| Kinesthetic teaching | Intuitive, no extra equipment | Low precision, not suitable for complex tasks |
| Teleoperation joystick | Flexible, suitable for various tasks | Requires operator training |
| VR teleoperation | Immersive, high precision | High equipment cost |
| Video demonstration | Lowest data acquisition cost | Requires vision-to-action mapping |

RL + IL: Combined Use

In practice, imitation learning and reinforcement learning are often used together:

  1. IL pretraining + RL fine-tuning: First learn an initial policy with BC, then further optimize with RL (e.g., PPO). This way RL starts from a reasonable policy, avoiding the inefficiency of starting from random exploration.

  2. IL provides demonstrations + RL provides reward: Use methods like GAIL to learn rewards from demonstrations, then optimize the policy with RL.

  3. Residual RL: Learn a residual policy on top of the BC policy, $\pi(s) = \pi_{\text{BC}}(s) + \pi_{\text{res}}(s)$, where RL only needs to learn the correction $\pi_{\text{res}}$ to BC.
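A one-line sketch of the residual composition, with both components as illustrative stand-ins:

```python
# Residual RL sketch (illustrative stand-ins): the cloned policy
# systematically undershoots the hypothetical ideal action a* = s,
# and the residual policy only has to learn the small correction.
ideal  = lambda s: s               # hypothetical ideal action
pi_bc  = lambda s: 0.9 * s         # imperfect cloned policy
pi_res = lambda s: 0.1 * s         # correction learned by RL

def pi(s):
    return pi_bc(s) + pi_res(s)    # deployed policy

s = 2.0
print(round(pi(s), 3), round(ideal(s), 3))   # residual closes the gap
```

Because the residual is small, RL exploration starting from the BC policy stays near sensible behavior, which also helps with safety on real robots.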

Summary

| Method | Core Idea | Advantages | Disadvantages |
| --- | --- | --- | --- |
| BC | Supervised learning to fit the expert | Simple and fast | Covariate shift |
| DAgger | Iterative collection + annotation | Mitigates shift | Requires online expert |
| IRL/GAIL | Learn a reward function | Good generalization | Complex training |
| ACT | Action chunk prediction | Excellent for fine manipulation | Requires high-quality data |
| Diffusion Policy | Diffusion model generates actions | Multimodal modeling | Slower inference |