Imitation Learning
Imitation Learning is an approach for learning policies from expert demonstrations without manually designing reward functions. In embodied intelligence, many tasks (such as folding clothes or pouring water) have reward functions that are extremely difficult to define, while human demonstrations are relatively easy to obtain. This makes imitation learning one of the core paradigms in embodied intelligence.
Embodied Intelligence Perspective: In recent years, imitation learning has gained even more traction in embodied intelligence than traditional RL. Representative works include ACT (Action Chunking with Transformers), Diffusion Policy, RT-2, and others. The pipeline of collecting data via teleoperation + training with imitation learning has become the mainstream workflow for robot skill learning.
Why Imitation Learning is Needed
Traditional RL requires manually designing reward functions (reward engineering), which is extremely difficult for many real-world tasks:
- Sparse rewards: A reward is only given when the clothes are fully folded; there is no signal during intermediate steps
- Hard-to-define rewards: What counts as "elegantly pouring water"? The quantitative criteria are highly subjective
- Safety constraints: Robots may damage themselves or the environment during exploration
Imitation learning directly learns from human (expert) demonstration data, bypassing the reward design challenge.
Behavioral Cloning (BC)
Behavioral Cloning (BC) is the simplest imitation learning method, converting the problem into supervised learning: given expert state-action pairs $(s_i, a_i) \sim \mathcal{D}_{\text{expert}}$, learn a policy $\pi_\theta(a \mid s)$ that reproduces the expert's actions.
Training Objective
For continuous action spaces, MSE loss is typically used:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}_{\text{expert}}} \left[ \lVert \pi_\theta(s) - a \rVert^2 \right]$$

For discrete action spaces, cross-entropy loss is used:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(s,a) \sim \mathcal{D}_{\text{expert}}} \left[ \log \pi_\theta(a \mid s) \right]$$
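The continuous-action case can be sketched with a toy supervised fit. Everything here is a made-up example (a hypothetical linear expert with gain matrix `K`, a linear policy `W`); it only illustrates minimizing the MSE objective by gradient descent on expert state-action pairs.

```python
import numpy as np

# Toy behavioral cloning: fit a linear policy pi(s) = W s to expert
# state-action pairs by minimizing MSE with gradient descent.
# The "expert" is a hypothetical linear controller a = K s.
rng = np.random.default_rng(0)
K = np.array([[1.5, -0.5], [0.3, 2.0]])   # unknown expert gains (assumed)
states = rng.normal(size=(256, 2))        # expert-visited states
actions = states @ K.T                    # expert actions

W = np.zeros((2, 2))                      # policy parameters
lr = 0.1
for _ in range(500):
    pred = states @ W.T
    grad = 2 * (pred - actions).T @ states / len(states)  # dMSE/dW
    W -= lr * grad

mse = np.mean((states @ W.T - actions) ** 2)
print(round(float(mse), 6))               # near zero: W has recovered K
```

On the expert's own state distribution the fit is essentially perfect; the next section explains why this alone is not enough.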
Covariate Shift Problem
The biggest problem with BC is covariate shift: during training, the state distribution seen is the expert's, but during execution, small policy errors lead to states the expert never visited, causing errors to accumulate.
This means BC performs reasonably well on short-horizon tasks but tends to collapse on long-horizon tasks.
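The horizon dependence can be illustrated with a deliberately simple, hypothetical 1-D system: the dynamics, gains, and initial error below are all invented for illustration, not taken from any real task.

```python
# Hypothetical 1-D illustration of compounding error: unstable
# dynamics s' = 2s + a. The expert's feedback gain (-2.0) cancels the
# instability exactly; a clone that undercorrects (gain -0.8) leaves
# residual dynamics s' = 1.2 s, so a tiny initial error grows
# exponentially with the horizon.
def clone_rollout(gain, s0=0.01, horizon=50):
    s = s0
    for _ in range(horizon):
        s = 2 * s + gain * s      # apply policy a = gain * s
    return abs(s)

print(clone_rollout(gain=-2.0))               # expert gain: error vanishes
print(clone_rollout(gain=-0.8, horizon=10))   # short horizon: still small
print(clone_rollout(gain=-0.8, horizon=50))   # long horizon: blown up
```

The same small per-step imitation error is harmless over 10 steps but catastrophic over 50, which is exactly the short- vs long-horizon gap described above.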
DAgger Algorithm
DAgger (Dataset Aggregation) addresses the covariate shift problem by iteratively collecting new data:
- Use the current policy $\pi_i$ to interact with the environment, collecting state sequences
- Ask the expert to label optimal actions $a^* = \pi^*(s)$ for these states
- Add the new data to the training set: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s, \pi^*(s))\}$
- Retrain the policy on the aggregated dataset
- Repeat until convergence
The key idea of DAgger is to provide expert guidance on "states the policy would actually reach," thereby mitigating covariate shift.
Limitation: Requires online expert annotation, which is costly to obtain in robot scenarios.
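The loop above can be sketched on a toy problem. The dynamics, the queryable expert `expert`, and the least-squares learner are all assumptions made for this sketch; the point is the collect-label-aggregate-retrain cycle.

```python
import numpy as np

# DAgger sketch on a toy 1-D problem (all components hypothetical):
# dynamics s' = s + 0.1*a + noise, expert policy a* = -2*s, learner is
# a linear policy a = w*s refit by least squares each iteration.
rng = np.random.default_rng(0)
expert = lambda s: -2.0 * s

def rollout_states(w, n_steps=30):
    """Visit states under the CURRENT policy a = w * s."""
    s, visited = 1.0, []
    for _ in range(n_steps):
        visited.append(s)
        s = s + 0.1 * (w * s) + rng.normal(scale=0.01)
    return np.array(visited)

S = rng.normal(size=50)                # initial BC data: expert states
A = expert(S)
w = np.sum(S * A) / np.sum(S * S)      # fit policy on expert data

for _ in range(5):                     # DAgger iterations
    new_S = rollout_states(w)          # 1. run current policy
    new_A = expert(new_S)              # 2. expert labels those states
    S = np.concatenate([S, new_S])     # 3. aggregate the dataset
    A = np.concatenate([A, new_A])
    w = np.sum(S * A) / np.sum(S * S)  # 4. retrain on the aggregate

print(round(float(w), 3))
```

Note that step 2 is exactly the costly part flagged above: the expert must be available online to label every state the learner actually visits.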
Inverse Reinforcement Learning (Inverse RL)
Inverse Reinforcement Learning (IRL) infers a reward function $\hat{R}$ from expert demonstrations, then trains a policy using RL on the learned reward:

$$\pi^* = \arg\max_\pi \; \mathbb{E}_\pi \left[ \sum_t \gamma^t \hat{R}(s_t, a_t) \right]$$
Core Assumption
The expert is an optimal policy under some unknown reward function, meaning the expert's behavior maximizes cumulative reward.
Main Methods
- Maximum Entropy IRL: Assumes the expert policy follows a maximum entropy distribution (similar to the idea in SAC)
- GAIL (Generative Adversarial Imitation Learning): Uses a GAN framework where the discriminator distinguishes between expert and learner behavior, and the generator (policy) tries to make its behavior indistinguishable from the expert
The advantage of IRL is that the learned reward function can generalize to different environment configurations; the downside is high training complexity.
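The GAIL mechanism can be sketched in miniature. The setup below is hypothetical (1-D features, a logistic-regression discriminator, fixed sample batches); it only shows the core idea that the discriminator's output is turned into a surrogate reward that is high where behavior looks expert-like.

```python
import numpy as np

# GAIL-style surrogate reward sketch (toy, hypothetical setup): a
# logistic-regression discriminator D(x) is trained to separate expert
# samples (label 1) from policy samples (label 0); the policy is then
# rewarded with r(x) = -log(1 - D(x)).
rng = np.random.default_rng(0)
expert_x = rng.normal(loc=2.0, size=(200, 1))   # expert (s, a) features
policy_x = rng.normal(loc=0.0, size=(200, 1))   # current policy features

X = np.vstack([expert_x, policy_x])
y = np.concatenate([np.ones(200), np.zeros(200)])

w, b = 0.0, 0.0
for _ in range(2000):                            # logistic regression by GD
    p = 1 / (1 + np.exp(-(X[:, 0] * w + b)))     # D(x)
    w -= 0.1 * np.mean((p - y) * X[:, 0])
    b -= 0.1 * np.mean(p - y)

def surrogate_reward(x):
    d = 1 / (1 + np.exp(-(x * w + b)))
    return -np.log(1 - d + 1e-8)

print(surrogate_reward(2.0) > surrogate_reward(0.0))  # True
```

In full GAIL the policy is then updated with RL against this reward, and discriminator and policy are trained in alternation.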
Modern Imitation Learning Methods
In recent years, many new imitation learning methods have emerged in the embodied intelligence field:
Action Chunking
Rather than predicting a single action per step, the policy predicts a chunk of future actions at once:

$$\pi_\theta(a_{t:t+k} \mid s_t) \quad \text{instead of} \quad \pi_\theta(a_t \mid s_t)$$
The representative work ACT (Action Chunking with Transformers) uses a CVAE + Transformer architecture and has achieved remarkable results on dexterous manipulation tasks. Action chunking effectively mitigates covariate shift because errors do not accumulate at every time step.
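Chunked execution can be sketched as follows. The chunk predictor and the ensembling temperature `m` are placeholders; the structure shown (overlapping chunks, exponentially weighted averaging of all predictions for the current step) follows the temporal-ensembling idea used in ACT.

```python
import numpy as np

# Action-chunking execution sketch (hypothetical policy): at every step
# the policy predicts a chunk of k future actions; overlapping chunk
# predictions for the same timestep are averaged with exponential
# weights, older predictions weighted highest.
k = 4                                   # chunk size (assumed)
m = 0.5                                 # ensembling temperature (assumed)

def policy(t):
    """Dummy chunk predictor: k actions for timesteps t .. t+k-1."""
    return np.array([float(t + i) for i in range(k)])

buffers = {}                            # timestep -> list of predictions
executed = []
for t in range(8):
    chunk = policy(t)
    for i, a in enumerate(chunk):       # file each predicted action
        buffers.setdefault(t + i, []).append(a)
    preds = np.array(buffers.pop(t))    # all predictions for step t
    weights = np.exp(-m * np.arange(len(preds)))   # oldest pred first
    executed.append(float(np.sum(weights * preds) / np.sum(weights)))

print(executed[:3])
```

Because each executed action blends several independent chunk predictions, a single bad prediction is damped rather than fed straight back into the state distribution.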
Diffusion Policy
Diffusion models are introduced into policy learning, modeling action generation as a denoising process: starting from Gaussian noise $a^K \sim \mathcal{N}(0, I)$, the policy iteratively denoises toward an action,

$$a^{k-1} = \alpha \left( a^k - \gamma\, \varepsilon_\theta(s, a^k, k) \right) + \mathcal{N}(0, \sigma^2 I)$$

where $\varepsilon_\theta$ is the learned noise-prediction network.
Diffusion policies can model multimodal action distributions (multiple reasonable actions for the same state) and perform excellently on fine manipulation tasks.
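The reverse (denoising) loop can be shown end to end if we cheat on training: for a dataset containing a single expert action `a0`, the optimal noise predictor has a closed form, so the sketch below (schedule, step count, and `a0` all assumed) runs the DDPM-style reverse process without a neural network and recovers the expert action from pure noise.

```python
import numpy as np

# Minimal reverse-diffusion sampling sketch (hypothetical setup): the
# "trained" noise predictor is the analytically optimal one for a
# dataset that is a point mass at a0, so denoising pure Gaussian noise
# should land on a0.
rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.2, T)       # assumed noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

a0 = 1.3                                 # the (only) expert action

def eps_hat(x, t):
    """Optimal noise prediction when the data is a point mass at a0."""
    return (x - np.sqrt(alpha_bars[t]) * a0) / np.sqrt(1 - alpha_bars[t])

x = rng.normal()                         # start from pure noise
for t in range(T - 1, -1, -1):           # reverse denoising loop
    coef = (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t])
    x = (x - coef * eps_hat(x, t)) / np.sqrt(alphas[t])
    if t > 0:                            # add noise except at the last step
        x += np.sqrt(betas[t]) * rng.normal()

print(round(float(x), 3))                # close to a0
```

With a real multimodal dataset the same loop, driven by a learned $\varepsilon_\theta$, samples one of several plausible actions per state, which is exactly the multimodality advantage described above.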
Data Collection: Teleoperation
The effectiveness of imitation learning largely depends on the quality of demonstration data. Common data collection methods:
| Method | Advantages | Disadvantages |
|---|---|---|
| Kinesthetic teaching | Intuitive, no extra equipment | Low precision, not suitable for complex tasks |
| Teleoperation joystick | Flexible, suitable for various tasks | Requires operator training |
| VR teleoperation | Immersive, high precision | High equipment cost |
| Video demonstration | Lowest data acquisition cost | Requires vision-to-action mapping |
RL + IL: Combined Use
In practice, imitation learning and reinforcement learning are often used together:
- IL pretraining + RL fine-tuning: First learn an initial policy with BC, then further optimize with RL (e.g., PPO). RL then starts from a reasonable policy, avoiding the inefficiency of starting from random exploration.
- IL provides demonstrations + RL provides reward: Use methods like GAIL to learn a reward function from demonstrations, then optimize the policy with RL.
- Residual RL: Learn a residual policy on top of the BC policy, $\pi(s) = \pi_{\text{BC}}(s) + \pi_{\text{res}}(s)$, so RL only needs to learn corrections to BC.
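The residual decomposition can be sketched on a toy 1-D task. The target gain, the frozen BC policy, and the squared-error signal standing in for a reward are all assumptions of this sketch; the point is that the residual only has to learn the small correction to BC.

```python
import numpy as np

# Residual-policy sketch (hypothetical 1-D task): the executed action is
# pi(s) = pi_BC(s) + theta * s. The frozen BC policy is roughly right
# but undercorrects; the residual gain theta is learned online and only
# needs to cover the gap, not the whole task.
target_gain = -2.0                     # ideal controller (assumed)
pi_bc = lambda s: -1.7 * s             # frozen, imperfect BC policy

theta = 0.0                            # residual gain, learned online
lr = 0.05
rng = np.random.default_rng(0)
for _ in range(300):
    s = rng.normal()
    a = pi_bc(s) + theta * s           # combined action
    err = a - target_gain * s          # squared-error stand-in for reward
    theta -= lr * err * s              # gradient step on 0.5 * err**2
print(round(float(theta), 3))          # residual converges near -0.3
```

The residual settles at roughly `-0.3`, the gap between the BC gain and the target gain, which is a much smaller object to learn than the full policy.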
Summary
| Method | Core Idea | Advantages | Disadvantages |
|---|---|---|---|
| BC | Supervised learning to fit expert | Simple and fast | Covariate shift |
| DAgger | Iterative collection + annotation | Mitigates shift | Requires online expert |
| IRL/GAIL | Learn reward function | Good generalization | Complex training |
| ACT | Action chunk prediction | Excellent for fine manipulation | Requires high-quality data |
| Diffusion Policy | Diffusion model generates actions | Multimodal modeling | Slower inference |