Imitation Learning
Imitation Learning is an approach for learning policies from expert demonstrations without manually designing reward functions. In embodied intelligence, many tasks (such as folding clothes or pouring water) have reward functions that are extremely difficult to define, while human demonstrations are relatively easy to obtain. This makes imitation learning one of the core paradigms in embodied intelligence.
Embodied Intelligence Perspective: In recent years, imitation learning has gained even more traction in embodied intelligence than traditional RL. Representative works include ACT (Action Chunking with Transformers), Diffusion Policy, RT-2, and others. The pipeline of collecting data via teleoperation + training with imitation learning has become the mainstream workflow for robot skill learning.
Why Imitation Learning is Needed
Traditional RL requires manually designing reward functions (reward engineering), which is extremely difficult for many real-world tasks:
- Sparse rewards: A reward is only given when the clothes are fully folded; there is no signal during intermediate steps
- Hard-to-define rewards: What counts as "elegantly pouring water"? The quantitative criteria are highly subjective
- Safety constraints: Robots may damage themselves or the environment during exploration
Imitation learning directly learns from human (expert) demonstration data, bypassing the reward design challenge.
Behavioral Cloning (BC)
Behavioral Cloning (BC) is the simplest imitation learning method, converting the problem into supervised learning: given expert state-action pairs $(s_i, a_i) \sim \mathcal{D}_{\text{expert}}$, learn a policy $\pi_\theta(a \mid s)$ that reproduces the expert's actions.
Training Objective
For continuous action spaces, MSE loss is typically used:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}_{\text{expert}}} \left[ \lVert \pi_\theta(s) - a \rVert^2 \right]$$

For discrete action spaces, cross-entropy loss is used:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(s,a) \sim \mathcal{D}_{\text{expert}}} \left[ \log \pi_\theta(a \mid s) \right]$$
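The continuous-action case can be sketched with a toy supervised fit. Everything here is a made-up example (a hypothetical linear expert with gain matrix `K`, a linear policy `W`); it only illustrates minimizing the MSE objective by gradient descent on expert state-action pairs.

```python
import numpy as np

# Toy behavioral cloning: fit a linear policy pi(s) = W s to expert
# state-action pairs by minimizing MSE with gradient descent.
# The "expert" is a hypothetical linear controller a = K s.
rng = np.random.default_rng(0)
K = np.array([[1.5, -0.5], [0.3, 2.0]])   # unknown expert gains (assumed)
states = rng.normal(size=(256, 2))        # expert-visited states
actions = states @ K.T                    # expert actions

W = np.zeros((2, 2))                      # policy parameters
lr = 0.1
for _ in range(500):
    pred = states @ W.T
    grad = 2 * (pred - actions).T @ states / len(states)  # dMSE/dW
    W -= lr * grad

mse = np.mean((states @ W.T - actions) ** 2)
print(round(float(mse), 6))               # near zero: W has recovered K
```

On the expert's own state distribution the fit is essentially perfect; the next section explains why this alone is not enough.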
Covariate Shift Problem
The biggest problem with BC is covariate shift: during training, the state distribution seen is the expert's, but during execution, small policy errors lead to states the expert never visited, causing errors to accumulate.
This means BC performs reasonably well on short-horizon tasks but tends to collapse on long-horizon tasks.
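The horizon dependence can be illustrated with a deliberately simple, hypothetical 1-D system: the dynamics, gains, and initial error below are all invented for illustration, not taken from any real task.

```python
# Hypothetical 1-D illustration of compounding error: unstable
# dynamics s' = 2s + a. The expert's feedback gain (-2.0) cancels the
# instability exactly; a clone that undercorrects (gain -0.8) leaves
# residual dynamics s' = 1.2 s, so a tiny initial error grows
# exponentially with the horizon.
def clone_rollout(gain, s0=0.01, horizon=50):
    s = s0
    for _ in range(horizon):
        s = 2 * s + gain * s      # apply policy a = gain * s
    return abs(s)

print(clone_rollout(gain=-2.0))               # expert gain: error vanishes
print(clone_rollout(gain=-0.8, horizon=10))   # short horizon: still small
print(clone_rollout(gain=-0.8, horizon=50))   # long horizon: blown up
```

The same small per-step imitation error is harmless over 10 steps but catastrophic over 50, which is exactly the short- vs long-horizon gap described above.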
DAgger Algorithm
DAgger (Dataset Aggregation) addresses the covariate shift problem by iteratively collecting new data:
- Use the current policy $\pi_i$ to interact with the environment, collecting state sequences
- Ask the expert to label optimal actions $a^* = \pi^*(s)$ for these states
- Add the new data to the training set: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s, \pi^*(s))\}$
- Retrain the policy on the aggregated dataset
- Repeat until convergence
The key idea of DAgger is to provide expert guidance on "states the policy would actually reach," thereby mitigating covariate shift.
Limitation: Requires online expert annotation, which is costly to obtain in robot scenarios.
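The loop above can be sketched on a toy problem. The dynamics, the queryable expert `expert`, and the least-squares learner are all assumptions made for this sketch; the point is the collect-label-aggregate-retrain cycle.

```python
import numpy as np

# DAgger sketch on a toy 1-D problem (all components hypothetical):
# dynamics s' = s + 0.1*a + noise, expert policy a* = -2*s, learner is
# a linear policy a = w*s refit by least squares each iteration.
rng = np.random.default_rng(0)
expert = lambda s: -2.0 * s

def rollout_states(w, n_steps=30):
    """Visit states under the CURRENT policy a = w * s."""
    s, visited = 1.0, []
    for _ in range(n_steps):
        visited.append(s)
        s = s + 0.1 * (w * s) + rng.normal(scale=0.01)
    return np.array(visited)

S = rng.normal(size=50)                # initial BC data: expert states
A = expert(S)
w = np.sum(S * A) / np.sum(S * S)      # fit policy on expert data

for _ in range(5):                     # DAgger iterations
    new_S = rollout_states(w)          # 1. run current policy
    new_A = expert(new_S)              # 2. expert labels those states
    S = np.concatenate([S, new_S])     # 3. aggregate the dataset
    A = np.concatenate([A, new_A])
    w = np.sum(S * A) / np.sum(S * S)  # 4. retrain on the aggregate

print(round(float(w), 3))
```

Note that step 2 is exactly the costly part flagged above: the expert must be available online to label every state the learner actually visits.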
Inverse Reinforcement Learning (Inverse RL)
Inverse Reinforcement Learning (IRL) infers a reward function $\hat{R}$ from expert demonstrations, then trains a policy using RL on the learned reward:

$$\pi^* = \arg\max_\pi \; \mathbb{E}_\pi \left[ \sum_t \gamma^t \hat{R}(s_t, a_t) \right]$$
Core Assumption
The expert is an optimal policy under some unknown reward function, meaning the expert's behavior maximizes cumulative reward.
Main Methods
- Maximum Entropy IRL: Assumes the expert policy follows a maximum entropy distribution (similar to the idea in SAC)
- GAIL (Generative Adversarial Imitation Learning): Uses a GAN framework where the discriminator distinguishes between expert and learner behavior, and the generator (policy) tries to make its behavior indistinguishable from the expert
The advantage of IRL is that the learned reward function can generalize to different environment configurations; the downside is high training complexity.
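The GAIL mechanism can be sketched in miniature. The setup below is hypothetical (1-D features, a logistic-regression discriminator, fixed sample batches); it only shows the core idea that the discriminator's output is turned into a surrogate reward that is high where behavior looks expert-like.

```python
import numpy as np

# GAIL-style surrogate reward sketch (toy, hypothetical setup): a
# logistic-regression discriminator D(x) is trained to separate expert
# samples (label 1) from policy samples (label 0); the policy is then
# rewarded with r(x) = -log(1 - D(x)).
rng = np.random.default_rng(0)
expert_x = rng.normal(loc=2.0, size=(200, 1))   # expert (s, a) features
policy_x = rng.normal(loc=0.0, size=(200, 1))   # current policy features

X = np.vstack([expert_x, policy_x])
y = np.concatenate([np.ones(200), np.zeros(200)])

w, b = 0.0, 0.0
for _ in range(2000):                            # logistic regression by GD
    p = 1 / (1 + np.exp(-(X[:, 0] * w + b)))     # D(x)
    w -= 0.1 * np.mean((p - y) * X[:, 0])
    b -= 0.1 * np.mean(p - y)

def surrogate_reward(x):
    d = 1 / (1 + np.exp(-(x * w + b)))
    return -np.log(1 - d + 1e-8)

print(surrogate_reward(2.0) > surrogate_reward(0.0))  # True
```

In full GAIL the policy is then updated with RL against this reward, and discriminator and policy are trained in alternation.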
Modern Imitation Learning Methods
In recent years, many new imitation learning methods have emerged in the embodied intelligence field:
Action Chunking
Rather than predicting a single action per step, the policy predicts a chunk of future actions at once:

$$\pi_\theta(a_{t:t+k} \mid s_t) \quad \text{instead of} \quad \pi_\theta(a_t \mid s_t)$$
The representative work ACT (Action Chunking with Transformers) uses a CVAE + Transformer architecture and has achieved remarkable results on dexterous manipulation tasks. Action chunking effectively mitigates covariate shift because errors do not accumulate at every time step.
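Chunked execution can be sketched as follows. The chunk predictor and the ensembling temperature `m` are placeholders; the structure shown (overlapping chunks, exponentially weighted averaging of all predictions for the current step) follows the temporal-ensembling idea used in ACT.

```python
import numpy as np

# Action-chunking execution sketch (hypothetical policy): at every step
# the policy predicts a chunk of k future actions; overlapping chunk
# predictions for the same timestep are averaged with exponential
# weights, older predictions weighted highest.
k = 4                                   # chunk size (assumed)
m = 0.5                                 # ensembling temperature (assumed)

def policy(t):
    """Dummy chunk predictor: k actions for timesteps t .. t+k-1."""
    return np.array([float(t + i) for i in range(k)])

buffers = {}                            # timestep -> list of predictions
executed = []
for t in range(8):
    chunk = policy(t)
    for i, a in enumerate(chunk):       # file each predicted action
        buffers.setdefault(t + i, []).append(a)
    preds = np.array(buffers.pop(t))    # all predictions for step t
    weights = np.exp(-m * np.arange(len(preds)))   # oldest pred first
    executed.append(float(np.sum(weights * preds) / np.sum(weights)))

print(executed[:3])
```

Because each executed action blends several independent chunk predictions, a single bad prediction is damped rather than fed straight back into the state distribution.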
Diffusion Policy
Diffusion models are introduced into policy learning, modeling action generation as a denoising process: starting from Gaussian noise $a^K \sim \mathcal{N}(0, I)$, the policy iteratively denoises toward an action,

$$a^{k-1} = \alpha \left( a^k - \gamma\, \varepsilon_\theta(s, a^k, k) \right) + \mathcal{N}(0, \sigma^2 I)$$

where $\varepsilon_\theta$ is the learned noise-prediction network.
Diffusion policies can model multimodal action distributions (multiple reasonable actions for the same state) and perform excellently on fine manipulation tasks.
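The reverse (denoising) loop can be shown end to end if we cheat on training: for a dataset containing a single expert action `a0`, the optimal noise predictor has a closed form, so the sketch below (schedule, step count, and `a0` all assumed) runs the DDPM-style reverse process without a neural network and recovers the expert action from pure noise.

```python
import numpy as np

# Minimal reverse-diffusion sampling sketch (hypothetical setup): the
# "trained" noise predictor is the analytically optimal one for a
# dataset that is a point mass at a0, so denoising pure Gaussian noise
# should land on a0.
rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.2, T)       # assumed noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

a0 = 1.3                                 # the (only) expert action

def eps_hat(x, t):
    """Optimal noise prediction when the data is a point mass at a0."""
    return (x - np.sqrt(alpha_bars[t]) * a0) / np.sqrt(1 - alpha_bars[t])

x = rng.normal()                         # start from pure noise
for t in range(T - 1, -1, -1):           # reverse denoising loop
    coef = (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t])
    x = (x - coef * eps_hat(x, t)) / np.sqrt(alphas[t])
    if t > 0:                            # add noise except at the last step
        x += np.sqrt(betas[t]) * rng.normal()

print(round(float(x), 3))                # close to a0
```

With a real multimodal dataset the same loop, driven by a learned $\varepsilon_\theta$, samples one of several plausible actions per state, which is exactly the multimodality advantage described above.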
Data Collection: Teleoperation
The effectiveness of imitation learning largely depends on the quality of demonstration data. Common data collection methods:
| Method | Advantages | Disadvantages |
|---|---|---|
| Kinesthetic teaching | Intuitive, no extra equipment | Low precision, not suitable for complex tasks |
| Teleoperation joystick | Flexible, suitable for various tasks | Requires operator training |
| VR teleoperation | Immersive, high precision | High equipment cost |
| Video demonstration | Lowest data acquisition cost | Requires vision-to-action mapping |
RL + IL: Combined Use
In practice, imitation learning and reinforcement learning are often used together:
- IL pretraining + RL fine-tuning: First learn an initial policy with BC, then further optimize with RL (e.g., PPO). RL then starts from a reasonable policy, avoiding the inefficiency of starting from random exploration.
- IL provides demonstrations + RL provides reward: Use methods like GAIL to learn a reward function from demonstrations, then optimize the policy with RL.
- Residual RL: Learn a residual policy on top of the BC policy, $\pi(s) = \pi_{\text{BC}}(s) + \pi_{\text{res}}(s)$, so RL only needs to learn corrections to BC.
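The residual decomposition can be sketched on a toy 1-D task. The target gain, the frozen BC policy, and the squared-error signal standing in for a reward are all assumptions of this sketch; the point is that the residual only has to learn the small correction to BC.

```python
import numpy as np

# Residual-policy sketch (hypothetical 1-D task): the executed action is
# pi(s) = pi_BC(s) + theta * s. The frozen BC policy is roughly right
# but undercorrects; the residual gain theta is learned online and only
# needs to cover the gap, not the whole task.
target_gain = -2.0                     # ideal controller (assumed)
pi_bc = lambda s: -1.7 * s             # frozen, imperfect BC policy

theta = 0.0                            # residual gain, learned online
lr = 0.05
rng = np.random.default_rng(0)
for _ in range(300):
    s = rng.normal()
    a = pi_bc(s) + theta * s           # combined action
    err = a - target_gain * s          # squared-error stand-in for reward
    theta -= lr * err * s              # gradient step on 0.5 * err**2
print(round(float(theta), 3))          # residual converges near -0.3
```

The residual settles at roughly `-0.3`, the gap between the BC gain and the target gain, which is a much smaller object to learn than the full policy.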
Summary
| Method | Core Idea | Advantages | Disadvantages |
|---|---|---|---|
| BC | Supervised learning to fit expert | Simple and fast | Covariate shift |
| DAgger | Iterative collection + annotation | Mitigates shift | Requires online expert |
| IRL/GAIL | Learn reward function | Good generalization | Complex training |
| ACT | Action chunk prediction | Excellent for fine manipulation | Requires high-quality data |
| Diffusion Policy | Diffusion model generates actions | Multimodal modeling | Slower inference |