
SAC Algorithm

Soft Actor-Critic (SAC) is a policy gradient algorithm based on maximum entropy reinforcement learning. Compared to PPO, SAC is off-policy and more sample-efficient; compared to TD3, SAC uses a stochastic policy with stronger exploration. The improved version (SAC v2) achieves stability comparable to PPO.

Embodied Intelligence Perspective: SAC is widely used in scenarios requiring high sample efficiency, such as robotic arm manipulation and dexterous hand control. Due to its off-policy nature, SAC can reuse historical data, making it particularly advantageous for real robots where sampling is expensive.

Maximum Entropy Reinforcement Learning

Deterministic vs Stochastic Policies

  • Deterministic policies: Stable but lack exploration, prone to local optima (e.g., DQN, DDPG)
  • Stochastic policies: Flexible but with higher variance (e.g., A2C, PPO)

Maximum entropy RL argues that even mature stochastic policies are not yet "optimally random." It introduces information entropy, maximizing the policy's entropy alongside the cumulative reward:

$$\pi^* = \arg\max_{\pi} \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \right]$$

where $\alpha$ is the temperature factor, and $\mathcal{H}$ is the policy's information entropy:

$$\mathcal{H}(\pi(\cdot \mid s)) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ -\log \pi(a \mid s) \right]$$

The more random the policy, the higher the entropy. The benefit of maximum entropy: when uncertain about which action is optimal, it tends to keep more options open, making the policy more robust.
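As a quick illustration, the entropy of a discrete action distribution can be computed directly (a minimal sketch; the three distributions below are made-up examples):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum p * log p (natural log)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + 1e-12)))  # epsilon avoids log(0)

# A deterministic policy has zero entropy; the uniform policy maximizes it.
deterministic = [1.0, 0.0, 0.0, 0.0]
peaked        = [0.7, 0.1, 0.1, 0.1]
uniform       = [0.25, 0.25, 0.25, 0.25]

print(entropy(deterministic))  # ~0
print(entropy(peaked))
print(entropy(uniform))        # log(4), the maximum for 4 actions
```

The uniform policy "keeps all options open" and attains the maximum entropy $\log |\mathcal{A}|$; the entropy bonus in the objective pulls the policy toward this end of the spectrum whenever the Q-values do not clearly favor one action.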

Soft Q-Learning

Under the maximum entropy framework, the Q-value and V-value functions need to be redefined.

Soft Q-value function (with the entropy term added):

$$Q_{\text{soft}}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}} \left[ V_{\text{soft}}(s_{t+1}) \right]$$

Soft V-value function:

$$V_{\text{soft}}(s_t) = \alpha \log \int \exp\left( \frac{1}{\alpha} Q_{\text{soft}}(s_t, a_t) \right) \mathrm{d}a_t$$

The V function is essentially a softmax (log-sum-exp) form of the Q function rather than a hardmax, which is where the "Soft" name comes from. When $\alpha \to 0$, it degenerates to the traditional value function $V(s) = \max_a Q(s, a)$.
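For a discrete action set the integral becomes a sum, and the $\alpha \to 0$ limit can be checked numerically (a sketch with made-up Q-values):

```python
import numpy as np

def soft_value(q, alpha):
    """V_soft(s) = alpha * log sum_a exp(Q(s,a)/alpha), discrete actions."""
    q = np.asarray(q, dtype=float)
    return float(alpha * np.log(np.sum(np.exp(q / alpha))))

q_values = np.array([1.0, 2.0, 3.0])
for alpha in [1.0, 0.1, 0.01]:
    print(alpha, soft_value(q_values, alpha))
# As alpha shrinks, the soft value approaches max(q) = 3.0 (hardmax)
```

The soft value always upper-bounds the hard maximum, and the gap shrinks with the temperature; this is the same log-sum-exp smoothing used throughout maximum entropy RL.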

SAC Algorithm

SAC has two versions: v1 from 2018 and v2 from 2019. v2 mainly adds automatic temperature factor adjustment.

SAC v1 (2018)

SAC v1 contains three networks: V network, Q network, and policy network.

V network objective function:

$$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ \frac{1}{2} \Big( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi} \left[ Q_\theta(s_t, a_t) - \alpha \log \pi_\phi(a_t \mid s_t) \right] \Big)^2 \right]$$

Soft Q function objective:

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \left[ \frac{1}{2} \Big( Q_\theta(s_t, a_t) - r(s_t, a_t) - \gamma \, \mathbb{E}_{s_{t+1}} \left[ V_{\bar{\psi}}(s_{t+1}) \right] \Big)^2 \right]$$

where $V_{\bar\psi}$ is a target V network updated by exponential moving average.

Policy objective function (to be minimized):

$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_\phi} \left[ \alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t) \right]$$

SAC algorithm pseudocode

SAC v2 (2019) — Automatic Temperature Tuning

In v1, the temperature factor $\alpha$ is a hyperparameter that significantly affects training performance. v2 reformulates this as a constrained optimization problem: maximize the expected return subject to the policy's entropy being no lower than a threshold $\mathcal{H}_0$:

$$\max_\pi \, \mathbb{E}_{\rho_\pi} \left[ \sum_t r(s_t, a_t) \right] \quad \text{s.t.} \quad \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ -\log \pi(a_t \mid s_t) \right] \ge \mathcal{H}_0$$

Via the Lagrangian multiplier method, $\alpha$ becomes a learnable parameter that adjusts automatically during training by minimizing $J(\alpha) = \mathbb{E}_{a_t \sim \pi} \left[ -\alpha \log \pi(a_t \mid s_t) - \alpha \mathcal{H}_0 \right]$. This makes SAC v2 nearly hyperparameter-free, greatly improving its practicality.
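A minimal numeric sketch of the automatic temperature update, assuming the standard SAC v2 temperature loss J(alpha) = E[-alpha * (log pi(a|s) + H0)] and the common heuristic of setting the target entropy H0 to the negative of the action dimension (all numbers here are made up):

```python
import numpy as np

target_entropy = -1.0  # H0 for a 1-D action space (common heuristic)
log_alpha = 0.0        # optimize log(alpha) so that alpha stays positive
lr = 0.1

log_pi = np.array([-0.2, -0.4])  # log pi(a|s) from the current policy

# Gradient of J(alpha) w.r.t. log_alpha: -alpha * E[log_pi + H0]
alpha = np.exp(log_alpha)
grad = -alpha * np.mean(log_pi + target_entropy)
log_alpha -= lr * grad
alpha = np.exp(log_alpha)
print(alpha)
```

Here the policy's entropy estimate (-mean(log_pi) = 0.3) is above the target of -1.0, so the update shrinks alpha, weakening the entropy bonus; if the policy were less random than the target, alpha would grow instead.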

PPO vs SAC Comparison

| Feature | PPO | SAC |
| --- | --- | --- |
| Policy type | on-policy | off-policy |
| Sample efficiency | Lower (requires many parallel environments) | Higher (can reuse historical data) |
| Experience replay | Not used | Used |
| Action space | Discrete + continuous | Primarily continuous |
| Exploration method | Policy stochasticity | Maximum entropy (more systematic) |
| Typical scenario | Large-scale parallel simulation | Real robots, small-batch data |
| Tuning difficulty | Simple | Simple (v2 automatic temperature) |

In embodied intelligence, if large-scale parallel simulation environments like Isaac Gym are available, PPO is preferred; if working with real robots or scenarios where simulation sampling is expensive, SAC is preferred.

Discussion

Why does maximum entropy improve policy robustness?

Maximum entropy encourages the policy to maintain probability across multiple viable actions, rather than concentrating all probability on a single action. This means:

  1. Even if one action is optimal in the training environment, the policy retains other backup actions
  2. When the environment changes slightly (sim-to-real gap), the policy still has other actions available
  3. This "conservative" randomness provides natural resistance to environmental uncertainty
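This behavior is visible in the soft-optimal (Boltzmann) policy pi(a) proportional to exp(Q(a)/alpha): at a moderate temperature it keeps probability on near-optimal actions instead of committing to one (a sketch with made-up Q-values):

```python
import numpy as np

def max_entropy_policy(q, alpha):
    """Soft-optimal policy pi(a) ~ exp(Q(a)/alpha) over discrete actions."""
    z = np.exp((q - q.max()) / alpha)  # subtract max for numerical stability
    return z / z.sum()

q = np.array([1.0, 0.9, 0.2])  # two near-optimal actions, one poor one
print(max_entropy_policy(q, alpha=0.5))   # weight stays on both good actions
print(max_entropy_policy(q, alpha=0.05))  # nearly greedy
```

If the sim-to-real gap flips which of the two near-optimal actions is actually best, the moderate-temperature policy barely degrades, while the nearly greedy one loses most of its value.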