Phil Winder, Oct 2020
Batch-constrained deep Q-learning (BCQ) provides experience in a different way. Rather than feeding the raw observations to the buffer-trained agent, BCQ trains another neural network, a conditional variational auto-encoder, to generate prospective actions. A conditional variational auto-encoder is a type of auto-encoder that lets you generate samples from specific classes; in BCQ it generates actions conditioned on the current state. This constrains the policy, because the agent only considers generated actions that lead to states in the buffer. You can also tune the model to generate more random actions by adding noise to the generated actions, if desired.
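To make the selection step concrete, here is a minimal sketch of BCQ-style action selection. It is not the published implementation and rests on some loud assumptions: states and actions are single floats, a nearest-neighbour lookup in the buffer stands in for the trained conditional variational auto-encoder, bounded uniform noise stands in for the tunable perturbation, and the names propose_actions, perturb, select_action, and q_value are all hypothetical.

```python
import random

def propose_actions(state, buffer, n, k, rng):
    """Stand-in for the conditional VAE (an assumption, not the real model):
    sample n actions from the k buffer entries whose states are closest."""
    nearest = sorted(buffer, key=lambda pair: abs(pair[0] - state))[:k]
    return [rng.choice(nearest)[1] for _ in range(n)]

def perturb(action, phi, rng, low=-1.0, high=1.0):
    """Shift the action by at most phi, then clip it to the valid range."""
    noisy = action + rng.uniform(-phi, phi)
    return max(low, min(high, noisy))

def select_action(state, buffer, q_value, n=10, k=3, phi=0.05, rng=None):
    """Pick the highest-valued action among perturbed, buffer-like candidates."""
    rng = rng or random.Random()
    candidates = [perturb(a, phi, rng)
                  for a in propose_actions(state, buffer, n, k, rng)]
    return max(candidates, key=lambda a: q_value(state, a))

# Toy buffer of (state, action) pairs; the last pair is far from state 0.1.
buffer = [(0.0, -0.5), (0.1, -0.4), (0.2, -0.45), (5.0, 0.9)]

# Hypothetical Q-function that always prefers larger actions.
def q_value(state, action):
    return action

action = select_action(0.1, buffer, q_value, rng=random.Random(0))
print(action)
```

Even though this toy Q-function would prefer the action 0.9, the agent never considers it, because no nearby state in the buffer used it. Raising phi widens the band around the buffered actions, and setting it to zero restricts the agent to actions it has actually seen.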