Abstract
Designing stable and efficient function approximation methods for model-based reinforcement learning (MBRL) has remained a persistent challenge, despite the topic's growing popularity. We propose a new framework that exposes the practical difficulties in MBRL and simplifies algorithmic design at a higher level of abstraction. In this framework, MBRL is cast as a two-player game: the policy player maximizes rewards under the learned model, while the model player aims to accurately fit the real-world data collected by the policy player. We show that an approximate equilibrium of this game can be found, and that it yields a near-optimal policy for the environment. Toward this goal, we develop two families of algorithms that draw on ideas from Stackelberg games. Experimental results show that the proposed methods match the asymptotic performance of model-free policy gradient approaches and achieve state-of-the-art sample efficiency. Moreover, these algorithms scale gracefully to highly complex tasks such as dexterous hand manipulation.
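As a rough sketch of the two-player framing (the notation $\pi$, $\widehat{M}$, $\mathcal{D}_{\pi}$, $\gamma$, $r$, and $\ell$ is ours and is not defined in this abstract), the coupled objectives can be written as
\[
\text{policy player:} \quad \max_{\pi} \; \mathbb{E}_{\widehat{M},\,\pi}\Big[\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t)\Big],
\qquad
\text{model player:} \quad \min_{\widehat{M}} \; \mathbb{E}_{(s,a,s') \sim \mathcal{D}_{\pi}}\Big[\ell\big(\widehat{M}(s,a),\, s'\big)\Big],
\]
where $\mathcal{D}_{\pi}$ denotes the real-world transitions collected by rolling out $\pi$, the policy player plans against the learned model $\widehat{M}$, and the model player fits that data under a prediction loss $\ell$.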