I Introduction
Robots carry the promise of becoming the future backbone of transport, warehousing, and manufacturing. For ubiquitous robotics to materialize, however, we need methods to develop robotic controllers faster and more autonomously, leveraging machine learning and scaling up current design approaches. Top computing hardware and software companies (including Nvidia [1], Google [2], and Intrinsic [3]) are now working towards fast physics-based simulations for robot learning. At the same time, because safety is a crucial component of cyber-physical systems operating in the real world, safe learning-based control and safe reinforcement learning (RL) have become bustling areas of academic research over the past few years [4].

Nonetheless, the fast-paced progress of the field risks exacerbating some of the open problems of safe learning control. The continuous influx of new contributions can hamper the ability to discern the more significant results. We need to establish ways to fairly compare results yielded by learning-based controllers that leverage very different methodologies (as well as shared tools for the development and debugging of these controllers). We also need shared definitions and, more importantly, quantitative benchmarks to assess these controllers' safety and robustness.
Our work was motivated by the lack of open-source, reusable tools for comparing learning-based control and RL research, as observed in [4]. While we acknowledge the importance of eventual real-robot experiments, here we focus on simulation as a way to lower the barrier of entry and appeal to a larger fraction of the control and RL communities. To develop safe learning-based robot control, we need a simulation API that can (i) support model-based approaches, (ii) express safety constraints, and (iii) capture real-world non-idealities (such as uncertain physical properties and imperfect state estimation). Our ambition is that our software can bring closer, support, and speed up the work of control and RL researchers, allowing them to easily compare results. We strive for simple, modular, and reusable code that leverages two open-source tools popular with each of the two communities: PyBullet's physics engine
[5] and CasADi's symbolic framework [6].

Table I: Comparison with existing safety-focused simulation suites.

| Suite | Physics Engine | Rendering Engine | Robots | Tasks | Uncertain Conditions | Constraints | Disturbances | Gym API | Symb. API |
|---|---|---|---|---|---|---|---|---|---|
| safe-control-gym | Bullet | TinyRenderer, OpenGL | Cartpole, Quadrotor | Stabilization, Traj. Track. | Inertial Param., Initial State | State, Input | State, Input, Dynamics | Yes | Yes |
| ai-safety-gridworlds [7] | n/a | Terminal | n/a | Grid Navigation | Initial State, Reward | State | Dynamics, Adversaries | No | No |
| safety-gym [8] | MuJoCo | OpenGL | Point, Car, Quadruped | Navig., Push Buttons, Box | Initial State | State | Adversaries | Yes | No |
| realworldrl-suite [9] | MuJoCo | OpenGL | Cartpole to Humanoid | Stabilization, Locomotion | Inertial Param., Initial State | State | State, Input, Reward | No | No |
The contributions and features of safe-control-gym (https://github.com/utiasDSL/safe-control-gym, Figure 1) can be summarized as follows:

- we provide open-source simulation environments with a novel, augmented Gym API (with symbolic dynamics, trajectory generation, and quadratic cost) designed to seamlessly interface with both RL and control approaches;

- safe-control-gym allows specifying constraints and randomizing a robot's initial state and inertial properties through a portable configuration system; this is crucial to simplify the development and comparison of safe learning-based control approaches;

- finally, our codebase includes open-source implementations of several baselines from traditional control, RL, and learning-based control (which we use to demonstrate how safe-control-gym supports insightful quantitative comparisons across fields).
II Related Work
Simulation environments such as OpenAI's gym [10] and DeepMind's dm_control have been proposed to standardize the development of RL algorithms. However, these often comprise toy or highly abstracted problems that do not necessarily support meaningful comparisons with traditional control approaches. Furthermore, recent work [11] has highlighted that, even with these tools, RL research is often difficult to reproduce, as it may hinge on careful hyperparameter choices or random seeds.
Faster and more accurate physics-based simulators, such as Google's Brax [2] and Nvidia's Isaac Gym [1], are becoming increasingly popular in robotics research [12]. While MuJoCo has been the dominant force behind many physics-based RL environments, it is not an open-source project. In this work, we instead leverage the Python bindings of the open-source C++ Bullet Physics engine [5], which currently powers several re-implementations of MuJoCo's original tasks as well as additional robotic simulations, including quadrotors [13] and quadrupeds.
Previous RL environment suites have touched upon the aspect of safety, although, we believe, in ways not entirely satisfactory for the development of safe robot control. DeepMind's ai-safety-gridworlds [7] is a set of RL environments meant to assess the safety properties of intelligent agents (including distributional shift, robustness to adversaries, and safe exploration). However, it is not specific to robotics, as these environments are purely grid worlds. OpenAI's safety-gym [8] and Google's realworldrl-suite [9] both augment typical RL environments with constraint evaluation. They include a handful of simplified robotic platforms, such as 2-wheeled robots and quadrupeds. Similarly to our work, realworldrl-suite [9] also includes perturbations of actions, observations, and physical quantities. Unlike our work, however, [8, 9] leverage MuJoCo and lack support for a symbolic framework to express a priori knowledge of a system's dynamics or its constraints.
While safe-control-gym also includes a Gym-style quadrotor environment, it is worth clarifying that it is intended for safe, low-level control rather than vision-based applications, like AirSim [14] or Flightmare [15], or multi-agent coordination, like gym-pybullet-drones [13].
Our work advances the state of the art (summarized in Table I) by providing (i) symbolic models of the dynamics, cost, and constraints of an RL environment, for the first time (to support traditional control and model-based approaches); (ii) customizable, portable, and reusable constraints and physics disturbances (to facilitate comparisons and enhance repeatability); and (iii) traditional control and learning-based control baselines (beyond RL baselines alone).
III Bridging Reinforcement Learning and Learning-based Control Research
As pointed out in [16, 4], despite the undeniable similarities in their setup, there still exist terminology gaps and disconnects in how optimal control and reinforcement learning research address safe robot control. In [4], reviewing the last half-decade of research in safe robot control, we observed significant differences in the use of, and reliance on, prior models and assumptions. We also found a distinct lack of open-source simulations and control implementations, which are essential for repeatability and for comparisons across fields and methodologies. With this work, we intend to make it easier for both RL and control researchers to (i) publish results based on open-source simulations, (ii) easily compare against both RL and traditional control baselines, and (iii) quantify safety against shared sets of constraints or dynamics disturbances.
IV Environments
Our open-source suite safe-control-gym comprises three dynamical systems based on two platforms (the cartpole, and the quadrotor in 1D and 2D variants) and two control tasks (stabilization and trajectory tracking). It can be downloaded and installed from its GitHub repository.
As advised in [17], "benchmark problems should be complex enough to highlight issues in controller design [...] but simple enough to provide easily understood comparisons." We include the cartpole as a dynamic system that has been widely adopted to showcase both traditional control and RL since the mid-80s [18]. All three systems in safe-control-gym are unstable: the 1D quadrotor is linear, the 2D quadrotor is nonlinear, and the cartpole is non-minimum phase. The 1D quadrotor is a simpler system that can also be used for didactic purposes.
IV-A Cartpole System
A description of the cartpole system is given in Figure 3: a cart with mass $m_c$ connects via a prismatic joint to a 1D track; a pole of mass $m_p$ and length $l$ is hinged to the cart. The state vector for the cartpole is $\mathbf{x} = [x, \dot{x}, \theta, \dot{\theta}]^\top$, where $x$ is the horizontal position of the cart, $\dot{x}$ is the velocity of the cart, $\theta$ is the angle of the pole with respect to the vertical, and $\dot{\theta}$ is the angular velocity of the pole. The input to the system is a force $F$, applied to the center of mass (COM) of the cart. In the frictionless case, the equations of motion for the cartpole system are given in [19] as:

$$\ddot{\theta} = \frac{g \sin\theta + \cos\theta \left( \frac{-F - m_p l \dot{\theta}^2 \sin\theta}{m_c + m_p} \right)}{l \left( \frac{4}{3} - \frac{m_p \cos^2\theta}{m_c + m_p} \right)} \tag{1a}$$

$$\ddot{x} = \frac{F + m_p l \left( \dot{\theta}^2 \sin\theta - \ddot{\theta} \cos\theta \right)}{m_c + m_p} \tag{1b}$$

where $g$ is the acceleration due to gravity.
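To make the model concrete, the dynamics (1) can be integrated numerically. The sketch below uses explicit Euler integration and illustrative parameter values (the cart mass, pole mass, and pole length are assumptions, not the suite's defaults):

```python
import numpy as np

# Frictionless cartpole dynamics from Eq. (1). Parameter values are
# illustrative, not safe-control-gym's defaults.
G = 9.81          # gravitational acceleration [m/s^2]
M_CART = 1.0      # cart mass m_c [kg] (assumed)
M_POLE = 0.1      # pole mass m_p [kg] (assumed)
POLE_LEN = 0.5    # pole length l [m] (assumed)

def cartpole_derivatives(state, force):
    """Time derivative of the state [x, x_dot, theta, theta_dot]."""
    _, x_dot, theta, theta_dot = state
    total_mass = M_CART + M_POLE
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    # Eq. (1a): pole angular acceleration.
    theta_ddot = (G * sin_t + cos_t * (-force - M_POLE * POLE_LEN * theta_dot**2 * sin_t) / total_mass) / (
        POLE_LEN * (4.0 / 3.0 - M_POLE * cos_t**2 / total_mass))
    # Eq. (1b): cart acceleration.
    x_ddot = (force + M_POLE * POLE_LEN * (theta_dot**2 * sin_t - theta_ddot * cos_t)) / total_mass
    return np.array([x_dot, x_ddot, theta_dot, theta_ddot])

def step(state, force, dt=0.001):
    """One explicit-Euler integration step."""
    return state + dt * cartpole_derivatives(state, force)
```

With zero input, a small initial tilt grows over time, which reflects the instability of the upright equilibrium noted above.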
IV-B 1D and 2D Quadrotor Systems
The second and third robotic systems in safe-control-gym are the 1D and the 2D quadrotor. These correspond to the cases in which the movement of a quadrotor is constrained to 1D motion along the vertical direction and to 2D motion in a vertical plane, respectively. For a physical quadrotor, these motions can be achieved by setting the four motor thrusts to balance out the forces and torques along the redundant dimensions (i.e., identical thrusts for the 1D case, or thrusts identical with respect to the plane symmetry for the 2D case). Schematics of the 1D and 2D quadrotor environments are given in Figure 3.
In the 1D quadrotor case, the state of the system is $\mathbf{x} = [z, \dot{z}]^\top$, where $z$ and $\dot{z}$ are the vertical position and velocity of the COM of the quadrotor. The input to the system is the overall thrust $T$ generated by the motors of the quadrotor. The equation of motion for the 1D quadrotor system is

$$\ddot{z} = \frac{T}{m} - g \tag{2}$$

where $m$ is the mass of the quadrotor and $g$ is the acceleration due to gravity.
In the 2D quadrotor case, the state of the system is $\mathbf{x} = [x, \dot{x}, z, \dot{z}, \theta, \dot{\theta}]^\top$, where $x$, $z$ and $\dot{x}$, $\dot{z}$ are the position and velocity of the COM of the quadrotor in the vertical plane, and $\theta$ and $\dot{\theta}$ are the pitch angle and the pitch rate, respectively. The inputs of the system are the thrusts $T_1$, $T_2$ generated by two pairs of motors (one on each side of the body's axis). The equations of motion for the 2D quadrotor system are as follows:

$$\ddot{x} = \frac{(T_1 + T_2) \sin\theta}{m} \tag{3a}$$

$$\ddot{z} = \frac{(T_1 + T_2) \cos\theta}{m} - g \tag{3b}$$

$$\ddot{\theta} = \frac{(T_2 - T_1)\, d}{I_{yy}} \tag{3c}$$

where $m$ is the mass of the quadrotor, $g$ is the acceleration due to gravity, $d = l/\sqrt{2}$ is the effective moment arm (with $l$ being the arm length of the quadrotor, i.e., the distance from each motor pair to the COM), and $I_{yy}$ is the moment of inertia about the out-of-plane axis.

IV-C Stabilization and Trajectory-tracking Tasks
All three systems in Sections IV-A and IV-B can be assigned one of two control tasks: (i) stabilization and (ii) trajectory tracking. In RL, an agent/controller's performance is expressed by the total collected reward $\sum_i r_i$. The traditional reward function for cartpole stabilization [18, 10] is simply a positive instantaneous reward for each time step in which the pole is upright (episodes are terminated when $|\theta|$ exceeds a threshold $\theta_{max}$):

$$r_i = 1 \tag{4}$$
For control-based approaches, safe-control-gym allows replacing the RL reward with the quadratic cost:

$$J = \sum_i (\mathbf{x}_i - \mathbf{x}^*)^\top Q\, (\mathbf{x}_i - \mathbf{x}^*) + (\mathbf{u}_i - \mathbf{u}^*)^\top R\, (\mathbf{u}_i - \mathbf{u}^*) \tag{5}$$

where $\mathbf{x}^*$, $\mathbf{u}^*$ is an equilibrium pair to which we want to stabilize the system and $Q$, $R$ are parameters of the cost function. The negated quadratic cost can also be used as the RL reward for the quadrotor stabilization task.
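A minimal implementation of the cost (5) and of its negation as an RL reward, with illustrative $Q$ and $R$ weights for the cartpole (the suite's actual defaults may differ):

```python
import numpy as np

# Quadratic stabilization cost from Eq. (5) for a 4-state, 1-input
# system. The weights below are illustrative, not the suite's defaults.
Q = np.diag([1.0, 0.1, 1.0, 0.1])   # state weights
R = np.array([[0.1]])               # input weight

def quadratic_cost(x, u, x_goal, u_goal):
    """Instantaneous cost (x - x*)^T Q (x - x*) + (u - u*)^T R (u - u*)."""
    dx = x - x_goal
    du = u - u_goal
    return float(dx @ Q @ dx + du @ R @ du)

def rl_reward(x, u, x_goal, u_goal):
    """Negated quadratic cost, usable as an RL reward."""
    return -quadratic_cost(x, u, x_goal, u_goal)
```

By construction, the cost is zero at the equilibrium pair and positive elsewhere, so maximizing the negated cost drives the RL agent toward the same objective as the controllers.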
For trajectory tracking, safe-control-gym includes a trajectory generation module capable of generating circular, sinusoidal, lemniscate, or square trajectories for episodes of arbitrary length in control steps; the module returns the corresponding state and input references for each step. A quadrotor example tracking different trajectories is included in our repository. The quadratic cost of trajectory tracking is computed as in (5), replacing the equilibrium pair $\mathbf{x}^*$, $\mathbf{u}^*$ with the references $\mathbf{x}_{ref,i}$, $\mathbf{u}_{ref,i}$. Again, the negated cost also serves as the RL reward function. The RL state is further augmented with the target position along the trajectory to define a valid Markov decision process.
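The trajectory-generation idea can be sketched as follows; the function names and parameterizations are hypothetical illustrations, not the suite's actual module API:

```python
import numpy as np

# Sketch of reference generation for two of the shapes mentioned
# above (circle and lemniscate). Names and signatures are illustrative.
def circle_reference(num_steps, radius=1.0, cycles=1.0):
    """(x, z) positions on a circle, one row per control step."""
    t = np.linspace(0.0, 2.0 * np.pi * cycles, num_steps)
    return np.stack([radius * np.cos(t), radius * np.sin(t)], axis=1)

def lemniscate_reference(num_steps, scale=1.0, cycles=1.0):
    """(x, z) positions on a lemniscate of Gerono (figure eight)."""
    t = np.linspace(0.0, 2.0 * np.pi * cycles, num_steps)
    return np.stack([scale * np.sin(t), scale * np.sin(t) * np.cos(t)], axis=1)
```

Each row of the returned array serves as the per-step position reference $\mathbf{x}_{ref,i}$ in the tracking cost.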
IV-D safe-control-gym Extended API
To provide native support to open-source RL libraries, safe-control-gym adopts OpenAI Gym's interface. However, to the best of our knowledge, we are the first to extend this API with the ability to provide a learning agent/controller with a priori knowledge of the dynamical system. This is of fundamental importance to also support the development of, and comparison with, learning-based control approaches (which typically leverage insights about the physics of a robotic system). We believe this prior information should not be discarded but rather integrated into the learning process. An overview of safe-control-gym's features, and of how to interact with a learning-based controller, is presented in Figure 2. Our benchmark suite can be used, for example, to answer the question of how much data efficiency (which is crucial in robot learning) is forfeited by model-free RL approaches that do not exploit prior knowledge (see Section VI). An example script in our repository runs one of safe-control-gym's environments in headless mode, with printouts from both the original Gym API and our new API.
Table II: Simulation speedups of safe-control-gym environments.

| Environment | GUI | Control Freq. | PyBullet Freq. | Constr. & Disturb. | Speedup |
|---|---|---|---|---|---|
| cartpole | Yes | – Hz | – Hz | No | – |
| cartpole | No | – Hz | – Hz | No | – |
| cartpole | No | – Hz | – Hz | Yes | – |
| quadrotor | Yes | – Hz | – Hz | No | – |
| quadrotor | No | – Hz | – Hz | No | – |
| quadrotor | No | – Hz | – Hz | Yes | – |

Environments run with default constraints and disturbances; measured on a 2.30GHz Quad-Core i7-1068NG7 with 32GB 3733MHz LPDDR4X memory.
IV-D1 Symbolic Models
We use CasADi [6], an open-source symbolic framework for nonlinear optimization and algorithmic differentiation, to include symbolic models of (i) our systems' a priori dynamics, i.e., those in Sections IV-A and IV-B, not accounting for the disturbances in Section IV-D3; (ii) the quadratic cost function from Section IV-C; and (iii) optional constraints (see Section IV-D2). These models, together with the initial state of the system and the task references, are exposed by our API in a reset_info dictionary returned by each reset of an environment.
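As an illustration of what such symbolic models enable, the sketch below linearizes the 1D quadrotor dynamics (2) symbolically to obtain the Jacobians a model-based controller needs; it uses SymPy in place of CasADi purely to keep the example dependency-light, so the API shown is SymPy's, not the suite's:

```python
import sympy as sp

# Symbolic a priori model of the 1D quadrotor, z_ddot = T/m - g,
# written as x_dot = f(x, u) with state [z, z_dot] and input T.
z, z_dot, T, m, g = sp.symbols('z z_dot T m g')
state = sp.Matrix([z, z_dot])
dynamics = sp.Matrix([z_dot, T / m - g])

# Jacobians for linearization (e.g., for LQR or linear MPC).
A = dynamics.jacobian(state)               # df/dx
B = dynamics.jacobian(sp.Matrix([T]))      # df/du
```

The resulting matrices match the double-integrator structure of (2): only gravity and the thrust-to-mass ratio enter the input channel.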
IV-D2 Constraints
The ability to specify, evaluate, and enforce one or more constraints on state $\mathbf{x}$ and input $\mathbf{u}$:

$$g_j(\mathbf{x}, \mathbf{u}) \le 0, \quad j = 1, \dots, n_c \tag{6}$$

is essential for safe robot control. While previous RL environments with state constraints exist [8, 9], our implementation is the first to also provide (i) their symbolic representation and (ii) the ability to create bespoke constraints when instantiating an environment (see Section IV-D4). Our current implementation includes default constraints and supports user-specified ones in multiple forms (linear, bounded, quadratic) on the system's state, input, or both. Constraint evaluations are included in the info dictionary returned at each environment step.
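A minimal sketch of how a linear constraint of the form (6) might be evaluated and reported; the function and its return convention are illustrative, not the suite's API:

```python
import numpy as np

# Linear state/input constraint A [x; u] <= b, evaluated the way an
# environment might report it in its info dictionary.
def evaluate_constraint(A, b, x, u):
    """Return (values, violated): g = A [x; u] - b, and whether any g > 0."""
    g = A @ np.concatenate([x, u]) - b
    return g, bool(np.any(g > 0.0))

# Example: the cartpole position bound |x| <= 2 as two half-spaces
# over the stacked vector [x, x_dot, theta, theta_dot, F].
A_pos = np.array([[1.0, 0, 0, 0, 0],
                  [-1.0, 0, 0, 0, 0]])
b_pos = np.array([2.0, 2.0])
```

Negative entries of `g` measure the margin to each constraint boundary, which is also the quantity a constraint-tightening controller such as GP-MPC reasons about.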
IV-D3 Disturbances
In developing safe control approaches, we are often confronted with the fact that models like the ones in Sections IV-A and IV-B are not a complete or fully truthful representation of the system under test. safe-control-gym provides several ways to implement non-idealities that mimic real-life robots, including:

- the randomization (from a given probability distribution) of the initial state of the system;
- the randomization (from given probability distributions) of the inertial parameters, i.e., $m_c$, $m_p$, $l$ for the cartpole and $m$, $I_{yy}$ for the quadrotor;
- disturbances (in the form of white noise, step, or impulse) applied to the action input sent from the controller to the robot;
- disturbances (in the form of white noise, step, or impulse) applied to the observations of the state returned by an environment to the controller;
- dynamics disturbances, including additional forces applied to a robot using PyBullet APIs; these can also be set deterministically from outside the environment, e.g., to implement adversarial training as in [20].
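The three disturbance shapes listed above can be sketched as simple signal generators; the magnitudes, names, and defaults are illustrative:

```python
import numpy as np

# White-noise, step, and impulse disturbance signals, as could be
# added to an action or observation channel at each control step.
def white_noise(num_steps, std=0.1, rng=None):
    """Zero-mean Gaussian noise, one sample per step."""
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.normal(0.0, std, size=num_steps)

def step_disturbance(num_steps, onset, magnitude=1.0):
    """Zero before the onset step, constant magnitude afterwards."""
    d = np.zeros(num_steps)
    d[onset:] = magnitude
    return d

def impulse_disturbance(num_steps, instant, magnitude=1.0):
    """Nonzero at a single step only."""
    d = np.zeros(num_steps)
    d[instant] = magnitude
    return d
```

Applying such a signal to the action channel, for example, amounts to stepping the environment with `u + d[i]` instead of the controller's `u`.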
IV-D4 Configuration System

As anticipated in Section I, an environment's parameters (its constraints, disturbances, and the randomization of initial states and inertial properties) can be specified through a portable configuration system rather than hard-coded, simplifying the reuse and comparison of experimental setups.
IV-E Computational Performance
Because deep learning methods can be especially data-hungry, and the ability to collect experimental datasets or generate simulated ones is one of the bottlenecks of learning-based robotics, we assessed the computational performance of safe-control-gym on a system with a 2.30GHz Quad-Core i7-1068NG7 CPU and 32GB of 3733MHz LPDDR4X memory, running Python 3.7 under macOS 11. Table II summarizes the obtained simulation speedups (with respect to the wall clock) for the cartpole and 2D quadrotor environments, in headless mode or using the GUI, with or without constraint evaluation, and for different choices of control and physics integration frequencies. In headless mode, a single instance of safe-control-gym collects data 10 to 20 times faster than real life, with accurate physics stepped by PyBullet at 1000Hz.

V Control Algorithms
The codebase of safe-control-gym also comprises an array of implementations of control approaches, ranging from traditional control to learning-based control, safe reinforcement learning, and safety-certified control.
V-A Control and Safe Control Baselines
As baselines, our benchmark suite includes standard state-feedback control approaches such as the linear quadratic regulator (LQR) and iterative LQR (iLQR) [21]. The LQR controller addresses systems with linear dynamics (as in (2)) and quadratic cost (as in (5)). For nonlinear systems (e.g., the 2D quadrotor in (3) and the cartpole in (1)), the LQR controller uses local linear approximations of the nonlinear dynamics. The iLQR controller is similar to the LQR but iteratively improves performance by finding better local approximations of the cost function (5) and system dynamics, using the state and input trajectories from the previous iteration. All the environments in safe-control-gym expose the symbolic model of their a priori dynamics, facilitating the computation of its Jacobians as well as the Jacobians and Hessians of the cost function. While we include LQR and iLQR to showcase the model-based aspect of our benchmark, the symbolic expressions of the first-order and second-order terms included in each environment can be equally leveraged by other model-based control approaches.
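As a self-contained illustration of the LQR baseline, the sketch below computes a discrete-time LQR gain for the 1D quadrotor linearized about hover. The plain Riccati iteration stands in for a library solver, and the time step and mass are assumed values, not the suite's:

```python
import numpy as np

# Discrete-time LQR for the 1D quadrotor linearized about hover:
# Euler discretization of z_ddot = T/m - g, with input u = T - m*g.
def lqr_gain(A, B, Q, R, iters=500):
    """Iterate the discrete Riccati equation to the steady-state gain K."""
    P = Q.copy()
    K = None
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

dt, m = 0.02, 0.03          # assumed control period [s] and mass [kg]
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt / m]])
K = lqr_gain(A, B, Q=np.eye(2), R=np.array([[0.1]]))
```

The control law `u = -K @ x` then stabilizes the hover equilibrium; the closed-loop matrix `A - B @ K` has spectral radius below one.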
We also include two predictive control baselines: Linear Model Predictive Control (LMPC) and Nonlinear Model Predictive Control (NMPC) [22]. At every control step, Model Predictive Control (MPC) solves a constrained optimization problem to find a control input sequence, over a finite horizon, that minimizes the cost of the system's predicted dynamics, possibly subject to input and state constraints. Then, the first optimal control input of the sequence is applied. While NMPC uses the nonlinear system model, LMPC uses a linearized approximation to predict the evolution of the system, sacrificing prediction accuracy for computational efficiency. In our codebase, CasADi's opti framework is used to formulate the optimization problem. As explained in Section IV-D, safe-control-gym provides all the system components required by MPC (a priori dynamics, constraints, cost function) as CasADi models.
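The receding-horizon idea can be sketched on a double integrator. This shows only the unconstrained core of linear MPC (each solve would become a constrained QP once state and input constraints are added, as in the suite's LMPC/NMPC); the weights, horizon, and time step are assumed:

```python
import numpy as np

# Unconstrained core of linear MPC: at each step, solve a finite-
# horizon LQ problem by a backward Riccati recursion and apply only
# the first optimal input (receding horizon).
def mpc_first_input(A, B, Q, R, x0, horizon=20):
    P = Q.copy()
    K0 = None
    for _ in range(horizon):            # backward recursion; last K is time 0
        K0 = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K0)
    return -K0 @ x0                     # first optimal input u_0

dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])   # double integrator
B = np.array([[0.0], [dt]])
Q, R = np.eye(2), np.array([[0.1]])

x = np.array([1.0, 0.0])
for _ in range(200):                    # closed loop: re-solve at every step
    u = mpc_first_input(A, B, Q, R, x)
    x = A @ x + B @ u
```

Re-solving at every step is what distinguishes MPC from a fixed LQR gain: it allows the horizon, model, or constraints to change online.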
V-B Reinforcement Learning Baselines
As safe-control-gym extends the original Gym API, any compatible RL algorithm can be directly applied to our environments. In our codebase, we include two of the most well-known RL baselines: Proximal Policy Optimization (PPO) [23] and Soft Actor-Critic (SAC) [24]. These are model-free approaches that map sensor/state measurements to control inputs (without leveraging a dynamics model) using neural network (NN)-based policies. Both PPO and SAC have been shown to work on a wide range of simulated robotics tasks, some of which involve complex dynamics. We adapted their implementations from stable-baselines3 [25] and OpenAI's Spinning Up, with a few modifications to also support our suite's configuration system. PPO and SAC are not natively safety-aware approaches and guarantee neither constraint satisfaction nor robustness (beyond the generalization properties of NNs).

V-C Safe Learning-based Control
Safe learning-based control approaches improve a robot's performance by using past data to refine the estimate of the system's true dynamics while providing guarantees on stability and/or constraint satisfaction. One such approach, included in safe-control-gym, is GP-MPC [26]. This method models the uncertain dynamics with a Gaussian process (GP), which it uses both to better predict the future evolution of the system and to tighten constraints, based on the confidence of the dynamics along the prediction horizon. GP-MPC has been demonstrated for the control of ground-based mobile robots [26]. Our implementation builds upon the LMPC controller, is based on the environments' symbolic a priori model, and uses gpytorch for GP modelling and optimization. GP-MPC can accommodate both environment- and controller-specific constraints.
V-D Safe and Robust Reinforcement Learning
Building upon the RL baselines, we implemented three safe RL approaches that address the problems of constraint satisfaction and robust generalization. The safety-layer approach in [27] pre-trains NN models to approximate linearized state constraints. These learned constraints are then used to filter potentially unsafe inputs from an RL controller via a least-squares projection. We add such a safety layer to PPO and apply it to our benchmark tasks with simple bound constraints. Robust RL aims to learn policies that generalize across systems or tasks. We adapted two methods based on adversarial learning: RARL [20] and RAP [28]. These model dynamics disturbances as a learned adversary and train the policy against increasingly stronger ones. The resulting controllers have been shown, in simulation [20, 28], to be robust to parameter mismatch. These methods can be trained directly in safe-control-gym thanks to its dynamics disturbance API (see Section IV-D3).
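For a single linearized constraint, the least-squares projection in [27] admits a closed form; the sketch below follows that formula with illustrative names (the constraint gradient and margin would come from the pre-trained NN models in the actual method):

```python
import numpy as np

# Closed-form least-squares projection for one linearized constraint
# margin + grad^T a <= 0 (notation illustrative): an unsafe action a
# is corrected by a* = a - max(0, (margin + grad^T a) / (grad^T grad)) grad.
def project_action(action, grad, margin):
    """Project the action onto the half-space margin + grad^T a <= 0."""
    violation = margin + grad @ action
    lam = max(0.0, violation / (grad @ grad))
    return action - lam * grad
```

Safe actions pass through unchanged; unsafe ones are moved the minimum Euclidean distance onto the constraint boundary.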
V-E Safety Certification of Learned Controllers
Learned controllers lacking formal guarantees can be rendered safe by safety filters. These filters minimally modify unsafe control inputs so that the applied input keeps the system's state within a safe set. Model predictive safety certification (MPSC) uses a finite-horizon constrained optimization problem with a discrete-time predictive model to prevent a learning-based controller from violating constraints [29]. In [4], we presented an implementation of MPSC for PPO that simultaneously leverages safe-control-gym's CasADi a priori dynamics and constraints and its Gym RL interface.
Control barrier functions (CBFs) are safety filters for continuous-time, nonlinear, control-affine systems that solve a quadratic program (QP) with a constraint on the CBF's time derivative along the system dynamics [30]. In the presence of model errors, the resulting errors in the CBF's time derivative can be learned by an NN [31]. Learning-based CBF filters have been applied to safely control a Segway [31] and a quadrotor [32]. Again, our CBF implementation relies on the a priori model and constraints exposed by safe-control-gym's API, and the CBF's time derivative is efficiently computed using CasADi. Constraints can be handled as long as the constraint set contains the CBF's superlevel set.
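For a single barrier constraint, the CBF QP also admits a closed-form solution. The sketch below assumes the Lie derivatives $L_f h$ and $L_g h$ are given (in practice they come from the symbolic model) and uses an illustrative linear class-K term $\alpha h$:

```python
import numpy as np

# Closed-form CBF-QP filter for one barrier constraint
#   L_f h + L_g h u >= -alpha * h:
# the desired input is minimally shifted (in the least-squares sense)
# onto the constraint boundary when it would otherwise violate it.
def cbf_filter(u_des, lf_h, lg_h, h, alpha=1.0):
    residual = lf_h + lg_h @ u_des + alpha * h
    if residual >= 0.0:
        return u_des                               # already safe: pass through
    return u_des - (residual / (lg_h @ lg_h)) * lg_h
```

This captures the defining property of a safety filter: it is the identity on safe inputs and only intervenes, minimally, on unsafe ones.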
VI Results
To demonstrate how our work supports the development and testing of all the families of control algorithms discussed in Section V, we present their control performance (Figures 4 and 5), learning efficiency (Figure 6), and constraint satisfaction (Figure 7) across identical safe-control-gym task environments. Figure 8 also demonstrates how to use our suite to test a controller's robustness to disturbances and parametric uncertainty.
As we did not focus on tuning each approach's parameters, the goal here is not to claim the superiority of one approach over another, but rather to show how safe-control-gym allows plotting RL and control results on a common set of axes.
VI-A Control Performance
In Figures 4 and 5, we show that the LQR (with both the true parameters and parameters overestimated by 50%), GP-MPC, PPO, and SAC are able to stabilize the cartpole and track the quadrotor trajectory reference. For the stabilization task, GP-MPC closely matches the closed-loop trajectory of the LQR with true parameters, although its a priori model was the same one given to the LQR with overestimated parameters. This shows how GP-MPC can overcome imperfect initial knowledge through learning. Both PPO and SAC yield substantially different closed-loop trajectories when compared to LQR and GP-MPC. This is likely a result of the difference between the reward (4) and cost (5) for the stabilization task, and of the RL observations for trajectory tracking. Indeed, the choices made in expressing the objective add a layer of complexity to the equitable comparison of RL and learning-based control. Tracking the sinusoidal trajectories (Figure 5) introduces low-frequency oscillations in the positions tracked by PPO and SAC; GP-MPC's planning horizon, on the other hand, effectively avoids these.
VI-B Learning Performance and Data Efficiency
Figure 6 shows how much data GP-MPC, PPO, and SAC require to achieve comparable performance on an identical evaluation cost. This plot showcases the type of interdisciplinary comparison enabled by safe-control-gym. In both plots, the untrained GP-MPC displays a performance that the RL approaches only match after a much larger amount of simulated experience, and GP-MPC converges to its optimal performance with roughly one tenth of the data. This highlights how learning-based control approaches can be orders of magnitude more data-efficient than model-free RL. However, this is largely the result of knowing a reasonable a priori model (whether accurate or not). The evaluation costs of PPO and SAC exhibit large oscillations and learning instability, which are not uncommon in deep RL [11]. Once converged, SAC and PPO reach performance comparable to GP-MPC on the stabilization task. PPO also matches GP-MPC on the tracking task, albeit less consistently.
VI-C Safety: Constraint Satisfaction
In Figure 7, we investigate the impact of learning and training data on the constraint violations of a learning-based controller or safe RL agent. The top plot summarizes the data efficiency of these approaches on the cartpole stabilization task. Again, by leveraging an a priori model, GP-MPC and the learning-based CBF require far fewer training samples to minimize the number of constraint violations than PPO with a safety layer. After training, GP-MPC, the learning-based CBF, and safety-layer PPO all achieve similar constraint satisfaction performance. Vanilla PPO also reduces the number of constraint violations, but cannot match the performance of GP-MPC and the learning-based CBF.
The bottom plot of Figure 7 shows reduced constraint violations for GP-MPC and safety-layer PPO on the 2D quadrotor tracking task. Compared to a linear MPC with overestimated parameters, GP-MPC meets the constraints, finding a compromise between performance and constraint satisfaction. PPO with a safety layer, on the other hand, neither tracks the desired trajectory well nor fully guarantees constraint satisfaction.
VI-D Safety: Robustness
Figure 8 shows how robust controllers and RL agents are with respect to parametric uncertainty (in the pole length) and white noise (on the input) for cartpole stabilization. The PPO variant trained with pole-length randomization improves on the robustness of baseline PPO. RAP, trained against adversarial input disturbances, shows robust performance to input noise, as expected, but not to parameter mismatch. The model-based approaches, LQR and GP-MPC, appear less affected by parameter uncertainty than model-free RL, but are equally or more hindered by input noise.
VII Conclusions and Future Work
In this letter, we introduced safe-control-gym, a suite of simulation and evaluation environments for safe learning-based control. We were motivated by the lack of an easy-to-use software benchmark exposing all the features required to support the development of approaches from both the RL and control theory communities. In safe-control-gym, we combine (i) a physics-engine-based simulation with (ii) a description of the available prior knowledge and safety constraints in a symbolic framework. By doing so, we enable the development and testing of a wide range of approaches, from model-free RL to learning-based MPC. We believe that safe-control-gym will make it easier for researchers from the RL and control communities to compare their progress, especially when quantifying safety and robustness. Our next steps include extending safe-control-gym to more robotic platforms and tasks, and implementing additional safe learning-based control approaches.
VIII Acknowledgments
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Research Chairs Program, the CIFAR AI Chair, and Mitacs’s Elevate Fellowship program.
References
 [1] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, and G. State, "Isaac Gym: High performance GPU-based physics simulation for robot learning," arXiv:2108.10470 [cs.RO], 2021.
 [2] C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem, “Brax – a differentiable physics engine for large scale rigid body simulation,” arXiv:2106.13281 [cs.RO], 2021.
 [3] W. Tan White, "Introducing Intrinsic," Jul 2021. [Online]. Available: https://blog.x.company/introducingintrinsic1cf35b87651
 [4] L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig, "Safe learning in robotics: From learning-based control to safe reinforcement learning," Annual Review of Control, Robotics, and Autonomous Systems, vol. to appear, 2021.
 [5] E. Coumans and Y. Bai, "PyBullet, a Python module for physics simulation for games, robotics and machine learning," http://pybullet.org, 2016–2021.
 [6] J. A. E. Andersson, J. Gillis, G. Horn, J. B. Rawlings, and M. Diehl, “CasADi – A software framework for nonlinear optimization and optimal control,” Mathematical Programming Computation, vol. 11, no. 1, pp. 1–36, 2019.
 [7] J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg, "AI safety gridworlds," arXiv:1711.09883 [cs.LG], 2017.
 [8] A. Ray, J. Achiam, and D. Amodei, “Benchmarking Safe Exploration in Deep Reinforcement Learning,” https://cdn.openai.com/safexpshort.pdf, 2019.
 [9] G. DulacArnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru, S. Gowal, and T. Hester, “An empirical investigation of the challenges of realworld reinforcement learning,” arXiv:2003.11881 [cs.LG], 2021.
 [10] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv:1606.01540 [cs.LG], 2016.

 [11] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep reinforcement learning that matters," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32(1). Palo Alto, CA: AAAI Press, Apr. 2018.
 [12] J. Collins, S. Chand, A. Vanderkop, and D. Howard, "A review of physics simulators for robotic applications," IEEE Access, vol. 9, pp. 51416–51431, 2021.
 [13] J. Panerati, H. Zheng, S. Zhou, J. Xu, A. Prorok, and A. P. Schoellig, "Learning to fly: a gym environment with PyBullet physics for reinforcement learning of multi-agent quadcopter control," arXiv:2103.02142 [cs.RO], 2021.
 [14] S. Shah, D. Dey, C. Lovett, and A. Kapoor, "AirSim: High-fidelity visual and physical simulation for autonomous vehicles," in Field and Service Robotics. Springer Int'l Publishing, 2018, pp. 621–635.
 [15] Y. Song, S. Naji, E. Kaufmann, A. Loquercio, and D. Scaramuzza, “Flightmare: A flexible quadrotor simulator,” in Proc. of the 4th Conference on Robot Learning. Cambridge MA, USA.: PMLR, 2020.
 [16] B. Recht, “A tour of reinforcement learning: The view from continuous control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, no. 1, pp. 253–279, 2019.
 [17] J. P. How, “Benchmarks [from the editor],” IEEE Control Systems Magazine, vol. 35, no. 1, pp. 6–7, 2015.
 [18] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC13, no. 5, pp. 834–846, 1983.
 [19] R. V. Florian, “Correct equations for the dynamics of the cartpole system,” 2007.
 [20] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta, “Robust adversarial reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning. N.p.: PMLR, 06–11 Aug 2017, vol. 70, pp. 2817–2826.
 [21] J. Buchli, F. Farshidian, A. Winkler, T. Sandy, and M. Giftthaler, “Optimal and learning control for autonomous robots,” arXiv:1708.09342 [cs.SY], 2017.
 [22] J. B. Rawlings, D. Q. Mayne, and M. M. Diehl, Model Predictive Control: Theory, Computation, and Design, 2nd ed. Nob Hill Publishing, 2020.
 [23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv:1707.06347 [cs.LG], 2017.
 [24] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in Proceedings of the 35th International Conference on Machine Learning, vol. 80. N.p.: PMLR, 10–15 Jul 2018, pp. 1861–1870.
 [25] A. Raffin, A. Hill, M. Ernestus, A. Gleave, A. Kanervisto, and N. Dormann, "Stable-Baselines3," https://github.com/DLR-RM/stable-baselines3, 2019.
 [26] L. Hewing, J. Kabzan, and M. N. Zeilinger, “Cautious model predictive control using gaussian process regression,” IEEE Transactions on Control Systems Technology, vol. 28, no. 6, pp. 2736–2743, 2020.
 [27] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, “Safe exploration in continuous action spaces,” arXiv:1801.08757 [cs.AI], 2018.
 [28] E. Vinitsky, Y. Du, K. Parvate, K. Jang, P. Abbeel, and A. Bayen, “Robust reinforcement learning using adversarial populations,” arXiv:2008.01825 [cs.LG], 2020.
 [29] K. P. Wabersich and M. N. Zeilinger, “Linear model predictive safety certification for learningbased control,” in 2018 IEEE Conference on Decision and Control (CDC). Piscataway, NJ: IEEE, 2018, pp. 7130–7135.
 [30] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada, “Control barrier functions: Theory and applications,” in 2019 18th European Control Conference (ECC). Piscataway, NJ: IEEE, 2019, pp. 3420–3431.
 [31] A. Taylor, A. Singletary, Y. Yue, and A. Ames, “Learning for safetycritical control with control barrier functions,” in Proceedings of the 2nd Conference on Learning for Dynamics and Control. N.p.: PMLR, 10–11 Jun 2020, vol. 120, pp. 708–717.
 [32] L. Wang, E. A. Theodorou, and M. Egerstedt, “Safe learning of quadrotor dynamics using barrier certificates,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). Piscataway, NJ: IEEE, 2018, pp. 2460–2465.