Smoothing policies and safe policy gradients
- dc.contributor.author Papini, Matteo
- dc.contributor.author Pirotta, Matteo
- dc.contributor.author Restelli, Marcello
- dc.date.accessioned 2023-03-15T14:15:35Z
- dc.date.available 2023-03-15T14:15:35Z
- dc.date.issued 2022
- dc.description.abstract Policy gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address a specific safety formulation, where both goals and dangers are encoded in a scalar reward signal and the learning agent is constrained to never worsen its performance, measured as the expected sum of rewards. By studying actor-only PG from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of PG estimators, allows us to identify meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimates. Through a joint, adaptive selection of these meta-parameters, we obtain a PG algorithm with monotonic improvement guarantees.
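The abstract's two key meta-parameters, the step size of the parameter updates and the batch size of the gradient estimates, can be illustrated with a toy sketch. The adaptive rules below are simplified placeholders of my own, not the paper's actual high-probability improvement schedules, and the one-step Gaussian bandit is a hypothetical example problem:

```python
import math
import random

def safe_pg_sketch(target=2.0, sigma=0.5, iters=200, seed=0):
    """Toy actor-only policy gradient on a one-step Gaussian bandit.

    Illustrative only: the step-size and batch-size rules here are crude
    stand-ins for the paper's variance-based meta-parameter schedules.
    """
    rng = random.Random(seed)
    mu = 0.0       # the single policy parameter (mean of the Gaussian policy)
    batch = 10     # batch size of the gradient estimate
    step = 0.1     # step size of the parameter update
    for _ in range(iters):
        grads = []
        for _ in range(batch):
            a = rng.gauss(mu, sigma)                    # sample an action
            r = -(a - target) ** 2                      # scalar reward signal
            grads.append(r * (a - mu) / sigma ** 2)     # REINFORCE score estimate
        g = sum(grads) / batch
        # Crude variance proxy; the paper instead derives upper bounds on
        # the variance of the PG estimator to pick (step, batch) jointly.
        var = sum((x - g) ** 2 for x in grads) / max(batch - 1, 1)
        if var > 50.0:
            batch = min(batch * 2, 160)    # noisier estimate -> larger batch
        step = min(0.1, 1.0 / (1.0 + math.sqrt(var)))  # noisier -> smaller step
        mu += step * g
    return mu
```

Run as `safe_pg_sketch()`: the policy mean drifts toward the reward-maximizing action while the batch grows and the step shrinks whenever the gradient estimate is noisy, mimicking (very loosely) how a joint, adaptive selection of these meta-parameters trades estimation error against update size.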
- dc.description.sponsorship Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. There has been no significant financial support for this work that could have influenced its outcome. M. Papini was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 950180).
- dc.format.mimetype application/pdf
- dc.identifier.citation Papini M, Pirotta M, Restelli M. Smoothing policies and safe policy gradients. Mach Learn. 2022;111(11):4081-137. DOI: 10.1007/s10994-022-06232-6
- dc.identifier.doi http://dx.doi.org/10.1007/s10994-022-06232-6
- dc.identifier.issn 0885-6125
- dc.identifier.uri http://hdl.handle.net/10230/56240
- dc.language.iso eng
- dc.publisher Springer
- dc.relation.ispartof Machine Learning. 2022;111(11):4081-137.
- dc.relation.isreferencedby https://github.com/T3p/potion
- dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/950180
- dc.rights © The Author(s) 2022. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
- dc.rights.accessRights info:eu-repo/semantics/openAccess
- dc.rights.uri http://creativecommons.org/licenses/by/4.0/
- dc.subject.keyword Reinforcement learning
- dc.subject.keyword Safe learning
- dc.subject.keyword Policy gradient
- dc.subject.keyword Monotonic improvement
- dc.title Smoothing policies and safe policy gradients
- dc.type info:eu-repo/semantics/article
- dc.type.version info:eu-repo/semantics/publishedVersion