Provably efficient offline reinforcement learning in regular decision processes

  • dc.contributor.author Cipollone, Roberto
  • dc.contributor.author Ronca, Alessandro
  • dc.contributor.author Jonsson, Anders
  • dc.contributor.author Talebi, Mohammad Sadegh
  • dc.date.accessioned 2025-01-27T13:54:03Z
  • dc.date.available 2025-01-27T13:54:03Z
  • dc.date.issued 2023
  • dc.description.abstract This paper deals with offline (or batch) Reinforcement Learning (RL) in episodic Regular Decision Processes (RDPs). RDPs are the subclass of Non-Markov Decision Processes where the dependency on the history of past events can be captured by a finite-state automaton. We consider a setting where the automaton that underlies the RDP is unknown, and a learner strives to learn a near-optimal policy using pre-collected data, in the form of non-Markov sequences of observations, without further exploration. We present RegORL, an algorithm that suitably combines automata learning techniques and state-of-the-art algorithms for offline RL in MDPs. RegORL has a modular design allowing one to use any off-the-shelf offline RL algorithm in MDPs. We report a non-asymptotic high-probability sample complexity bound for RegORL to yield an ε-optimal policy, which involves a notion of concentrability relevant for RDPs. Furthermore, we present a sample complexity lower bound for offline RL in RDPs. To the best of our knowledge, this is the first work presenting a provably efficient algorithm for offline learning in RDPs. (An illustrative sketch of the RegORL pipeline follows the record below.)
  • dc.description.sponsorship Roberto Cipollone is partially supported by the EU H2020 project AIPlan4EU (No. 101016442), the ERC-ADG project WhiteMech (No. 834228), the EU ICT-48 2020 project TAILOR (No. 952215), the PRIN project RIPER (No. 20203FFYLK), and the PNRR MUR project FAIR (No. PE0000013). Anders Jonsson is partially supported by the EU ICT-48 2020 project TAILOR (No. 952215), AGAUR SGR, and the Spanish grant PID2019-108141GB-I00. Alessandro Ronca is partially supported by the ERC project WhiteMech (No. 834228), and the ERC project ARiAT (No. 852769). Mohammad Sadegh Talebi is partially supported by the Independent Research Fund Denmark, grant number 1026-00397B.
  • dc.format.mimetype application/pdf
  • dc.identifier.citation Cipollone R, Ronca A, Jonsson A, Talebi MS. Provably efficient offline reinforcement learning in regular decision processes. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S, editors. Advances in Neural Information Processing Systems 36 (NeurIPS 2023); 2023 Dec 10-16; New Orleans.
  • dc.identifier.doi https://doi.org/10.48550/arXiv.2412.19194
  • dc.identifier.uri http://hdl.handle.net/10230/69308
  • dc.language.iso eng
  • dc.publisher Neural Information Processing Systems Foundation
  • dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/101016442
  • dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/834228
  • dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/952215
  • dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/852769
  • dc.relation.projectID info:eu-repo/grantAgreement/ES/2PE/PID2019-108141GB-I00
  • dc.rights Copyright © (2023) by individual authors and Neural Information Processing Systems Foundation Inc. All rights reserved. This is the accepted manuscript version of the paper. The final version is available online from the Neural Information Processing Systems Foundation at: https://proceedings.neurips.cc/paper_files/paper/2023/hash/7bf3e93543a612b75b6373178ba1faa4-Abstract-Conference.html
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.subject.keyword Reinforcement learning
  • dc.subject.keyword Regular decision processes
  • dc.subject.keyword Non-Markov decision processes
  • dc.subject.keyword RegORL
  • dc.title Provably efficient offline reinforcement learning in regular decision processes
  • dc.type info:eu-repo/semantics/conferenceObject
  • dc.type.version info:eu-repo/semantics/acceptedVersion
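
For readers who want a concrete picture of the two-stage design described in the abstract, the Python fragment below illustrates how automata learning and an off-the-shelf offline RL algorithm could be composed. It is a minimal sketch based only on the abstract: the names `reg_orl`, `relabel`, `learn_automaton`, and `offline_rl`, and all signatures, are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch of the RegORL pipeline described in the abstract.
# All names and signatures here are illustrative assumptions, not the
# paper's actual code or notation.

from typing import Callable, Hashable, List, Tuple

# An episode is a non-Markov sequence of (observation, action, reward) triples.
Episode = List[Tuple[Hashable, Hashable, float]]
# A learned automaton is summarized by its transition function:
# (current automaton state, observation) -> next automaton state.
Transition = Callable[[int, Hashable], int]


def relabel(dataset: List[Episode], delta: Transition) -> List[Episode]:
    """Replace each observation with the automaton state reached so far,
    turning non-Markov episodes into (approximately) Markov ones."""
    markov_dataset = []
    for episode in dataset:
        state, markov_episode = 0, []  # assume 0 is the initial state
        for obs, action, reward in episode:
            state = delta(state, obs)
            markov_episode.append((state, action, reward))
        markov_dataset.append(markov_episode)
    return markov_dataset


def reg_orl(
    dataset: List[Episode],
    learn_automaton: Callable[[List[Episode]], Transition],
    offline_rl: Callable[[List[Episode]], Callable],
) -> Callable:
    """Two-stage pipeline: (1) learn the RDP's underlying finite-state
    automaton from pre-collected episodes; (2) relabel the data with
    automaton states and hand it to any offline RL algorithm for MDPs."""
    delta = learn_automaton(dataset)           # stage 1: automata learning
    markov_dataset = relabel(dataset, delta)   # map histories to states
    return offline_rl(markov_dataset)          # stage 2: offline RL in the MDP


# Usage (illustrative): reg_orl(episodes, my_automaton_learner, my_offline_rl)
```

The modularity claim in the abstract corresponds to the `offline_rl` parameter above: under this reading, any off-the-shelf offline RL algorithm for MDPs can be plugged into the second stage, with the sample complexity bound inheriting a concentrability notion adapted to RDPs.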