This thesis explores the role of uncertainty estimation during training in Reinforce-
ment Learning as a potential way of increasing sample efficiency, acting as a regu-
lator between two subsystems that shape a policy: memory and stimulus-response.
Memory-based subsystems are related to Episodic Reinforcement Learning, where
exact snapshots or sequences of tuples generated during training are stored and then
retrieved to perform the action that maximizes reward based solely on these past
experiences. This mode of learning is closely related to how the hippocampus operates
in the brain. In contrast, stimulus-response subsystems can be expressed as models
that map states to actions in a model-free fashion. In humans and other animals, the
dorsal striatum is responsible for making this stimulus-response mapping. However,
this mapping process does not take into account the inherent uncertainty or variability of stimuli (i.e., perceptual uncertainty) in stochastic, partially observable environments, so the optimal policy may sometimes be to rely more on the sequential nature of (model-based) memory. Several studies have shown that uncertainty plays a significant role in decision-making; we therefore studied how it can arbitrate between the two systems. Concretely, we used an agent based on the Distributed Adaptive Control (DAC-ML) cognitive architecture, comprising the two subsystems and an arbitration module that regulated their respective use based on the entropies of their policies. The agent was trained on a foraging task and exhibited dynamics aligned with human behaviour: the memory-based system dominates at first and, throughout training, the stimulus-response system slowly takes over. This research could lead to more flexible and efficient Reinforcement Learning algorithms that combine different ways of learning
and operating depending on the available knowledge about the environment.