In real-world scenarios, solving a great majority of problems requires different agents
to cooperate under the condition of local observations. Fortunately, in recent years
significant advances in Multi-Agent Reinforcement Learning have been made in this regard.
To tackle this kind of problem, many approaches are based on Centralized Training with
Decentralized Execution, which allows the agents to be trained in a simulated environment
where they have access to global information, so that they can later solve the problem
relying only on local observations. Two popular methods are Value-Decomposition Networks
(VDN) and QMIX. They approach the problem by computing the joint action-value function
$Q_{tot}$ as a combination of the individual action-value functions $Q_a$, which condition
only on individual action-observation histories.
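To make this decomposition concrete, the two factorizations can be summarised as follows
(a standard sketch of the VDN and QMIX factorizations; the notation $\boldsymbol{\tau}$ for
the joint action-observation history, $\mathbf{u}$ for the joint action, $s$ for the global
state, and $f_{mix}$ for QMIX's state-conditioned mixing network is introduced here for
illustration):
\begin{align}
  \text{VDN:}  \quad & Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{a=1}^{n} Q_a(\tau^a, u^a), \\
  \text{QMIX:} \quad & Q_{tot}(\boldsymbol{\tau}, \mathbf{u}, s) = f_{mix}\!\big(Q_1(\tau^1, u^1), \dots, Q_n(\tau^n, u^n); s\big),
  \quad \text{with } \frac{\partial Q_{tot}}{\partial Q_a} \ge 0 \ \ \forall a.
\end{align}
The monotonicity constraint is what allows each agent to select the argmax of its own $Q_a$
at execution time while still maximising $Q_{tot}$.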
Specifically, this work focuses on QMIX, which has gained considerable popularity in recent
years due to its capacity to compute a richer representation of the joint action-value
function than VDN by combining the individual Q-values in a non-linear way. However,
although many improvements to QMIX have been proposed, there has been comparatively little
progress on how different exploration techniques could boost learning in this context. In
this work, an experimental evaluation shows that, in several cases, some exploration methods
outperform the $\epsilon$-greedy approach used in the original implementation of QMIX.
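For reference, the baseline exploration scheme against which the alternatives are compared
can be sketched as decentralized $\epsilon$-greedy action selection over the individual
Q-values. The snippet below is a minimal illustration only; the function name and the
NumPy-based setup are assumptions and not part of the original QMIX implementation.

import numpy as np

def epsilon_greedy_actions(agent_q_values, epsilon, rng=None):
    """Decentralized epsilon-greedy action selection.

    agent_q_values: list of 1-D arrays, one per agent, holding that agent's
    Q_a(tau^a, .) estimates over its available actions (illustrative only;
    in QMIX these would come from each agent's recurrent Q-network).
    """
    rng = rng or np.random.default_rng()
    actions = []
    for q in agent_q_values:
        if rng.random() < epsilon:
            # Explore: pick a uniformly random action.
            actions.append(int(rng.integers(len(q))))
        else:
            # Exploit: pick the action with the highest individual Q-value.
            actions.append(int(np.argmax(q)))
    return actions

# Example: two agents with three actions each and a small exploration rate.
qs = [np.array([0.1, 0.7, 0.2]), np.array([0.4, 0.3, 0.9])]
print(epsilon_greedy_actions(qs, epsilon=0.05))

Because both factorizations above are monotonic in every $Q_a$, this per-agent greedy step
recovers the joint greedy action with respect to $Q_{tot}$, which is what makes decentralized
execution possible.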