In nature we find all kinds of multi-agent systems sustained by cooperative behaviours.
In this work, we study multi-agent systems by means of the Stag-Hunt
game, which presents a conflict between mutual benefit and personal risk. In particular,
we consider the probabilistic inference approach for reinforcement learning
on a grid-based variant of this game. We analyze the behavior of two different
policy gradient algorithms in the presence of function approximation: the standard
REINFORCE algorithm and the Cross-Entropy (CE) method, which differ in the
functional form of the loss. Although both REINFORCE and CE share
the same globally optimal solution, we find that REINFORCE behaves too
greedily compared with CE. In agreement with previous results based on probabilistic
graphical models, we obtain two qualitatively different optimal solutions (risk-
and payoff-dominant) as a function of a temperature parameter, a transition
that is better observed using the CE method. We also analyze the difference between
using only an end-cost and additionally including a path-cost. It is known that adding a path-cost
makes the problem harder when using an explicit probabilistic graphical model, since
it increases the tree-width. Nevertheless, we observe the opposite effect for policy
gradient methods, for which the path-cost enhances the performance of the resulting
controls in all circumstances. This is because the samples used by policy
gradient methods are generally more informative when the path-cost is included. Finally, we also consider a
distributed version of the algorithm, with partial observability and feature sharing
between the agents. In this setting, we show the feasibility of generalizing to larger
grids using training data from smaller grids.
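
To make the contrast between the two losses concrete, the following is a minimal sketch in the control-as-inference spirit; the symbols (trajectory $\tau$, cost $C(\tau)$, parameterized trajectory distribution $q_\theta$, temperature $\lambda$) and the exact functional forms are illustrative assumptions, not the definitions used in the paper. A REINFORCE-style estimator weights each sampled trajectory's score function by its cost, whereas a CE-style estimator weights it by an exponentiated, temperature-scaled cost:
$$
\nabla_\theta L_{\mathrm{RF}}(\theta) \;=\; \mathbb{E}_{\tau\sim q_\theta}\!\big[\, C(\tau)\,\nabla_\theta \log q_\theta(\tau) \,\big],
\qquad
\nabla_\theta L_{\mathrm{CE}}(\theta) \;\propto\; -\,\mathbb{E}_{\tau\sim q_\theta}\!\big[\, e^{-C(\tau)/\lambda}\,\nabla_\theta \log q_\theta(\tau) \,\big].
$$
Under this sketch, both estimators reweight the same score function $\nabla_\theta \log q_\theta(\tau)$, but the CE weights depend exponentially on the cost through $\lambda$ and concentrate on low-cost samples, which is one possible reading of how the functional form of the loss, and with it the temperature dependence, differs between the two methods.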