Does Reasoning Enhance Learning?*

Nicolaas J. Vriend, Universitat Pompeu Fabra, Barcelona, Spain

October 1996

Abstract: Utilizing the well-known Ultimatum Game, this note presents the following phenomenon. If we start with simple stimulus-response agents, learning through naive reinforcement, and then grant them some introspective capabilities, we get outcomes that are not closer to but farther away from the fully introspective game-theoretic approach. The cause of this is the following: there is an asymmetry in the information that agents can deduce from their experience, and this leads to a bias in their learning process.

J.E.L. classification codes: C7, D8
Keywords: Ultimatum Game, Game Theory, Reasoning, Reinforcement Learning

N.J. Vriend, Universitat Pompeu Fabra, Dept. Economics, Ramon Trias Fargas 25-27, 08005 Barcelona, Spain, http://www.santafe.edu/~vriend/.

* I wish to thank Rosemarie Nagel and Al Roth for comments on earlier thoughts along these lines. All errors and responsibilities are mine.

1. Introduction

One of the games most extensively studied in the literature during the last decade is the Ultimatum Game. The reason this game is so intriguing seems to be that the game-theoretic analysis is straightforward and simple, while the overwhelming experimental evidence is equally straightforward but at odds with the game-theoretic analysis (see e.g., Güth et al. [1982], Güth & Tietz [1990], or Thaler [1988]). Here is the basic form of the Ultimatum Game: there are two players, A and B, and a cake. Player A proposes how to split the cake between herself and player B. Upon receiving player A's proposal, player B has two options: first, to accept the proposal, which will then be executed; second, to reject it, after which both get nothing. Many different variants of this basic setup have been considered in the literature, but the details are not essential for this note.

There are many Nash equilibria in this game. Every strategy for player A combined with any strategy for player B that accepts that offer but rejects all worse offers is one. But, considering the extensive form variant of the game, there is a unique subgame perfect equilibrium: player A offers the minimum portion, and player B accepts it. Empirical evidence shows time and again that this is not what happens in most games played in the laboratory. Players A usually offer slightly less than half the cake to players B, and players B usually reject small offers. In this note we will concentrate on the behavior of players A, and assume that players B play the perfect equilibrium strategy.

Concerning player A's behavior, there are two main explanations for the 'anomaly' offered in the literature. First, some argued that fairness and reciprocity considerations are the force driving players A to offer players B more than the game-theoretic approach would suggest (see e.g., Forsythe et al. [1994]). A second explanation found in the literature is that players A are basically following an adaptive, best-reply seeking approach to the behavior of players B. In a multi-period setup where players played the game repeatedly but against different opponents, some papers showed how it can happen that players A unlearn to play the perfect equilibrium strategy before players B learn that they should play their perfect equilibrium strategy. And once players A no longer play that strategy, players B will never learn to play theirs (see Roth & Erev [1995], who apply a reinforcement learning approach, while Gale et al. [1995] use replicator dynamics).[1]

[1] For a detailed comparative analysis of replicator dynamics and reinforcement learning, on which we focus in this note, see Börgers & Sarin [1995].
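Purely to illustrate these game-theoretic statements, here is a minimal sketch (our own, in Python; nothing of the kind appears in the note) that enumerates the equilibria of a small discrete version of the game. The cake size of 6, the restriction of player B to reservation-value ("accept iff the offer is at least t") strategies, and all names are assumptions made only for this illustration.

```python
# Illustrative only: equilibria of a discrete Ultimatum Game with integer offers,
# where player B's strategies are reservation-value rules "accept iff offer >= t".
PIE = 6  # hypothetical cake size

def payoffs(offer, threshold):
    """(A, B) payoffs when A offers `offer` and B accepts iff offer >= threshold."""
    return (PIE - offer, offer) if offer >= threshold else (0, 0)

def is_nash(offer, threshold):
    """Neither player can gain by a unilateral deviation."""
    a_pay, b_pay = payoffs(offer, threshold)
    best_a = max(payoffs(x, threshold)[0] for x in range(PIE + 1))
    best_b = max(payoffs(offer, t)[1] for t in range(PIE + 1))
    return a_pay == best_a and b_pay == best_b

def is_subgame_perfect(offer, threshold):
    """B's rule must also be optimal after every possible offer, not just the one made."""
    b_always_optimal = all(
        payoffs(x, threshold)[1] == max(payoffs(x, t)[1] for t in range(PIE + 1))
        for x in range(PIE + 1)
    )
    return is_nash(offer, threshold) and b_always_optimal

nash = [(x, t) for x in range(PIE + 1) for t in range(PIE + 1) if is_nash(x, t)]
spe = [e for e in nash if is_subgame_perfect(*e)]
print("Nash equilibria (offer, threshold):", nash)  # includes every pair (t, t) for t = 0..PIE
print("Subgame perfect:", spe)                      # only (0, 0) and (1, 1): A offers the minimum, B accepts
```

The threshold parametrization anticipates the reservation-value property that the note attributes to player B's behavior in section 3.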
This note offers an additional possible explanation. If we start with simple stimulus-response agents, learning through naive reinforcement (see next section), and then grant them some introspective capabilities, we get outcomes that are not closer to but farther away from the fully introspective game-theoretic approach. The cause of this is the following: there is an asymmetry in the information that agents can deduce from their experience, and this leads to a bias in their learning process. It is a simple point, but I have not seen it made in the literature. Section 2 will explain what reinforcement learning is, and the difference between naive reinforcement learning and reinforcement learning plus reasoning. Section 3 shows how an information asymmetry leads to a distortion of the learning process. Section 4 discusses and relaxes the assumptions made in section 3, and section 5 will conclude.

2. Actual and Virtual Reinforcement Learning

Standard game theory is completely based on introspection. Given the agents' reasoning capabilities, actually playing games seems inessential.[2] If we imagine a rationality axis, the fully introspective players of standard game theory would be at one of the extremes of this axis. Close to the opposite end of the spectrum we would find boundedly rational agents without any reasoning capabilities. They act like conditioned animals, or like simple stimulus-response machines. These agents behave adaptively to their environment in that they experiment by trying actions, and actions that led in the past to better outcomes are more likely to be chosen again in the future. This type of behavior is nowadays known as reinforcement learning. There is a family of stochastic dynamic models of such individual behavior in the scientific literature, for which different backgrounds can be distinguished. The idea was first developed in the psychological literature; see especially Hull [1943], and Bush & Mosteller [1955], on which Cross [1983] is based. Much later reinforcement learning was independently reinvented twice as a machine learning approach in computer science. See e.g., Barto et al. [1983], and Sutton [1992] for a survey. The other reinforcement learning approach in computer science is known as Classifier Systems; see Holland [1975] for early ideas on this, or Holland et al. [1986] for a much more elaborate treatment. In the economics literature reinforcement learning became better known only very recently through Roth & Erev [1995]. For expositional convenience, in this note we will follow the Roth & Erev approach to reinforcement learning.[3]

[2] Here is an unverified anecdote illustrating this. A famous economist who had just published a book about the game of Bridge met a colleague who expressed his surprise about him being a player of this game. "I don't play Bridge", was his reaction, "I wrote a book about it".

[3] Reinforcement learning differs from some other important recent models of dynamics with experimentation in the economics literature (see e.g., Ellison [1993], Kandori et al. [1993], or Young [1993]). In those evolutionary dynamic models, adaptive behavior is basically a one-step error correction mechanism. The agents have a well-specified model of the game, they can reason what the optimal action would be, given the actions of the other players, completely independent from any payoff actually experienced, and they play a best-response strategy against the frequency distribution of a given (sub-)population of other players. The evolutionary dynamics consist of a co-evolutionary adaptive process, players adapting to each others' adaptation to each other ..., plus experimentation in the form of trembling. The very first task for a reinforcement learning algorithm, on the other hand, is to learn what would be good actions. The agents do not have a well-specified model of their environment, and they do not know which action would be the best response. See Vriend [1994] for a more extensive discussion of these issues.
In computer science the main reason for the interest in reinforcement learning is its success in performing difficult tasks. In psychology reinforcement learning is mainly judged on its success in explaining empirical evidence of human subjects in experiments. And the fact that this kind of model seems consistent with the robust properties of learning that have been observed in the large experimental psychology literature is also the main line of thought in the application of reinforcement learning models in the experimental economics literature (although Roth & Erev have some deeper ideas, on which more below). Besides these reasons to take reinforcement learning seriously, we take this general class of models as a starting point because it concerns the most basic form of learning. It is the simplest possible type of learning model, assuming the most limited cognitive capabilities, no reasoning or introspection at all, while still allowing for learning. Hence, besides the standard game-theoretic approach, this kind of model functions as a second benchmark.

Consider the Ultimatum Game. Assume player A has no knowledge of game theory, the structure of the Ultimatum Game, or the behavior of players B. Player A is a boundedly rational agent who behaves adaptively to her environment. She simply tries actions, and is in the future more likely to repeat those actions that led to high payoffs in the past than those that led to low payoffs. One can imagine player A as playing a multi-armed bandit, where different arms might give different payoffs, and player A does not know at the start which is the best arm to pull. This is the basic reinforcement learning approach (see Roth & Erev [1995]): at time t=1 each player has an initial propensity to play his kth pure strategy, given by some real number q_k(1). If a player plays his kth pure strategy at time t and receives a payoff of z, then the propensity to play strategy k is updated by setting q_k(t+1) = q_k(t) + z, while for all other pure strategies j, q_j(t+1) = q_j(t). The probability that the player plays his kth pure strategy at time t is p_k(t) = q_k(t)/Σ_j q_j(t), where the sum is over all of the player's pure strategies j. We will call this naive reinforcement learning, or learning through actual reinforcement.
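As a minimal sketch of this updating rule in code (our own illustration; the class name, the uniform initial propensities q_k(1)=1, and the scenario of a player B who accepts every offer are assumptions made only here, though the last one is adopted by the note itself in section 3):

```python
import random

class NaiveReinforcementLearner:
    """Roth-Erev style learner: propensities grow by realized payoffs only."""

    def __init__(self, strategies, initial_propensity=1.0):
        # q_k(1): initial propensity for each pure strategy k
        self.q = {k: initial_propensity for k in strategies}

    def choose(self):
        # p_k(t) = q_k(t) / sum_j q_j(t)
        strategies = list(self.q)
        total = sum(self.q.values())
        return random.choices(strategies, weights=[self.q[k] / total for k in strategies])[0]

    def reinforce(self, k, payoff):
        # q_k(t+1) = q_k(t) + z; all other propensities stay unchanged
        self.q[k] += payoff

# Example: player A repeatedly proposes offers from a cake of size 6 to a
# player B who (as the note assumes in section 3) accepts every offer.
PIE = 6
player_a = NaiveReinforcementLearner(strategies=range(PIE + 1))
for _ in range(1000):
    offer = player_a.choose()
    payoff = PIE - offer          # accepted offer x yields PIE - x to player A
    player_a.reinforce(offer, payoff)
print(player_a.q)                 # propensities pile up on the low offers
```

Because accepted low offers earn the highest payoffs, the propensities, and with them the choice probabilities, concentrate on the low offers; this is the naive benchmark against which the extensions below are compared.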
There are two obvious ways to extend the basic reinforcement learning mechanism. First, instead of learning only on the basis of his own actions and payoffs, an agent can learn from other agents' actions and payoffs. He can, then, reinforce his propensities to choose an action as if he had tried those actions, and experienced the payoffs himself. This is called learning through vicarious reinforcement (see Bandura [1977]).[4] We will not pursue that track here. Second, instead of learning on the basis of actions and payoffs actually experienced by himself or others, an agent can learn by reasoning about counterfactuals, i.e., about hypothetical actions and payoffs. He can, then, reinforce his propensities to choose an action as if he had actually tried those actions, and experienced the payoffs. We will call this learning through virtual reinforcement (see Holland [1990]). Clearly, virtual reinforcement requires that one relaxes the restriction on player A's cognitive capabilities, and assumes that player A has some reasoning capabilities. In the Ultimatum Game the arms of the bandit can be ordered naturally from low offers to high offers. For example, when an offer x has been accepted by player B, player A can reason that player B would also have accepted all offers x'>x. Or when an offer x has been rejected, player A can reason that offers x''<x would have been rejected too. Notice, however, that the information player A can deduce is asymmetric: when an offer x has been accepted, player A gets information about all higher offers x'>x, but player A does not get information about what player B would have done with offers x''<x.

[4] Notice that this is more sophisticated than imitation. It is only the propensity to choose an action that is updated on the basis of another agent's action and consequence.

3. Effect of the Information Asymmetry: the Basic Model

We will now analyze the effect of this information asymmetry on the reinforcement learning process. What we will do is start with some extremely simplifying assumptions to make our point. In section 4 the basic assumptions of table 1 will be discussed and relaxed.

i) The cake has size Π.
ii) The only possible offers x to player B are 0, 1, 2, ..., Π.
iii) Player B plays the perfect equilibrium strategy, and accepts every offer.[5]
iv) Player A tries every action equally often, say n times.
v) In case of acceptance the payoff to player A is simply Π−x. In case of rejection it would be 0.
vi) Reinforcement takes place as explained above: the payoff realized by playing offer x is simply added to the propensity to choose offer x.
vii) Only actual reinforcement learning takes place.

Table 1: basic assumptions

[5] To simplify the presentation we assume that a 0 offer is accepted as well.

Proposition 1: Under assumptions i) to vii), the most reinforced offer will be x=0.

Proof: After player A has tried each possible offer n times, the propensity to choose a certain offer x will have increased by n times a payoff of (Π−x). Formally, r(x) = n (Π−x), which has a maximum at x=0.

Next, we assume that player A reasons about counterfactuals as well. Hence we keep assumptions i) to vi), but instead of vii), we make the assumption given in table 2.

vii') Player A reasons that player B's behavior has a reservation value property. If offer x is accepted, offers x'>x would have been accepted too. Therefore, player A applies both actual and virtual reinforcement.

Table 2: assumption virtual reinforcement

Proposition 2: Under assumptions i) to vi) plus vii'), the most reinforced offer will be the offer x satisfying x>(Π−2)/2 while (x−1)≤(Π−2)/2.

Proof: After player A has tried each possible offer n times, the propensity to choose a certain offer x will have increased by n times a payoff of (Π−x) through the actual reinforcement. In addition, the virtual reinforcement for an offer x will be x times this actual reinforcement (we will give the intuition below). Hence, the total reinforcement for offer x is r(x) = n (x+1) (Π−x). Taking first differences gives r(x+1)−r(x) = n (Π−2x−2). Hence, ∆r(x)<0 if x>(Π−2)/2.
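To make the two regimes of Propositions 1 and 2 concrete, here is a small bookkeeping sketch (ours, not the note's; `pie` stands for Π). It credits each accepted offer with its actual payoff and, in the virtual variant, additionally credits every higher offer with the payoff it would have earned, as assumption vii') prescribes.

```python
def reinforcements(pie, n, virtual):
    """Total reinforcement r(x) after every offer 0..pie has been accepted n times.

    virtual=False: only the realized payoff (pie - x) is credited to the offer made
    (assumptions i-vii).  virtual=True: an accepted offer x additionally credits
    every higher offer x' > x with its hypothetical payoff pie - x' (assumption vii').
    """
    r = {x: 0 for x in range(pie + 1)}
    for x in range(pie + 1):              # each offer is tried (and accepted) ...
        for _ in range(n):                # ... n times (assumption iv)
            r[x] += pie - x               # actual reinforcement
            if virtual:
                for higher in range(x + 1, pie + 1):
                    r[higher] += pie - higher   # virtual reinforcement of offers x' > x
    return r

# The totals agree with the closed forms used in the proofs:
pie, n = 10, 3
assert reinforcements(pie, n, virtual=False) == {x: n * (pie - x) for x in range(pie + 1)}
assert reinforcements(pie, n, virtual=True) == {x: n * (x + 1) * (pie - x) for x in range(pie + 1)}
```

The virtual branch is exactly where the asymmetry enters: an acceptance propagates credit upwards to higher offers only, never downwards.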
Notice that the only difference from the case of only actual reinforcement is the factor (x+1). Since (x+1)≥1, all propensities are updated more often now. That is, the virtual updating in addition to the actual updating makes the learning process faster. But there is also a second effect: the information asymmetry. Table 3 helps to explain this.

offer x to player B     total reinforcement r(x)
0                       n (0+1) (Π−0)
1                       n (1+1) (Π−1)
2                       n (2+1) (Π−2)
...                     ...
Π                       n (Π+1) (Π−Π)

Table 3: actual plus virtual reinforcement

Consider the first line in table 3, i.e., the offer x=0. The propensity of this offer will be updated only when x=0 has actually been accepted. If other offers x>0 are accepted, player A will not know whether x=0 would have been acceptable too. Next, consider the offer x=1. This propensity will be reinforced actually when x=1 has been accepted, and in addition virtually when x=0 has been accepted. One can continue this up to the largest possible offer x=Π. If x=Π has been accepted by player B, player A will update this propensity actually. But player A will also reason after every offer x<Π that an offer x=Π would have been acceptable as well. Hence, x=Π will be updated (Π+1) times as often as x=0. In other words, the first, direct effect of the information asymmetry is that some offers are reinforced more often than other offers, although they are actually tried the same number of times.

What does this mean for the final propensities to choose these offers? Formally, the total reinforcement for offer x is r(x) = n (Π−x) (x+1). Taking first differences gives r(x+1)−r(x) = n (Π−2x−2). Hence, ∆r(x)<0 if x>(Π−2)/2. In other words, applying both actual and virtual reinforcement, the most reinforced offer x will be slightly below the 50-50 offer. The example in figure 1, for n=1 and Π=6, illustrates this. The graph shows the difference between the two forms of learning. In one case only actual reinforcement takes place, while the other case allows for virtual reinforcement as well. In the first case the most reinforced offer is 0, while in the second case the most reinforced offers are 2 and 3.[6]

[6] Such offers are very common in laboratory experiments with a cake of size 6.

Figure 1: outcomes of the two types of reinforcement process (reinforcement plotted against the offer, for "only actual" versus "actual + virtual" reinforcement).
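As a quick numerical check (ours), the closed forms above reproduce the two series plotted in figure 1:

```python
pie, n = 6, 1
only_actual = {x: n * (pie - x) for x in range(pie + 1)}
actual_plus_virtual = {x: n * (x + 1) * (pie - x) for x in range(pie + 1)}
print(only_actual)          # {0: 6, 1: 5, 2: 4, 3: 3, 4: 2, 5: 1, 6: 0}      -> maximum at x = 0
print(actual_plus_virtual)  # {0: 6, 1: 10, 2: 12, 3: 12, 4: 10, 5: 6, 6: 0}  -> maxima at x = 2 and 3
```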
Before we relax these assumptions in the next section, let us stress that the fact that offers slightly below half the cake size are most reinforced has nothing to do with fairness or reciprocity. Player A is assumed to be so boundedly rational that such concepts form no part of her cognitive capabilities. It also has nothing to do with best responses to player B's behavior: in our exposition thus far player B accepts all offers made to him. The fact that offers slightly below half the cake size are most reinforced is due here entirely to the asymmetry in the information player A deduces from player B's acceptances.

4. Discussion of the Assumptions

As said above, the simplifying assumptions made thus far were made in order to highlight the information asymmetry effect. We will now discuss these assumptions, and show how they can be relaxed without canceling this effect.

ad i) There are some papers in the literature in which the cake size Π is variable and not known with certainty (see e.g., Mitzkewitz & Nagel [1993]). We can allow for such a relaxation. It will not annul the effect of the asymmetry as such, but only change the numbers somewhat.

ad ii) Assuming that only integer offers up to Π can be made is without much loss of generality. As long as we assume there exists a minimum slice size, all that is involved is re-scaling that indivisible unit to 1.

ad iii) Assuming player B accepts all offers helps to distinguish the information asymmetry effect from the other explanations found in the literature concerning the empirical evidence. In general player B will not accept every offer, and will not even accept every offer with equal probability. But this does not deny the asymmetry effect. And if, as seems most plausible, lower offers are rejected more often than higher offers, it only reinforces that effect (away from the perfect equilibrium). Adding the possibility of refusals, one should notice that there is an analogous asymmetry effect with refusals: if offer x to player B has been refused, player A can reason that so would have been offers x''<x. However, the standard assumption is that in case of refusal the payoff, and hence the reinforcement, is zero. Hence, this does not influence our result. We will come back to this point of reinforcement after refusals below.

ad iv) Player A will not try every strategy equally often, but will adapt towards the more reinforced ones. This relaxation will only reinforce the asymmetry effect: making small offers to player B will get reinforced ever less often.

ad v) Reinforcement to player A is not necessarily linear in the payoff in cake. It seems a harmless assumption. Changing it would change the numbers somewhat, but not cancel the asymmetry effect. If we want to take negative reinforcement in case of refusals into account as well, it only reinforces our argument, as it can be shown that the information asymmetry in case of rejections works in the same direction as the one discussed for the case of acceptance.

ad vi) Suppose we still do not update offers x''<x ... offers x'>x are presumed to be more likely accepted than offers x''<x ...