# Models of learning in dynamic situations


If we retain the hypothesis of a repeated decision problem but move from a static decision problem to a dynamic one, two types of learning models can be envisaged.

Firstly, we can continue to apply the above models of learning while adapting them to a dynamic context. One possibility is to translate the decision problem, expressed in extensive form, into normal form by introducing strategies for the decision-maker, and then to apply the above methods to those strategies. Thus, the CPR model is applicable to the decision-maker’s strategies when their performances can be observed. Another possibility is to keep the decision problem in extensive form but to apply the above methods at each node of the decision tree. Hence, the CPR model is applicable by considering that, for each successive occurrence of the decision process, the utility obtained by the decision-maker is attributed simultaneously to all the actions appearing in the trajectory followed in the decision tree.

Secondly, we can draw directly on the classical rules of choice proposed for dynamic decision situations. This is all the more necessary as these choice rules, based on the backward induction procedure, require high capacities for the processing of information (Sutton and Barto, 1998).

A model of learning proposed early on in Artificial Intelligence is the “Q-learning model” (Watkins, 1989), which applies to a stochastic decision process. As a reinforcement model, it does not presuppose a priori knowledge of the characteristics of the decision process (transition probabilities and utilities), although such knowledge helps to accelerate the process. The model revises the “expected local utilities” $U_h^i$ each time the decision-maker uses the action $i$ in the configuration $h$ (which he does for the $n_h^i$-th time) and finds himself in the configuration $k$, obtaining the utility $u_{hk}^i$. The revision rule is adapted from the Bellman equation and is written:
$$\Delta U_h^i=a\left(n_h^i\right)\left[\delta U_k+u_{hk}^i-U_h^i\right]$$
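where $a\left(n_h^i\right)$ is a decreasing averaging function (often $a(n)=1/n$), $\delta$ is the discount factor, and $U_k$ is the value attached to the configuration $k$ (in standard Q-learning, the largest $U_k^j$ over the actions $j$ available in $k$).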
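
The revision rule can be illustrated with a minimal Python sketch. It is not taken from the source: the function name `revise`, the dictionary layout, the default discount factor and the choice $U_k=\max_j U_k^j$ (the standard Q-learning convention) are all assumptions made for the example.

```python
from collections import defaultdict


def revise(U, counts, h, i, k, u_hk, actions, delta=0.9):
    """Revise U[(h, i)] after using action i in configuration h and reaching k.

    Hypothetical helper: U maps (configuration, action) to an expected
    local utility, counts maps (configuration, action) to n_h^i.
    """
    counts[(h, i)] += 1
    a = 1.0 / counts[(h, i)]  # averaging function a(n) = 1/n
    # Assumption: U_k is the best local utility available in configuration k.
    U_k = max(U[(k, j)] for j in actions)
    U[(h, i)] += a * (delta * U_k + u_hk - U[(h, i)])
    return U[(h, i)]


U = defaultdict(float)     # all local utilities start at 0
counts = defaultdict(int)  # no action has been used yet
actions = ["left", "right"]

first = revise(U, counts, "h0", "left", "h1", 1.0, actions)
second = revise(U, counts, "h0", "left", "h1", 0.0, actions)
```

With $a(n)=1/n$ the first revision replaces the initial value entirely, and later revisions move $U_h^i$ by ever smaller steps, so the local utility behaves like a running average of the observed targets.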

## Associated models

Local strategies, which associate an action $i$ with each configuration $h$, can be generalised in the form of “rules” or “classifiers” (Holland, 1987). In this case, a rule associates an action $Y_i$ (possibly multidimensional) with a set of configurations $X_h$ following the principle: “if condition $X_h$, then action $Y_i$”. The condition of the rule groups together the configurations between which the decision-maker makes no distinction, either because of an error of perception on his part or because the action involved does not require any distinction to be made. It can be considered as an operation of categorisation performed by the decision-maker and therefore expresses the degree of granularity with which he apprehends his environment in relation to the action. A rule is activated if one of the configurations in its condition actually occurs. Of course, several rules may be activated in the same configuration, in which case they find themselves in competition. Moreover, certain rules will be used in a chain to obtain a certain result.
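
As a rough illustration of how a condition groups configurations and how several rules can compete, here is a small Python sketch. The `Rule` class, the `activated` function and the weather configurations are hypothetical names invented for the example, not part of Holland's formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    condition: frozenset  # configurations X_h that the rule does not distinguish
    action: str           # action Y_i prescribed when the condition holds


def activated(rules, configuration):
    """Return the rules whose condition contains the observed configuration."""
    return [rule for rule in rules if configuration in rule.condition]


rules = [
    Rule(frozenset({"rain", "drizzle"}), "take umbrella"),
    Rule(frozenset({"rain", "snow"}), "stay home"),
    Rule(frozenset({"sun"}), "walk"),
]

# Two rules share the configuration "rain", so they are in competition.
competing = activated(rules, "rain")
```

A coarser condition (a larger set) corresponds to a coarser categorisation of the environment by the decision-maker.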

To each rule is attributed a utility or “strength” $U_h^i$, which evolves over time according to an algorithm close to Q-learning, the “bucket brigade” algorithm. In each configuration $h$, the admissible rules make “bids” $\mu U_h^i$ and one of them is chosen with a probability that depends on its bid:
$$p_h^i \propto e^{\mu U_h^i}$$
This rule loses its bid, but receives a reward from two sources:

• from the external environment (if the rule acts on the external environment through the action $i$, thereby providing a utility $u_h^i$):
$$\Delta U_h^i=u_h^i-\mu U_h^i$$
• from the internal environment (if the rule acts on the internal environment by causing a transition to the state $k$, thus triggering a new rule whose action is $j$ and from which it receives the bid):
$$\Delta U_h^i=\mu U_k^j-\mu U_h^i$$
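
The bidding step and the two reward formulas above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `choose`, `external_update` and `internal_update`, and the numerical values, are hypothetical and not drawn from the source.

```python
import math
import random


def choose(strengths, mu, rng):
    """Pick a rule index with probability proportional to exp(mu * U)."""
    weights = [math.exp(mu * U) for U in strengths]
    return rng.choices(range(len(strengths)), weights=weights, k=1)[0]


def external_update(U, u, mu):
    """The rule pays its bid mu * U and receives the external utility u."""
    return U + (u - mu * U)


def internal_update(U, U_next, mu):
    """The rule pays its bid mu * U and receives the bid mu * U_next
    of the rule it triggers."""
    return U + (mu * U_next - mu * U)


rng = random.Random(0)  # seeded generator so the run is reproducible
winner = choose([0.0, 50.0], mu=1.0, rng=rng)

after_external = external_update(10.0, 3.0, mu=0.5)
after_internal = internal_update(10.0, 20.0, mu=0.5)
```

In this sketch a rule that triggers a stronger rule gains strength, which is how rewards obtained at the end of a chain can propagate back to the rules that prepared them.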

