## Graphs

Section 1.2.1 defined a learner as the smallest ‘module’, whereas Sect. 1.2.2 provided a mechanism for organizing learners into teams without specifying in advance how many learners should appear in a team. Different teams might excel at defining policy for different subsets of the state-action sequence. Typically, it is assumed that crossover provides a sufficient mechanism for recombining the properties of different teams. The underlying premise is that learners merged through crossover continue to identify unique conditions under which they out-bid other learners. Unfortunately, there is no guarantee that this will be the case. TPG may avoid this problem by enabling a learner to instead reference a different team, thus devolving control to the referenced team under state $\mathbf{s}_t$.
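
The devolution of control described above can be sketched as a graph traversal. The following Python fragment is a minimal illustration, not the TPG implementation: the `Learner`, `Team` and `act` names are hypothetical, a learner's program is stood in for by an arbitrary `bid` callable, and skipping learners that point back to an already visited team is an assumption made here so that the traversal terminates.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Union


@dataclass
class Learner:
    # Stand-in for the learner's program: maps a state vector to a scalar bid.
    bid: Callable[[Sequence[float]], float]
    # Either an atomic action (an int here) or a reference to another Team.
    action: Union[int, "Team"]


@dataclass
class Team:
    learners: List[Learner] = field(default_factory=list)


def act(root: Team, state: Sequence[float]) -> int:
    """Follow winning bids from the root team until an atomic action is reached."""
    visited = set()
    team = root
    while True:
        visited.add(id(team))
        # Assumption: learners referencing an already visited team are ignored,
        # which prevents the traversal from cycling.
        candidates = [
            learner for learner in team.learners
            if not (isinstance(learner.action, Team) and id(learner.action) in visited)
        ]
        if not candidates:
            raise ValueError("team has no learner that can supply an action")
        # The learner with the highest bid under the current state wins.
        winner = max(candidates, key=lambda learner: learner.bid(state))
        if isinstance(winner.action, Team):
            team = winner.action  # devolve control to the referenced team
        else:
            return winner.action
```

In this sketch a team containing at least one learner with an atomic action always yields a decision, because that learner can never be filtered out of the candidate list.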

The key to this process is to provide two types of learner action mutation. At initialization, each learner is assigned an action from a discrete set of atomic actions, $a(i) \in A$, specific to the task (e.g. an enumeration of all joystick directions). Thereafter, an action mutation consists of the sequence of tests summarized by Algorithm 1.1. Step 1 determines whether to apply any form of mutation. When true, either an action from the set of atomic actions, $A$, is chosen (Step 5) or a pointer to another team, $T$, is established (Step 6). The significance of Step 4 is that it potentially forces a change in the type of action.
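
The two-way choice between an atomic action and a team pointer might look roughly as follows. This is a simplified sketch and does not reproduce the exact tests of Algorithm 1.1: the function name `mutate_action`, the tuple encoding of actions, and the probabilities `p_mutate` and `p_atomic` are all assumptions introduced for illustration.

```python
import random


def mutate_action(action, atomic_actions, teams,
                  p_mutate=0.1, p_atomic=0.5, rng=random):
    """Return a possibly mutated action.

    An action is encoded (hypothetically) as ("atomic", a) or ("team", t).
    """
    # First test: decide whether any mutation is applied at all.
    if rng.random() >= p_mutate:
        return action
    # Second test: pick the type of the new action. Because the draw ignores
    # the current type, an atomic action can become a team pointer and vice versa.
    if not teams or rng.random() < p_atomic:
        return ("atomic", rng.choice(atomic_actions))
    return ("team", rng.choice(teams))
```

Passing a seeded `random.Random` instance as `rng` makes the outcome reproducible, which is convenient when comparing runs.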

## Memory

The partially observable aspects of the ViZDoom task imply that support for memory is beneficial [22], even with respect to single ViZDoom source tasks [15]. For the purposes of this study, we will adopt the probabilistic indexed memory formulation previously benchmarked under ViZDoom and Dota 2 reinforcement learning environments $[15,21,22]$. In summary, only one instance of indexed memory is retained. This implies that a TPG agent inherits the state of indexed memory left by the previous agent. Indexed memory therefore represents a global internal model of state that is never reset. Registers, $R$, specific to a learner (Sect. 1.2.1) are considered to capture the internal state of each learner. With this in mind, the instruction set is augmented with a write ($\mathrm{write}(R)$) and a read ($R[i] = \mathrm{read}(k)$) operation. Write operations are probabilistic, distributing the content of a learner’s registers across the $L$ columns of indexed memory. The probability of performing a write is such that locations towards columns 1 and $L$ are less likely to be written to, so they act as long term memory. Conversely, locations near $\frac{L}{2}$ are the most likely to be written to, so they act as short term memory. Read operations specify a target register, $R[i]$, and an ‘address’, $k$, into indexed memory, i.e. $0 < k \leq L \times \mathrm{MaxReg}$. Further details of the probabilistic indexed memory model can be found in earlier work $[15,21,22]$.
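
To make the read and write semantics concrete, here is a minimal Python sketch of such a memory. It is an illustration under stated assumptions rather than the model from the cited work: the class name `IndexedMemory`, the triangular write-probability profile with parameters `p_min` and `p_max`, and the row-major mapping from address $k$ to a cell are all hypothetical choices.

```python
import random


class IndexedMemory:
    """A MaxReg x L grid of values shared by all agents and never reset."""

    def __init__(self, num_columns, max_reg, p_min=0.05, p_max=0.5, rng=None):
        self.num_columns = num_columns  # L in the text
        self.max_reg = max_reg          # MaxReg in the text
        self.p_min = p_min              # assumed write probability at columns 1 and L
        self.p_max = p_max              # assumed write probability at the centre
        self.rng = rng or random.Random()
        self.cells = [[0.0] * num_columns for _ in range(max_reg)]

    def write_probability(self, column):
        """Assumed triangular profile over columns 1..L, peaking at the centre."""
        centre = (self.num_columns + 1) / 2
        half_width = max(centre - 1, 1.0)
        distance = abs(column - centre) / half_width
        return self.p_max - (self.p_max - self.p_min) * distance

    def write(self, registers):
        """Copy the register contents into each column with that column's probability."""
        if len(registers) != self.max_reg:
            raise ValueError("expected one value per register")
        for column in range(1, self.num_columns + 1):
            if self.rng.random() < self.write_probability(column):
                for row, value in enumerate(registers):
                    self.cells[row][column - 1] = value

    def read(self, k):
        """Return the cell at address k, where 0 < k <= L * MaxReg."""
        if not 0 < k <= self.num_columns * self.max_reg:
            raise ValueError("address out of range")
        row, column = divmod(k - 1, self.num_columns)  # assumed row-major layout
        return self.cells[row][column]
```

A single instance would be created once and handed to every agent in turn, so that each agent starts from whatever content the previous one left behind. Edge columns are overwritten rarely and therefore retain older content, while central columns are refreshed often.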

