## How Is Distributional Reinforcement Learning Different?

In reinforcement learning, the value function describes the expected return that one would counterfactually obtain from beginning in any given state. It is reasonable to say that its fundamental object of interest – the expected return – is a scalar, and that algorithms that operate on value functions operate on collections of scalars (one per state). On the other hand, the fundamental object of distributional reinforcement learning is a probability distribution over returns: the return distribution. The return distribution characterises the probability of different returns that can be obtained as an agent interacts with its environment from a given state. Distributional reinforcement learning algorithms operate on collections of probability distributions that we call return-distribution functions (or simply return functions).

More than a simple type substitution, going from scalars to probability distributions results in changes across the spectrum of reinforcement learning topics.

In distributional reinforcement learning, equations relating scalars become equations relating random variables. For example, the Bellman equation states that the expected return at a state $x$, denoted $V^\pi(x)$, equals the expectation of the immediate reward $R$ plus the discounted expected return at the next state $X^{\prime}$:
$$V^\pi(x)=\mathbb{E}_\pi\left[R+\gamma V^\pi\left(X^{\prime}\right) \mid X=x\right].$$
Here $\pi$ is the agent’s policy – a description of how it chooses actions in different states. By contrast, the distributional Bellman equation states that the random return at a state $x$, denoted $G^\pi(x)$, is itself related to the random immediate reward and the random next-state return according to a distributional equation:
$$G^\pi(x) \stackrel{\mathcal{D}}{=} R+\gamma G^\pi\left(X^{\prime}\right), \quad X=x .$$
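
To make the contrast concrete, the following is a minimal sketch in Python, not an implementation taken from the book. It assumes a hypothetical two-state Markov reward process (states `"a"` and `"b"`, with the policy already folded into the transition table), a discount factor of 0.5, and an acyclic structure so that plain recursion terminates. The names `TRANSITIONS`, `expected_return`, and `return_distribution` are illustrative only. The first function applies the Bellman equation and yields one scalar per state; the second applies the distributional Bellman equation and yields a full probability distribution over returns.

```python
from collections import defaultdict

GAMMA = 0.5  # discount factor (assumed for this example)

# Hypothetical Markov reward process. Each state maps to a list of
# (probability, reward, next_state) outcomes; next_state None means termination.
TRANSITIONS = {
    "a": [(0.5, 0.0, "b"), (0.5, 1.0, "b")],
    "b": [(0.5, 0.0, None), (0.5, 2.0, None)],
}


def expected_return(state):
    """Bellman equation: V(x) = E[R + gamma * V(X') | X = x]."""
    if state is None:  # no further reward after termination
        return 0.0
    return sum(
        p * (r + GAMMA * expected_return(nxt))
        for p, r, nxt in TRANSITIONS[state]
    )


def return_distribution(state):
    """Distributional Bellman equation: G(x) =_D R + gamma * G(X').

    Returns a dict mapping each possible return to its probability.
    """
    if state is None:  # the return after termination is 0 with certainty
        return {0.0: 1.0}
    dist = defaultdict(float)
    for p, r, nxt in TRANSITIONS[state]:
        # Shift and scale the next-state return distribution, then mix
        # over the possible (reward, next state) outcomes.
        for g, q in return_distribution(nxt).items():
            dist[r + GAMMA * g] += p * q
    return dict(dist)


def mean(dist):
    """Expectation of a discrete distribution given as {value: probability}."""
    return sum(g * q for g, q in dist.items())


value_a = expected_return("a")    # a single scalar
eta_a = return_distribution("a")  # a distribution over returns
print(value_a, eta_a, mean(eta_a))
```

In this toy example the mean of the return distribution at `"a"` coincides with the value computed by the classical recursion, but the distribution also records how the probability mass is spread across the individual returns, which the scalar value cannot express.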

## Intended Audience and Organisation

This book is intended for advanced undergraduates, graduate students, and researchers who have some exposure to reinforcement learning and are interested in understanding its distributional counterpart. We present core ideas from classical reinforcement learning as they are needed to contextualise distributional topics, but often omit longer discussions and presentations of specialised methods in order to keep the exposition concise. Readers wishing for a more in-depth review of classical reinforcement learning are invited to consult one of the literature’s many excellent books on the topic, including Bertsekas and Tsitsiklis [1996], Szepesvári [2010], Bertsekas [2012], Puterman [2014], Sutton and Barto [2018], and Meyn [2022].

An exhaustive treatment of distributional reinforcement learning would already require a substantially larger book. Instead, here we emphasise key concepts and challenges of working with return distributions, in a mathematical language that aims to be both technically correct and easily applied. Our choice of topics is driven by practical considerations (such as scalability in terms of available computational resources), a topic’s relative maturity, and our own domains of expertise. In particular, this book contains only one chapter about what is commonly called the control problem, and focuses on dynamic programming and temporal-difference algorithms rather than Monte Carlo methods. Where appropriate, the bibliographical remarks provide references on these omitted topics. In general, we chose to include proofs when they pertain to major results in the chapter or are instructive in their own right. We defer the proofs of a number of smaller results to exercises.
