# CS Assignment Help | Reinforcement Learning Assignment and Exam Help | COMP5328 Weak Convergence of Return Functions

## Weak Convergence of Return Functions

Proposition 4.27 implies that if each distribution $\eta^\pi(x)$ lies in the finite domain $\mathscr{P}_d(\mathbb{R})$ of a given probability metric $d$ that is regular, $c$-homogeneous, and $p$-convex, then $\eta^\pi$ is the unique solution to the equation $$\eta=\mathcal{T}^\pi \eta$$ in the space $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$. It does not, however, rule out the existence of solutions outside this space. This concern can be addressed by showing that for any $\eta_0 \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, the sequence of probability distributions $\left(\eta_k(x)\right)_{k \geq 0}$ defined by
$$\eta_{k+1}=\mathcal{T}^\pi \eta_k$$
converges weakly to the return distribution $\eta^\pi(x)$, for each state $x \in \mathcal{X}$. In addition to giving an alternative perspective on the quantitative convergence results of these iterates, the uniqueness of $\eta^\pi$ as a solution to Equation 4.18 (stated as Proposition 4.9) follows immediately from Proposition 4.34 below.
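The iteration $\eta_{k+1}=\mathcal{T}^\pi \eta_k$ can be illustrated numerically. The sketch below is not taken from the text: it assumes a hypothetical two-state Markov reward process (transition table `P`, deterministic rewards `R`, discount `GAMMA`, all invented for illustration) and represents each $\eta_k(x)$ as a finite-support distribution. Two very different initial conditions are pushed through the operator, and the 1-Wasserstein distance between the resulting iterates is measured; under these assumptions it should shrink by a factor of about $\gamma$ per step.

```python
from collections import defaultdict

GAMMA = 0.5
# Hypothetical two-state Markov reward process (the policy is already folded in).
P = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}  # P[x][x'] = transition probability
R = {0: 0.0, 1: 1.0}                            # deterministic reward received in state x


def bellman(eta):
    """One application of the distributional Bellman operator.

    eta maps each state to a finite-support distribution {atom: probability}.
    """
    new_eta = {}
    for x in P:
        dist = defaultdict(float)
        for x_next, p in P[x].items():
            for z, q in eta[x_next].items():
                # Pushforward of the atom z through z -> r + gamma * z.
                dist[R[x] + GAMMA * z] += p * q
        new_eta[x] = dict(dist)
    return new_eta


def wasserstein1(mu, nu):
    """W1 distance between finite-support distributions: integral of |F_mu - F_nu|."""
    points = sorted(set(mu) | set(nu))
    total, f_mu, f_nu = 0.0, 0.0, 0.0
    for left, right in zip(points, points[1:]):
        f_mu += mu.get(left, 0.0)
        f_nu += nu.get(left, 0.0)
        total += abs(f_mu - f_nu) * (right - left)
    return total


def sup_distance(eta_a, eta_b):
    """Largest per-state W1 distance between two return-distribution functions."""
    return max(wasserstein1(eta_a[x], eta_b[x]) for x in P)


def iterate(eta0, steps):
    """Return eta_k obtained by applying the operator `steps` times to eta0."""
    eta = eta0
    for _ in range(steps):
        eta = bellman(eta)
    return eta


eta_a = {x: {0.0: 1.0} for x in P}    # every state starts as a point mass at 0
eta_b = {x: {100.0: 1.0} for x in P}  # every state starts as a point mass at 100
for k in (0, 5, 10):
    print(k, sup_distance(iterate(eta_a, k), iterate(eta_b, k)))
```

Because both sequences approach each other regardless of where they start, the example is consistent with the claim that the limit does not depend on $\eta_0$. It only covers finite-support initial conditions, so it is an illustration rather than a proof of the general statement.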

## Random Variable Bellman Operators

In this chapter, we defined the distributional Bellman operator $\mathcal{T}^\pi$ as a mapping on the space of return-distribution functions $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$. We also saw that the action of the operator on a return function $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ can be understood either through direct manipulation of the probability distributions or through manipulation of a collection of random variables instantiating these distributions.

Viewing the operator through its effect on the distribution of a collection of representative random variables is a useful tool for understanding distributional reinforcement learning, and may prompt the reader to ask whether it is possible to avoid referring to probability distributions at all, working instead directly with random variables. We describe one approach to this below using the tools of probability theory, and then discuss some of its shortcomings.

Let $G_0=\left(G_0(x): x \in \mathcal{X}\right)$ be an initial collection of real-valued random variables, indexed by state, supported on a probability space $\left(\Omega_0, \mathscr{F}_0, \mathbb{P}_0\right)$. For each $k \in \mathbb{N}^{+}$, let $\left(\Omega_k, \mathscr{F}_k, \mathbb{P}_k\right)$ be another probability space, supporting a collection of random variables $\left(\left(A_k(x), R_k(x, a), X_k^{\prime}(x, a)\right): x \in \mathcal{X}, a \in \mathcal{A}\right)$, with $A_k(x) \sim \pi(\cdot \mid x)$, and independently $R_k(x, a) \sim P_{\mathcal{R}}(\cdot \mid x, a)$, $X_k^{\prime}(x, a) \sim P_{\mathcal{X}}(\cdot \mid x, a)$. We then consider the product probability space on $\Omega=\prod_{k \in \mathbb{N}} \Omega_k$. All random variables defined above can naturally be viewed as functions on this joint probability space that depend on $\omega=\left(\omega_0, \omega_1, \omega_2, \ldots\right) \in \Omega$ only through the coordinate $\omega_k$ that matches the index $k$ on the random variable. Note that under this construction, all random variables with distinct indices are independent.

Now define $\mathscr{X}_{\mathbb{N}}$ as the set of real-valued random variables on $(\Omega, \mathscr{F}, \mathbb{P})$ (where $\mathscr{F}$ is the product $\sigma$-algebra) that depend on only finitely many coordinates of $\omega \in \Omega$. We can define a Bellman operator $\mathcal{T}^\pi: \mathscr{X}_{\mathbb{N}}^{\mathcal{X}} \rightarrow \mathscr{X}_{\mathbb{N}}^{\mathcal{X}}$ as follows. Given $G=(G(x): x \in \mathcal{X}) \in \mathscr{X}_{\mathbb{N}}^{\mathcal{X}}$, let $K \in \mathbb{N}$ be the smallest integer such that the random variables $(G(x): x \in \mathcal{X})$ depend on $\omega=\left(\omega_0, \omega_1, \omega_2, \ldots\right) \in \Omega$ only through $\omega_0, \ldots, \omega_{K-1}$; such an integer exists due to the definition of $\mathscr{X}_{\mathbb{N}}$ and the finiteness of $\mathcal{X}$. We then define $\mathcal{T}^\pi G \in \mathscr{X}_{\mathbb{N}}^{\mathcal{X}}$ by
$$\left(\mathcal{T}^\pi G\right)(x)=R_K\left(x, A_K(x)\right)+\gamma G\left(X_K^{\prime}\left(x, A_K(x)\right)\right).$$
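The construction above can be mimicked in code by treating a random variable as a deterministic function of a sampled $\omega=\left(\omega_0, \omega_1, \ldots\right)$. The following is a minimal sketch under assumptions that are not in the text: a hypothetical two-state, two-action problem with a uniform policy, Gaussian rewards, and uniform transitions. A collection $G$ is stored as a pair `(K, g)`, where `g(x, omega)` reads only `omega[0]` through `omega[K-1]`; for simplicity `K` is tracked as an upper bound on the coordinates used rather than computed as the smallest such integer.

```python
import random

GAMMA = 0.9
STATES = (0, 1)
ACTIONS = (0, 1)


def sample_omega0(rng):
    """omega_0: values of the initial returns G_0(x) (standard normal, an assumption)."""
    return {x: rng.gauss(0.0, 1.0) for x in STATES}


def sample_omega_k(rng):
    """omega_k for k >= 1: draws of A_k(x), R_k(x, a), X'_k(x, a) for all x and a."""
    A = {x: rng.choice(ACTIONS) for x in STATES}  # uniform policy (assumption)
    R = {(x, a): rng.gauss(float(x + a), 1.0) for x in STATES for a in ACTIONS}
    X = {(x, a): rng.choice(STATES) for x in STATES for a in ACTIONS}
    return (A, R, X)


def sample_omega(rng, depth):
    """Sample the first depth + 1 coordinates of a point omega in the product space."""
    return [sample_omega0(rng)] + [sample_omega_k(rng) for _ in range(depth)]


# G_0 depends on omega only through omega_0, so its index K is 1.
G0 = (1, lambda x, omega: omega[0][x])


def bellman(G):
    """Random-variable Bellman operator: consume the fresh coordinate omega_K."""
    K, g = G

    def tg(x, omega):
        A, R, X = omega[K]
        a = A[x]
        return R[(x, a)] + GAMMA * g(X[(x, a)], omega)

    return (K + 1, tg)


rng = random.Random(0)
omega = sample_omega(rng, depth=3)
G1 = bellman(G0)
G2 = bellman(G1)
print([G2[1](x, omega) for x in STATES])
```

Each application of `bellman` uses a coordinate of `omega` that no earlier step has touched, which is how the sketch reflects the independence of random variables with distinct indices.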


