Sufficient Conditions for Contractivity

In the remainder of this chapter, we characterise in greater generality the behaviour of the sequence of return function estimates described by Equation 4.14, viewed through the lens of different probability metrics. We begin with a formal definition of what it means for a function $d$ to be a probability metric.
Definition 4.21. A probability metric is an extended metric on the space of probability distributions, written
$$d: \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \rightarrow[0, \infty] .$$

Its supremum extension is the function $\bar{d}: \mathscr{P}(\mathbb{R})^{\mathcal{X}} \times \mathscr{P}(\mathbb{R})^{\mathcal{X}} \rightarrow[0, \infty]$ defined as
$$\bar{d}\left(\eta, \eta^{\prime}\right)=\sup _{x \in \mathcal{X}} d\left(\eta(x), \eta^{\prime}(x)\right) .$$
We refer to $\bar{d}$ as a return-function metric; it is an extended metric on $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$.
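The supremum extension can be sketched in code for a finite state space. The example below is only an illustration under stated assumptions: each return distribution is represented by an equal-size list of samples (a uniform empirical distribution), $d$ is taken to be the 1-Wasserstein distance, and the names `wasserstein_1` and `supremum_extension` are hypothetical helpers, not part of the text.

```python
def wasserstein_1(samples_a, samples_b):
    """1-Wasserstein distance between two equal-size empirical distributions.

    Assumption: both arguments are lists of samples of the same length, each
    sample carrying equal probability. In that case an optimal coupling pairs
    the sorted samples, so the distance is the mean absolute difference of
    the order statistics.
    """
    a = sorted(float(v) for v in samples_a)
    b = sorted(float(v) for v in samples_b)
    if len(a) != len(b) or not a:
        raise ValueError("this sketch assumes equally many (and at least one) samples")
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def supremum_extension(d, eta, eta_prime):
    """Largest d-distance over states, for return functions stored as dicts.

    With a finite state space the supremum is a maximum.
    """
    if eta.keys() != eta_prime.keys():
        raise ValueError("both return functions must be defined on the same states")
    return max(d(eta[x], eta_prime[x]) for x in eta)


# Two hypothetical return-function estimates over states "x1" and "x2".
eta = {"x1": [0.0, 1.0, 2.0], "x2": [5.0, 5.0, 5.0]}
eta_prime = {"x1": [0.0, 1.0, 3.0], "x2": [4.0, 6.0, 8.0]}

dist = supremum_extension(wasserstein_1, eta, eta_prime)
print(dist)  # the distance at "x2" (5/3) dominates the one at "x1" (1/3)
```

Because `supremum_extension` takes the metric as an argument, any other probability metric with the same calling convention could be substituted for `wasserstein_1`.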
Our analysis is based on three properties that a probability metric should possess in order to guarantee contractivity. These three properties relate closely to the three fundamental operations that make up the distributional Bellman operator: scaling, convolution, and mixture of distributions (equivalently: scaling, addition, and indexing of random variables). In this analysis, we will find that some properties are more easily stated in terms of random variables, others in terms of probability distributions. Accordingly, given two probability distributions $\nu, \nu^{\prime}$ with instantiations $Z, Z^{\prime}$, let us overload notation and write
$$d\left(Z, Z^{\prime}\right)=d\left(\nu, \nu^{\prime}\right) .$$
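The three operations named above (scaling, convolution, and mixture) can be made concrete for finitely supported distributions. The following is a rough sketch, not the operator as defined in the text: it assumes distributions are stored as `{value: probability}` dicts, that the reward is independent of the next state, and that a single state has two equally likely successors. All function and variable names are hypothetical.

```python
from collections import defaultdict


def scale(nu, gamma):
    """Distribution of gamma * Z when Z is distributed according to nu."""
    out = defaultdict(float)
    for z, p in nu.items():
        out[gamma * z] += p
    return dict(out)


def convolve(nu_r, nu_z):
    """Distribution of R + Z for independent R ~ nu_r and Z ~ nu_z."""
    out = defaultdict(float)
    for r, p in nu_r.items():
        for z, q in nu_z.items():
            out[r + z] += p * q
    return dict(out)


def mix(weights, nus):
    """Mixture distribution with the given weights (assumed to sum to 1)."""
    out = defaultdict(float)
    for w, nu in zip(weights, nus):
        for z, p in nu.items():
            out[z] += w * p
    return dict(out)


# Hypothetical one-step backup at a single state.
gamma = 0.5
reward = {1.0: 1.0}             # deterministic reward of 1
next_a = {0.0: 1.0}             # return distribution at successor a
next_b = {2.0: 0.5, 4.0: 0.5}   # return distribution at successor b

# Scale each successor's distribution, add the reward, then mix over successors.
backed_up = mix(
    [0.5, 0.5],
    [convolve(reward, scale(nu, gamma)) for nu in (next_a, next_b)],
)
print(backed_up)
```

In random-variable terms, the same computation reads as $R + \gamma Z(X^{\prime})$: scaling multiplies $Z$ by $\gamma$, addition contributes $R$, and indexing by the random successor $X^{\prime}$ produces the mixture.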

A Matter of Domain

Suppose that we have demonstrated, by means of Theorem 4.25, that the distributional Bellman operator is a contraction mapping in the supremum extension of some probability metric $d$. Is this sufficient to guarantee that the sequence
$$\eta_{k+1}=\mathcal{T}^\pi \eta_k$$
converges to the return function $\eta^\pi$, by means of Proposition 4.7? In general, no, because $d$ may assign infinite distances to certain pairs of distributions. To invoke Proposition 4.7, we identify a subset of probability distributions $\mathscr{P}_d(\mathbb{R})$ that are all within finite $d$-distance of each other and then ensure that the distributional Bellman operator is well-behaved on this subset. Specifically, we identify a set of conditions under which
(a) The distributional Bellman operator $\mathcal{T}^\pi$ maps $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$ to itself, and
(b) The return function $\eta^\pi$ (the fixed point of $\mathcal{T}^\pi$ ) lies in $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$.
For most common probability metrics and natural problem settings, these requirements are easily verified. In Proposition 4.16, for example, we demonstrated that when the reward distributions are bounded, Proposition 4.7 can be applied with the Wasserstein distances. The aim of this section is to extend the analysis not only to a broader set of probability metrics but also to a greater number of problem settings, including those where the reward distributions are not bounded.
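To see how an infinite distance can arise, consider the following worked example, which is an illustration chosen here rather than one taken from the text. Take $d$ to be the 1-Wasserstein distance, written $w_1$, let $\delta_0$ be the Dirac distribution at $0$, and let $\nu$ be the distribution with density $z^{-2}$ on $[1, \infty)$. The only coupling of $\delta_0$ with $\nu$ pairs the point $0$ with a draw $Z \sim \nu$, so
$$w_1\left(\delta_0, \nu\right)=\mathbb{E}[|Z|]=\int_1^{\infty} z \cdot z^{-2} \, \mathrm{d} z=\int_1^{\infty} \frac{1}{z} \, \mathrm{d} z=\infty .$$
A return-function estimate $\eta$ with $\eta(x)=\delta_0$ at some state $x$ is therefore at infinite $\bar{w}_1$-distance from any $\eta^{\prime}$ with $\eta^{\prime}(x)=\nu$, and a contraction inequality of the form $\bar{w}_1\left(\mathcal{T}^\pi \eta, \mathcal{T}^\pi \eta^{\prime}\right) \leq \gamma \bar{w}_1\left(\eta, \eta^{\prime}\right)$ gives no information about such a pair. Restricting attention to distributions with finite first moment removes this case, since any two such distributions are at finite $w_1$-distance from each other.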

