# CS59300 Reinforcement Learning: Relationship With Distributional Dynamic Programming

## Relationship With Distributional Dynamic Programming

In Chapter 5 we introduced distributional dynamic programming (DDP) as a class of methods that operate over return-distribution functions. In fact, every statistical functional dynamic programming algorithm is also a DDP algorithm (but not the other way around; see Exercise 8.8). This relationship is established by considering the implied representation
$$\mathscr{F}=\left\{\iota(s): s \in I_\psi\right\} \subseteq \mathscr{P}(\mathbb{R})$$
and the projection $\Pi_{\mathscr{F}}=\iota \circ \psi$ (see Figure 8.3).

From this correspondence, we may establish the relationship between Bellman closedness and the notion of a diffusion-free projection developed in Chapter 5.

Proposition 8.17. Let $\psi$ be a Bellman-closed sketch. Then for any choice of exact imputation strategy $\iota: I_\psi \rightarrow \mathscr{P}_\psi(\mathbb{R})$, the projection operator $\Pi_{\mathscr{F}}=\iota \psi$ is diffusion-free.
$\triangle$
Proof. We may directly check the diffusion-free property (omitting parentheses for conciseness):
$$\Pi_{\mathscr{F}} \mathcal{T}^\pi \Pi_{\mathscr{F}}=\iota \psi \mathcal{T}^\pi \iota \psi \stackrel{(a)}{=} \iota \mathcal{T}_\psi^\pi \psi \iota \psi \stackrel{(b)}{=} \iota \mathcal{T}_\psi^\pi \psi \stackrel{(a)}{=} \iota \psi \mathcal{T}^\pi=\Pi_{\mathscr{F}} \mathcal{T}^\pi,$$
where steps marked (a) follow from the identity $\psi \mathcal{T}^\pi=\mathcal{T}_\psi^\pi \psi$, and (b) follows from the identity $\psi \iota \psi=\psi$ for any exact imputation strategy $\iota$ for $\psi$.
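
The identity in the proof can be checked numerically on a small example. The following is a minimal sketch, assuming the mean as the Bellman-closed sketch $\psi$ and a Dirac delta at the mean as the exact imputation strategy $\iota$; the two-state Markov reward process, its rewards, transition probabilities and discount factor are all hypothetical values chosen for illustration.

```python
# Hypothetical two-state Markov reward process (illustrative values only).
GAMMA = 0.9
REWARD = [1.0, 0.0]                 # deterministic reward received in each state
P = [[0.5, 0.5], [0.2, 0.8]]        # P[s][s_next]: transition probabilities

# A distribution is a list of (atom, probability) pairs.
# A return-distribution function is a list with one distribution per state.


def mean_sketch(dist):
    """psi: maps a distribution to its mean."""
    return sum(p * x for x, p in dist)


def dirac_imputation(m):
    """iota: maps a mean to the Dirac delta at that mean (psi(iota(m)) == m)."""
    return [(m, 1.0)]


def project(eta):
    """Pi_F = iota o psi, applied state by state."""
    return [dirac_imputation(mean_sketch(dist)) for dist in eta]


def bellman(eta):
    """Distributional Bellman operator T^pi for the process above."""
    result = []
    for s in range(len(eta)):
        mixture = []
        for s_next, p_trans in enumerate(P[s]):
            for x, p in eta[s_next]:
                # Shift and scale each atom, weight by the transition probability.
                mixture.append((REWARD[s] + GAMMA * x, p_trans * p))
        result.append(mixture)
    return result


eta = [[(0.0, 0.3), (2.0, 0.7)], [(-1.0, 0.5), (4.0, 0.5)]]

lhs = project(bellman(project(eta)))   # Pi_F T^pi Pi_F eta
rhs = project(bellman(eta))            # Pi_F T^pi eta

# Largest difference between the Dirac locations on the two sides.
max_gap = max(abs(l[0][0] - r[0][0]) for l, r in zip(lhs, rhs))
```

Here `max_gap` should be zero up to floating-point error, since projecting before applying the operator loses nothing that the final projection would have kept.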

## Expectile Dynamic Programming

Expectiles form a family of statistical functionals parametrised by a level $\tau \in(0,1)$. They extend the notion of the mean of a distribution ($\tau=0.5$), similar to how quantiles extend the notion of a median. Expectiles have classically found application in econometrics and finance as a form of risk measure (see the bibliographical remarks for further details). Based on the principles of statistical functional dynamic programming, expectile dynamic programming uses an approximate imputation strategy in order to iteratively estimate the expectiles of the return function.

Definition 8.18. For a given $\tau \in(0,1)$, the $\tau$-expectile of a distribution $\nu \in \mathscr{P}_2(\mathbb{R})$ is
$$\psi_\tau^{\mathrm{E}}(\nu)=\underset{z \in \mathbb{R}}{\arg \min }\, \mathrm{ER}_\tau(z ; \nu),$$
where
$$\mathrm{ER}_\tau(z ; \nu)=\underset{Z \sim \nu}{\mathbb{E}}\left[\left|\mathbb{1}_{\{Z<z\}}-\tau\right| \times(Z-z)^2\right]$$
is the expectile loss.

The loss appearing in Definition 8.18 is strongly convex [Boyd and Vandenberghe, 2004] and bounded below by 0. As a consequence, Equation 8.12 has a unique minimiser for a given $\tau$; this verifies that the corresponding expectile is uniquely defined.
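
To make the definition concrete, the following is a minimal sketch that computes the $\tau$-expectile of a discrete distribution, given as atoms and probabilities. It is an illustration rather than the procedure used by expectile dynamic programming: the function names are hypothetical, and bisection on the first-order condition of the expectile loss is one of several ways to find the unique minimiser.

```python
def expectile_loss(z, atoms, probs, tau):
    """Expectile loss ER_tau(z; nu) for a discrete distribution nu."""
    total = 0.0
    for x, p in zip(atoms, probs):
        weight = (1.0 - tau) if x < z else tau   # |1{x < z} - tau|
        total += p * weight * (x - z) ** 2
    return total


def expectile(atoms, probs, tau, num_iters=200):
    """tau-expectile of a discrete distribution, found by bisection."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie in (0, 1)")

    def weighted_residual(z):
        # Equal to -1/2 times the derivative of the loss with respect to z.
        # It is continuous and strictly decreasing in z, so its unique root
        # is the minimiser of the loss.
        return sum(
            p * ((1.0 - tau) if x < z else tau) * (x - z)
            for x, p in zip(atoms, probs)
        )

    # The root always lies between the smallest and largest atom.
    lo, hi = min(atoms), max(atoms)
    for _ in range(num_iters):
        mid = 0.5 * (lo + hi)
        if weighted_residual(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


atoms = [0.0, 1.0, 2.0]
probs = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]

z_mean = expectile(atoms, probs, tau=0.5)   # coincides with the mean
z_high = expectile(atoms, probs, tau=0.9)   # pulled towards the largest atom
```

With $\tau=0.5$ the weights on both sides of $z$ are equal and the minimiser is the mean of the distribution; larger values of $\tau$ penalise underestimation more heavily and move the expectile upwards.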

