## Risk-Neutral Control

The problem of finding a policy that maximises the agent’s expected return is called the risk-neutral control problem, as it is insensitive to the deviations of returns from their mean. We have already encountered risk-neutral control when we introduced the Q-learning algorithm in Section 3.7. We begin this chapter by providing a theoretical justification for this algorithm.

Problem 7.1 (Risk-neutral control). Given an $\operatorname{MDP}\left(\mathcal{X}, \mathcal{A}, \xi_0, P_{\mathcal{X}}, P_{\mathcal{R}}\right)$ and discount factor $\gamma \in[0,1)$, find a policy $\pi$ maximising the objective function
$$J(\pi)=\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t\right] .$$
A solution $\pi^*$ that maximises $J$ is called an optimal policy.

Implicit in the definition of risk-neutral control and our definition of a policy in Chapter 2 is the fact that the objective $J$ is maximised by a policy that depends only on the current state, that is, one that takes the form
$$\pi: \mathcal{X} \rightarrow \mathscr{P}(\mathcal{A}) .$$
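
To make the objective of Problem 7.1 concrete, the following sketch evaluates $J(\pi)$ for a stationary Markov policy on a small tabular MDP by repeatedly applying the Bellman equation for $V^\pi$ and then averaging over the initial distribution $\xi_0$. The two-state MDP, the function names `evaluate_policy` and `objective`, and the use of expected rewards `r[x][a]` in place of the reward distribution $P_{\mathcal{R}}$ are illustrative assumptions rather than anything taken from the text. Expected rewards suffice here because the risk-neutral objective depends on the rewards only through their mean.

```python
def evaluate_policy(P, r, pi, gamma, tol=1e-10):
    """Iterative policy evaluation: returns V^pi as a list indexed by state."""
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        V_new = []
        for x in range(n_states):
            v = 0.0
            for a, prob_a in enumerate(pi[x]):
                # Expected value of the next state under action a.
                expected_next = sum(p * V[y] for y, p in enumerate(P[x][a]))
                v += prob_a * (r[x][a] + gamma * expected_next)
            V_new.append(v)
        # Stop once successive iterates agree up to the tolerance.
        if max(abs(u - v) for u, v in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def objective(P, r, pi, xi0, gamma):
    """J(pi): expected discounted return when X_0 is drawn from xi0."""
    V = evaluate_policy(P, r, pi, gamma)
    return sum(w * v for w, v in zip(xi0, V))


# Hypothetical two-state, two-action MDP (an assumption, not from the text).
# P[x][a][y]: probability of moving from state x to state y under action a.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
# r[x][a]: expected reward; only staying in state 1 is rewarded.
r = [[0.0, 0.0], [1.0, 0.0]]
xi0 = [0.5, 0.5]  # initial state distribution
gamma = 0.9

pi_good = [[0.0, 1.0], [1.0, 0.0]]     # go to state 1, then stay there
pi_uniform = [[0.5, 0.5], [0.5, 0.5]]  # choose actions uniformly at random

J_good = objective(P, r, pi_good, xi0, gamma)
J_uniform = objective(P, r, pi_uniform, xi0, gamma)
print(J_good, J_uniform)
```

On this toy problem the first policy attains an objective value of approximately 9.5 and the uniform policy approximately 2.5, which illustrates how $J$ ranks policies by their mean return alone.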

## Value Iteration and Q-Learning

The main consequence of Proposition 7.2 is that when optimising the risk-neutral objective we can restrict our attention to deterministic stationary Markov policies. In turn, this makes it possible to find an optimal policy $\pi^*$ by computing the optimal state-action value function $Q^*$, defined as
$$Q^*(x, a)=\sup_{\pi \in \pi_{\mathrm{MS}}} \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid X=x, A=a\right] .$$
Just as the value function $V^\pi$ for a given policy $\pi$ satisfies the Bellman equation, $Q^*$ satisfies the Bellman optimality equation:
$$Q^*(x, a)=\mathbb{E}\left[R+\gamma \max_{a^{\prime} \in \mathcal{A}} Q^*\left(X^{\prime}, a^{\prime}\right) \mid X=x, A=a\right] .$$
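
The Bellman optimality equation suggests a simple computational procedure: start from an arbitrary table of state-action values and repeatedly replace it by the right-hand side of the equation until it stops changing. The sketch below does this for a tabular MDP with known transition probabilities. The function name `q_value_iteration`, the stopping tolerance, the two-state MDP and the use of expected rewards `r[x][a]` are assumptions made for illustration only.

```python
def q_value_iteration(P, r, gamma, tol=1e-10):
    """Approximate Q* by iterating the Bellman optimality equation."""
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    while True:
        # One synchronous sweep: every (x, a) entry is updated from the old table.
        Q_new = [
            [
                r[x][a] + gamma * sum(p * max(Q[y]) for y, p in enumerate(P[x][a]))
                for a in range(n_actions)
            ]
            for x in range(n_states)
        ]
        delta = max(
            abs(Q_new[x][a] - Q[x][a])
            for x in range(n_states)
            for a in range(n_actions)
        )
        Q = Q_new
        if delta < tol:
            return Q


# Hypothetical two-state, two-action MDP (an assumption, not from the text).
# P[x][a][y]: probability of moving from state x to state y under action a.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
# r[x][a]: expected reward; only staying in state 1 is rewarded.
r = [[0.0, 0.0], [1.0, 0.0]]
gamma = 0.9

Q_star = q_value_iteration(P, r, gamma)
for x, row in enumerate(Q_star):
    print(x, [round(q, 4) for q in row])
```

For this toy MDP the iteration settles near $Q^*(0, \cdot) \approx (8.1, 9)$ and $Q^*(1, \cdot) \approx (10, 8.1)$. Because $\gamma < 1$, each sweep shrinks the largest change between successive tables by at least a factor of $\gamma$, so the loop terminates.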
The optimal state-action value function describes the expected return obtained by acting so as to maximise the risk-neutral objective when beginning from the state-action pair $(x, a)$. Intuitively, we may understand Equation 7.3 as describing this maximising behaviour recursively. While there might be multiple optimal policies, they must (by definition) achieve the same objective value in Problem 7.1. This value is
$$\mathbb{E}_\pi\left[V^*\left(X_0\right)\right],$$
where $V^*$ is the optimal value function:
$$V^*(x)=\max_{a \in \mathcal{A}} Q^*(x, a) .$$
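
Once a table of optimal state-action values is available, the optimal value function, a deterministic greedy policy and the optimal objective value can all be read off directly. The sketch below assumes a hypothetical table `Q_star` for a two-state, two-action problem and a hypothetical initial distribution `xi0`; the helper names `optimal_values` and `greedy_policy` are likewise illustrative.

```python
def optimal_values(Q):
    """V*(x) = max_a Q*(x, a) for every state x."""
    return [max(row) for row in Q]


def greedy_policy(Q):
    """Deterministic policy choosing, in each state, the first maximising action."""
    return [row.index(max(row)) for row in Q]


# Hypothetical optimal state-action values (assumed to be already computed).
Q_star = [
    [8.1, 9.0],   # state 0: action 1 is better
    [10.0, 8.1],  # state 1: action 0 is better
]
xi0 = [0.5, 0.5]  # hypothetical initial state distribution

V_star = optimal_values(Q_star)
policy = greedy_policy(Q_star)
# Optimal objective value: expectation of V*(X_0) with X_0 drawn from xi0.
J_star = sum(w * v for w, v in zip(xi0, V_star))
print(V_star, policy, J_star)
```

Ties between maximising actions are broken here in favour of the lowest action index. Any other tie-breaking rule yields a different greedy policy with the same value, which is consistent with the remark that several optimal policies may exist.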

