Reinforcement Learning | CSE546 Tetris

Tetris

State: the current board, the current falling tile

Action: rotate or shift the falling tile

One-step reward: 1 if the current action clears a level, otherwise 0

Transition probability: the next tile is drawn uniformly at random

Discount factor: $\gamma=1$
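
As a rough illustration of the reward and discount settings above, the sketch below shows that with $\gamma=1$ the return of an episode is simply the number of levels cleared. The tile set `TILES` and the helper names `sample_next_tile`, `one_step_reward`, and `discounted_return` are assumptions made for this example and are not part of the original notes.

```python
import random

# Assumption: the seven standard Tetris shapes; the notes do not list the tile set.
TILES = ["I", "O", "T", "S", "Z", "J", "L"]


def sample_next_tile(rng):
    """Draw the next tile uniformly at random (the transition noise of the MDP)."""
    return rng.choice(TILES)


def one_step_reward(level_cleared):
    """Reward 1 if the current action cleared a level, otherwise 0."""
    return 1 if level_cleared else 0


def discounted_return(rewards, gamma=1.0):
    """Compute r_1 + gamma * r_2 + gamma^2 * r_3 + ... by a backward sweep."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: levels cleared on the 2nd and 4th actions.
episode_rewards = [one_step_reward(c) for c in (False, True, False, True)]
total = discounted_return(episode_rewards, gamma=1.0)
```

With $\gamma=1$, `total` equals the count of cleared levels; a smaller `gamma` would weight earlier clears more heavily.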

Maze

Rewards: -1 per time-step

Actions: N, E, S, W

States: Agent’s location
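
The maze description can be sketched as a tiny transition function. This is only a minimal sketch: the grid encoding (`"#"` for walls, `"."` for free cells), the rule that an agent bumping into a wall or the boundary stays in place, and the names `MOVES` and `step` are all assumptions, since the notes specify only the rewards, actions, and states.

```python
# Row/column offsets for the four actions N, E, S, W.
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def step(grid, state, action):
    """Return (next_state, reward) for one move in the maze.

    grid   : list of equal-length strings, "#" = wall, "." = free (assumed encoding)
    state  : (row, col) location of the agent
    action : one of "N", "E", "S", "W"
    """
    row, col = state
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < len(grid) and 0 <= new_col < len(grid[0])
    if inside and grid[new_row][new_col] != "#":
        state = (new_row, new_col)  # legal move: the agent's location changes
    # Assumption: an illegal move leaves the agent where it is.
    return state, -1  # reward is -1 per time-step regardless of the move


grid = ["..",
        "#."]
moved = step(grid, (0, 0), "E")
blocked = step(grid, (0, 0), "S")
```

Because every step costs -1, maximizing the return amounts to reaching the end of the episode in as few steps as possible.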

Markov Property
A state $s_t$ is Markov if and only if
$$P\left(s_{t+1} \mid s_t\right)=P\left(s_{t+1} \mid s_1, \ldots, s_t\right)$$

• The state captures all relevant information from the history.

• Once the state is known, the history may be thrown away.

• That is, the state is a sufficient statistic of the history for predicting the future.
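
One consequence, written out as a short worked equation: applying the chain rule of probability and then the Markov property to each factor shows that the probability of a whole state sequence reduces to a product of one-step transitions. The horizon $T$ is notation introduced here for illustration.

$$P\left(s_1, \ldots, s_T\right)=P\left(s_1\right) \prod_{t=1}^{T-1} P\left(s_{t+1} \mid s_1, \ldots, s_t\right)=P\left(s_1\right) \prod_{t=1}^{T-1} P\left(s_{t+1} \mid s_t\right)$$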

Policy

• Stochastic policy
$$\pi(a \mid s)=\mathbb{P}\left(a_t=a \mid s_t=s\right)$$
• Policy $\pi$ defines the behavior of an agent
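• For an MDP, the policy depends only on the current state (Markov property)
• Deterministic policy: in each state $s$ a single action $a$ is chosen with certainty, i.e. $\pi(a \mid s)=\mathbb{P}\left(a_t=a \mid s_t=s\right)=1$ for that action and $0$ for all others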
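
The two kinds of policy can be sketched with a plain lookup table. The dictionary representation (state to action to probability) and the names `sample_action` and `is_deterministic` are assumptions chosen for this example; the notes define a policy only as the conditional distribution $\pi(a \mid s)$.

```python
import random


def sample_action(policy, state, rng):
    """Sample a_t ~ pi(. | s_t) from a tabular policy."""
    actions = list(policy[state])
    probs = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=probs, k=1)[0]


def is_deterministic(policy):
    """True if every state puts probability 1 on a single action."""
    return all(max(dist.values()) == 1.0 for dist in policy.values())


# Hypothetical one-state examples using the maze actions.
stochastic_policy = {"s0": {"N": 0.5, "E": 0.5}}
deterministic_policy = {"s0": {"N": 1.0, "E": 0.0}}

rng = random.Random(0)
chosen = sample_action(deterministic_policy, "s0", rng)
```

A deterministic policy is the special case where each row of the table is one-hot, so sampling from it always returns the same action for a given state.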


