# Design Details of MARVEL

## DRL Framework

The discounted sum is widely adopted in RL because it accounts for both the overall performance of a strategy and its short-term profit. In a MARL model, the $i$-th agent maintains an individual policy $\pi^i: S \times A^i \rightarrow[0,1]$. Under this policy, the agent can use a value function to evaluate how good the policy is in a given state. Since the policy $\pi^i$ is in effect an action-selection function, its quality is usually expressed as the action-value function $Q_\pi^i\left(s, a^i\right)$, defined as follows:
$$Q_\pi^i\left(s, a^i\right)=\mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k}^i \mid s_t=s, a_t^i=a^i\right],$$
where $\mathbb{E}$ denotes the expectation and $k$ indexes the action steps. This quality function evaluates the value of the action that a policy takes in a given state. However, collecting future rewards from $k=0$ to $k=\infty$ is not feasible in online learning. To solve this problem, the quality function can also be expressed in an iterative form:
$$Q^i\left(s_t, a_t^i\right)=\mathbb{E}\left[r_t^i+\gamma Q^i\left(s_{t+1}, a_{t+1}^i\right)\right]$$
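The relationship between the discounted sum and its iterative form can be checked numerically. The following is a minimal sketch, not the MARVEL implementation: it compares the two forms on a finite reward sequence and then applies a one-sample update toward the iterative target. The tabular `q_table`, the learning rate `alpha`, and all function names are assumptions made for illustration. The framework described here uses deep RL, so a neural network would take the place of the table.

```python
from collections import defaultdict


def discounted_return(rewards, gamma):
    """Direct form: sum over k of gamma^k * r_{t+k} for a finite reward list."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def recursive_return(rewards, gamma):
    """Iterative form: G_t = r_t + gamma * G_{t+1}, evaluated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_update(q, state, action, reward, next_state, next_action,
              gamma=0.9, alpha=0.1):
    """Move Q(s_t, a_t) a step toward r_t + gamma * Q(s_{t+1}, a_{t+1}).

    This is a SARSA-style sample update (an assumption for illustration);
    it needs only one transition instead of the full infinite sum.
    """
    target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical table for one agent: unseen (state, action) pairs start at 0.
q_table = defaultdict(float)
```

Both return functions give the same value for the same inputs, which is why an agent can learn from single transitions rather than waiting for the whole reward sequence.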

## Training Phase

In this section, we discuss how the agents in a MARL model are trained and present the detailed interaction among them. The goal of the training process is to teach each controller whether to export or import switches without human intervention. The process is shown in Algorithm 2. Lines 1-8 calculate the utilization of the controllers and select the master controller: in lines 2-5, each controller calculates the resource utilization of all controllers in use, and lines 6-7 select the master controller. Lines 9-23 form the training loop. In each iteration, the master selects one controller as the actor controller (lines 10-13), and the actor controller then generates a switch migration action and broadcasts it to the other controllers (lines 14-21). Note that the actor is selected with a probability based on its resource utilization (line 11). As training continues, every controller is likely to be selected as the actor at some point, so each controller gets enough opportunities to be trained.
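The utilization-based actor selection can be sketched as follows. The text only states that the selection probability is based on resource utilization, so the exact rule here is an assumption: each controller is chosen with probability proportional to its utilization, with a uniform fallback when all utilizations are zero. The controller names, utilization values, and function names are hypothetical.

```python
import random


def selection_probabilities(utilization):
    """Normalize per-controller utilization into a probability distribution.

    Assumption: probability is proportional to utilization; the source
    does not give the exact formula.
    """
    total = sum(utilization.values())
    if total <= 0:
        # No load anywhere: fall back to a uniform choice.
        return {c: 1.0 / len(utilization) for c in utilization}
    return {c: u / total for c, u in utilization.items()}


def select_actor(utilization, rng=random):
    """Sample one controller to act in this training iteration."""
    probs = selection_probabilities(utilization)
    controllers = list(probs)
    weights = [probs[c] for c in controllers]
    return rng.choices(controllers, weights=weights, k=1)[0]


# Hypothetical utilization snapshot for three controllers.
utilization = {"c1": 0.8, "c2": 0.15, "c3": 0.05}
rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {c: 0 for c in utilization}
for _ in range(1000):
    counts[select_actor(utilization, rng)] += 1
```

Because every controller with nonzero utilization keeps a nonzero probability, repeated iterations give each of them a chance to act, while heavily loaded controllers are sampled more often under this assumed weighting.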

