# Markov Decision Processes

## Markov Decision Processes

This chapter considers probabilistic sequential decision problems with an infinite planning horizon. Problems with a stochastic state have already been discussed in Section 5.8. Examples are the investment problem (see Section 5.8.4) and dice games (see Sections 5.8.2 and 5.8.5). However, these were problems with a finite horizon. In this chapter, we assume that the horizon is infinite. The associated theory is called Markov decision theory, and the decision problem is called a Markov decision problem.

A Markov decision problem involves a system whose state $X_t$ at time $t \in T=\{0,1,2,\ldots\}$ is a random variable. The values $X_t$ can take form the state space $S$. We will always assume that $S$ is a countable set, so that we can speak of the state $i \in S=\{0,1,2,\ldots\}$. The process $\left\{X_t, t \geq 0\right\}$ is observed at every time $t \in T$; these moments are called decision epochs. If $X_t=i$, then a decision $a \in D(i)$ is made. The decision space or action space $D(i)$ is assumed to be finite.

If $X_t=i$ and a decision $a_t \in D(i)$ is made, then we receive an immediate reward $r_t(i, a)$, and the process makes a transition to a new state $X_{t+1}=j$ according to the transition probabilities
$$\begin{aligned} & \mathbb{P}\left(X_{t+1}=j \mid X_0=x_0, B_0=a_0 ; X_1=x_1, B_1=a_1 ; \ldots ; X_t=i, B_t=a_t\right) \\ & =\mathbb{P}\left(X_{t+1}=j \mid X_t=i, B_t=a_t\right), \end{aligned}$$

where $B_t$ is the decision taken at time $t$. This assumption means that the future of the decision process does not depend on its history before time $t$. Such a process is called a Markov decision process. We make the following assumptions:
(i) we receive an expected immediate reward $r(i, a)$;
(ii) a transition to state $X_{t+1}$ at time $t+1$ occurs according to the transition probabilities
$$p(j \mid i, a)=\mathbb{P}\left(X_{t+1}=j \mid X_t=i, B_t=a\right)$$
It follows from (i) and (ii) that we assume stationarity, since neither $r(i, a)$ nor $p(j \mid i, a)$ depends on $t$. This brings us to the following definition.
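The ingredients above (state space, action sets, stationary rewards, and stationary transition probabilities) can be collected in a small data structure. The following is an illustrative sketch with a hypothetical two-state example; the particular states, actions, rewards, and probabilities are invented for illustration and do not come from the text.

```python
# A minimal sketch of the data of a stationary Markov decision process.
# The two-state example below is hypothetical, purely for illustration.

states = [0, 1]                       # state space S
actions = {0: ["stay", "switch"],     # action space D(i) for each state i
           1: ["stay", "switch"]}

# expected immediate reward r(i, a)
r = {(0, "stay"): 1.0, (0, "switch"): 0.0,
     (1, "stay"): 2.0, (1, "switch"): 0.5}

# transition probabilities p(j | i, a); for each (i, a) they sum to 1 over j
p = {(0, "stay"):   {0: 0.9, 1: 0.1},
     (0, "switch"): {0: 0.2, 1: 0.8},
     (1, "stay"):   {0: 0.1, 1: 0.9},
     (1, "switch"): {0: 0.7, 1: 0.3}}

# sanity check: every p(. | i, a) is a probability distribution over S
for (i, a), dist in p.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12
```

Note that stationarity is built in: the dictionaries `r` and `p` are indexed by $(i, a)$ only, with no dependence on the time $t$.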

## Markov Decision Processes: Discounted Rewards

In this section, we consider the general Markov decision problem with, as criterion, the maximization of the expected present value of the rewards. The discount factor is again denoted by $\beta$ (with $0<\beta<1$). If $C$ is the class of all possible policies, then the optimality criterion reads
$$\max _{\pi \in C} V_\pi(i) \quad \text { for all } i \in S.$$
The value of the objective function after maximization over all possible policies is called the optimal value function and is denoted by $V(i), i \in S$, where $i$ denotes the initial state of the decision process, so
$$V(i)=\max _{\pi \in C} V_\pi(i), \quad i \in S.$$
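The optimal value function can be approximated numerically by iterating the standard recursion $V(i) \leftarrow \max_{a \in D(i)}\left[r(i, a)+\beta \sum_j p(j \mid i, a) V(j)\right]$, which converges for $0<\beta<1$. The sketch below applies this to a hypothetical two-state MDP (the data is invented for illustration, not an example from the text):

```python
# Value-iteration sketch for the discounted criterion.  The recursion
#   V(i) <- max over a in D(i) of [ r(i,a) + beta * sum_j p(j|i,a) V(j) ]
# is a contraction for 0 < beta < 1, so it converges to the optimal
# value function.  The MDP data below is hypothetical.

beta = 0.9
states = [0, 1]
actions = {0: ["stay", "switch"], 1: ["stay", "switch"]}
r = {(0, "stay"): 1.0, (0, "switch"): 0.0,
     (1, "stay"): 2.0, (1, "switch"): 0.5}
p = {(0, "stay"):   {0: 0.9, 1: 0.1},
     (0, "switch"): {0: 0.2, 1: 0.8},
     (1, "stay"):   {0: 0.1, 1: 0.9},
     (1, "switch"): {0: 0.7, 1: 0.3}}

V = {i: 0.0 for i in states}
while True:
    V_new = {i: max(r[(i, a)] + beta * sum(p[(i, a)][j] * V[j] for j in states)
                    for a in actions[i])
             for i in states}
    if max(abs(V_new[i] - V[i]) for i in states) < 1e-10:
        break
    V = V_new

# a policy that is greedy with respect to V is optimal
policy = {i: max(actions[i],
                 key=lambda a: r[(i, a)]
                 + beta * sum(p[(i, a)][j] * V[j] for j in states))
          for i in states}
```

In this toy instance the iteration prefers moving toward the higher-reward state 1 and then staying there, which matches the intuition behind the discounted criterion.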
Example 10.1 (A network model). In this first example, we consider the special case with deterministic state transitions. In this case, a decision $\delta_t(i)$ leads to a known state, so we can also characterize the decision $\delta_t(i)$ by $\delta_t(i)=j \in S$. Then, the Markov decision model description with discounted rewards (present value) is equivalent to the following network description. The states in $S$ correspond to the nodes of a directed graph $G=(S, E)$. In this graph, a decision $\delta_t(i)=j$ is represented by an arc $(i, j) \in E$. The set of arcs leaving $i$ therefore represents all possible decisions in $i$. With every arc (that is, directed edge) $(i, j) \in E$ in $G$ is associated a reward $r(i, j)$; this completes the network description. In this network, for every initial node (that is, initial state), a policy generates a path of infinite length (the length of a path is the number of arcs in the path) with corresponding sequence of rewards.
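In the deterministic case, the value recursion simplifies to $V(i)=\max _{(i, j) \in E}[r(i, j)+\beta V(j)]$: the best discounted infinite path out of node $i$. A sketch on a hypothetical three-node graph (the arcs and rewards are invented for illustration):

```python
# Deterministic special case of the discounted MDP: with arcs E and
# rewards r(i, j), iterate V(i) <- max over arcs (i, j) of r(i,j) + beta*V(j).
# The 3-node graph below is hypothetical, not an example from the text.

beta = 0.5
# arcs of the directed graph G = (S, E), mapped to their rewards r(i, j)
r = {(0, 1): 4.0, (0, 2): 1.0,
     (1, 2): 0.0, (1, 0): 2.0,
     (2, 0): 3.0, (2, 2): 1.0}
nodes = {i for i, _ in r} | {j for _, j in r}

V = {i: 0.0 for i in nodes}
for _ in range(200):  # plenty of iterations for convergence at beta = 0.5
    V = {i: max(rij + beta * V[j] for (k, j), rij in r.items() if k == i)
         for i in nodes}

# the greedy successor of each node traces out the optimal infinite path
succ = {i: max(((j, rij) for (k, j), rij in r.items() if k == i),
               key=lambda t: t[1] + beta * V[t[0]])[0]
        for i in nodes}
```

Here the optimal policy cycles along the arcs $0 \to 1 \to 0$, collecting the rewards $4, 2, 4, 2, \ldots$, whose present value $V(0)=(4+\beta \cdot 2) /\left(1-\beta^2\right)=20/3$ beats every alternative path out of node 0.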

