Markov Decision Process (MDP)

Outline:
- Markov process
- Markov reward process (MRP)
- Markov decision process (MDP)
- Solving an MDP: the Bellman equation, value iteration, and policy iteration
- A worked MDP example
- An application: an MDP for mobile wireless energy trading

Markov process

The Markov property: the next state depends only on the current state, not on the rest of the history:

P(X_{t+1} | X_t, X_{t-1}, X_{t-2}, ...) = P(X_{t+1} | X_t)

Here X_t is the current state, X_{t+1} the next state, and X_{t-1}, X_{t-2}, ... the earlier history. A Markov process is a sequence of random states (X_t, t in I) with this property.

[Figure: a chain over states {0, 1, 2, 3, 4, 5}; each arrow carries a one-step transition probability such as P(4 | 3).]

Random walk example. A process can be second-order Markov:

P(X_{t+1} | X_t, X_{t-1}, X_{t-2}, ...) = P(X_{t+1} | X_t, X_{t-1})

Such a process is made first-order Markov by state augmentation: define S_t = (X_t, X_{t-1}). For a walk with two elementary states s and r, the augmented state S_t takes values in {(s,s), (s,r), (r,s), (r,r)}.

Markov reward process (MRP)

MRP = Markov process + reward/utility function: each state S carries an immediate reward (utility) u(S).

[Figure: the six-state chain with rewards attached to the states, e.g. Reward 20, Reward 5, Reward 0, and the labels u(S=3), u(S=4); branch probabilities 0.1/0.9 and 0.2/0.8, and probability 1.0 at the terminal states.]

An MRP is the tuple (S, P, U, gamma):
- S: state space
- P: state transition probabilities P(S, S')
- U: reward (utility) function u(S)
- gamma: discount factor, gamma in [0, 1)

The reward u(S) is "immediate": it is collected upon visiting S. The quantity we want is H(S), the expected total discounted reward when "starting from here", i.e., from state S.

Solving an MRP: backward induction

[Figure: the chain with u(S=3) = 6, u(S=4) = 2, u(S=5) = 9; state 3 moves to state 4 with probability 0.2 and to state 5 with probability 0.8; states 4 and 5 are terminal.]

Start from the terminal states:
H(S=4) = u(S=4) = 2
H(S=5) = u(S=5) = 9

Then "roll back" one step:
H(S=3) = u(S=3) + gamma [0.2 H(S=4) + 0.8 H(S=5)] = 6 + gamma (0.2 x 2 + 0.8 x 9), with gamma in [0, 1).

Continuing backwards yields H(S=2) and H(S=1) in the same way. In general:

H(S_t) = E[ u(S_t) + gamma H(S_{t+1}) ]
H(S) = u(S) + gamma SUM_{S' in S} P(S, S') H(S')

Backward induction needs somewhere to start: it requires absorbing states, i.e., terminal states that transition to themselves with probability 1.0.

[Figure: a variant of the chain in which the end states are absorbing, each with self-transition probability 1.0.]

Solving an MRP: value iteration

When there is no convenient terminal structure, iterate instead:
1. Initialize H(S) <- 0 for all S in S.
2. Repeatedly apply H(S) <- u(S) + gamma SUM_{S' in S} P(S, S') H(S') until H converges.

Markov decision process (MDP)

MDP = Markov process + actions + reward functions. In each state the agent chooses an action; the chosen action determines both the reward and the distribution over next states.

[Figure: from current state 1, choice 1 is action A1 with Reward 20, choice 2 is action A2 with Reward 5; either leads to possible future states 2 and 3 with probabilities 0.1 and 0.9.]

An MDP is the tuple (S, A, P, U, gamma):
- S: state space
- A: action set
- P: state transition probabilities P(S, A, S')
- U: reward function u(S, A)
- gamma: discount factor

(Variants include the constrained MDP, CMDP, and the partially observable MDP, POMDP.)

Solving an MDP means finding, in every state, the best action/decision, i.e., the one that maximizes the expected total discounted reward. The solution is a policy, a mapping from states to actions:

pi : S -> A

Bellman equation

For an MRP: H(S) = u(S) + gamma SUM_{S' in S} P(S, S') H(S')

For an MDP:
H(S, A) = u(S, A) + gamma SUM_{S' in S} P(S, A, S') U(S')
U(S) = max_{A in A} H(S, A)
pi(S) = argmax_A H(S, A)

Unlike the MRP case, where backward induction from absorbing states applies, the Bellman equation of an MDP is in general solved iteratively.

Solving the Bellman equation: value iteration

Initialize U_0(S) <- 0 for all S in S, and iterate:
H_{n+1}(S, A) = u(S, A) + gamma SUM_{S' in S} P(S, A, S') U_n(S')
U_{n+1}(S) = max_{A in A} H_{n+1}(S, A)

Value iteration algorithm:
For each state S: U_0(S) <- 0
Repeat until convergence:
  For each state S:
    For each action A:
      compute H_{n+1}(S, A) = u(S, A) + gamma SUM_{S' in S} P(S, A, S') U_n(S')
    compute and store pi_{n+1}(S) = argmax_A H_{n+1}(S, A)
    compute and store U_{n+1}(S) = max_{A in A} H_{n+1}(S, A)
Return pi(S), U(S) for all S in S

Solving the Bellman equation: policy iteration

Like value iteration, but the iteration is over policies rather than values:
1. Initialize an arbitrary policy pi_0(S) for all S in S.
2. Policy evaluation: compute the value of the current policy.
3. Policy improvement: choose pi_{n+1}(S) : S -> A greedily with respect to the evaluated values, for all S in S.
4. Repeat steps 2-3 until the policy is stable.

Both algorithms rest on the principle of optimality. One value-iteration sweep costs O(|A| |S|^2), where |A| is the number of actions and |S| the number of states; convergence is a fixed-point computation, i.e., solving an equation of the form f(x) = x.

A worked MDP example

[Figure: four states in a row, State 0 through State 3.]
S = {0, 1, 2, 3}; A = {Left, Right}
Reward: -1 for every step moved
Discount factor: 0.5

Transition matrices (rows: current state 0..3; columns: next state 0..3):

P(A = Left) =
[ 1 0 0 0 ]
[ 1 0 0 0 ]
[ 0 1 0 0 ]
[ 0 0 1 0 ]

P(A = Right) =
[ 1 0 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 1 ]
[ 0 0 0 1 ]

Value iteration:
Period 1: H = (0.0, 0.0, 0.0, 0.0)
Period 2: H = (0.0, -1.0, -1.0, -1.0)
Period 3: H = (0.0, -1.0, -1.5, -1.5); the greedy action at state 1 is now Left.

An application: an MDP for mobile wireless energy trading

Background: RF (wireless) energy transfer with dedicated transmitters and receivers (Tx/Rx), governed by the Friis formula and improved by beamforming. [Photos: a Powercaster transmitter and receiver; a charging station. Image copyright: Forbes.]

Three roles:
- Electricity chargers: at fixed locations, e.g., power outlets or base stations.
- End users of energy: those who need energy but are not covered by any charger.
- Mobile energy gateway: moves around, charging from chargers and transferring energy (wirelessly) to end users.

Buying and selling energy:
- The energy gateway buys from chargers (charging); each charger asks a certain price when charging.
- The energy gateway sells to end users (transferring): more users, more payments; a nearer user receives more energy and thus pays more.

The gateway's problem of when to "buy" and when to "sell" is formulated as an MDP.

MDP formulation: state
S = (L, E, N, P): the gateway's location L, its energy level E, the number of end users N (the distance to a user decides that end user's payment), and the charging price P.

MDP formulation: action
A = {0, 1, 2}

MDP formulation: reward
With f(n, l | N) the (order-statistic, Beta-function-based) distribution of the distance l to the n-th nearest of the N end users, the expected payment from user n given gateway energy E_S is

R(n, E_S) = INT_0^{R*} f(n, l | N) r(e_D n) dl + INT_{R*}^{R} f(n, l | N) r(g E_S l^{-2}) dl

Summing over the users n gives the overall expected payment.

MDP formulation: state transitions
One transition matrix per action. [The slides give example matrices P(A = 1) and P(A = 0).]

Solving it
The Bellman equation is solved with the value iteration algorithm via pymdptoolbox, a Python MDP toolbox (mdptoolbox also exists for Matlab).

Baseline schemes for comparison:
- Greedy scheme (GRDY): maximize the immediate utility.
- Random scheme (RND): randomly take any action (i.e., 0, 1, or 2) from the action set.
- Location-aware scheme (LOCA): charging at a charger, transferring ...
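The state-augmentation trick from the random walk section (defining S_t = (X_t, X_{t-1})) can be sketched concretely. This is a minimal sketch: the second-order probabilities `p_next` are illustrative placeholders of my own, not values from the slides.

```python
# Turn a second-order chain over {s, r} into a first-order Markov chain
# by augmenting the state: S_t = (X_t, X_{t-1}).
# The second-order probabilities below are illustrative placeholders.
p_next = {                    # P(X_{t+1} = "s" | X_t, X_{t-1})
    ("s", "s"): 0.75, ("s", "r"): 0.5,
    ("r", "s"): 0.5,  ("r", "r"): 0.25,
}
states = [("s", "s"), ("s", "r"), ("r", "s"), ("r", "r")]

# First-order transition structure over the augmented states:
# (X_t, X_{t-1}) -> (X_{t+1}, X_t); only transitions that shift X_t exist.
P = {S: {} for S in states}
for (xt, xprev) in states:
    ps = p_next[(xt, xprev)]
    P[(xt, xprev)][("s", xt)] = ps
    P[(xt, xprev)][("r", xt)] = 1.0 - ps

print(P[("s", "r")])   # {('s', 's'): 0.5, ('r', 's'): 0.5}
```

Each augmented state has exactly two successors, and the augmented chain satisfies the first-order Markov property by construction.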
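The backward-induction computation from the MRP section can also be checked numerically. This is a minimal sketch covering only the specified tail of the chain (state 3 branching 0.2/0.8 to absorbing states 4 and 5); the discount factor gamma = 0.5 is an illustrative choice, since the slides leave it symbolic.

```python
# Backward induction on the absorbing tail of the example MRP.
# Assumption: gamma = 0.5 (illustrative; the slides keep gamma symbolic).
gamma = 0.5
u = {3: 6, 4: 2, 5: 9}           # immediate rewards u(S)
P = {3: {4: 0.2, 5: 0.8}}        # transitions out of the non-absorbing state

H = {4: u[4], 5: u[5]}           # absorbing states: H(S) = u(S)
# Roll back one step: H(3) = u(3) + gamma * sum_S' P(3,S') H(S')
H[3] = u[3] + gamma * sum(p * H[s2] for s2, p in P[3].items())
print(H[3])   # 6 + 0.5 * (0.2*2 + 0.8*9) = 9.8
```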
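The value-iteration trace for the four-state Left/Right example can be reproduced in a few lines. A minimal pure-Python sketch, following the slides' U_0(S) = 0 initialization; treating state 0 as the zero-reward absorbing goal is an assumption consistent with the trace:

```python
# Value iteration for the 4-state corridor MDP:
# S = {0,1,2,3}, A = {Left, Right}, reward -1 for every step moved,
# state 0 assumed absorbing with reward 0, discount factor 0.5.
P = {
    "Left":  [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
    "Right": [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]],
}

def u(s, a):                       # immediate reward: -1 per move, 0 at goal
    return 0.0 if s == 0 else -1.0

gamma = 0.5
U = [0.0, 0.0, 0.0, 0.0]           # U_0(S) = 0 for all S
trace = [U[:]]
for _ in range(3):
    # H_{n+1}(S,A) = u(S,A) + gamma * sum_S' P(S,A,S') U_n(S')
    H = {a: [u(s, a) + gamma * sum(P[a][s][s2] * U[s2] for s2 in range(4))
             for s in range(4)] for a in P}
    U = [max(H[a][s] for a in P) for s in range(4)]   # U_{n+1}(S)
    trace.append(U[:])

print(trace[1])   # [0.0, -1.0, -1.0, -1.0]   (the slides' Period 2)
print(trace[2])   # [0.0, -1.0, -1.5, -1.5]   (the slides' Period 3)
```

One more sweep gives [0.0, -1.0, -1.5, -1.75], which is already the fixed point: always walking Left toward state 0 is optimal.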
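Policy iteration can be contrasted with value iteration on the same corridor MDP: evaluate the current policy, improve it greedily, and stop when it is stable. A sketch under the same assumptions (pure Python; the evaluation tolerance and the initial all-Right policy are illustrative choices of mine):

```python
# Policy iteration for the 4-state corridor MDP.
P = {
    "Left":  [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
    "Right": [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]],
}
u = lambda s: 0.0 if s == 0 else -1.0   # -1 per move, 0 at the goal state
gamma = 0.5

def evaluate(pi):                        # iterative policy evaluation
    H = [0.0] * 4
    while True:
        H2 = [u(s) + gamma * sum(P[pi[s]][s][s2] * H[s2] for s2 in range(4))
              for s in range(4)]
        if max(abs(a - b) for a, b in zip(H, H2)) < 1e-12:
            return H2
        H = H2

pi = ["Right"] * 4                       # arbitrary initial policy pi_0
while True:
    H = evaluate(pi)                     # step 2: policy evaluation
    pi2 = [max(P, key=lambda a: u(s) + gamma *
               sum(P[a][s][s2] * H[s2] for s2 in range(4)))
           for s in range(4)]            # step 3: greedy improvement
    if pi2 == pi:                        # policy stable -> optimal
        break
    pi = pi2

print(pi)   # ['Left', 'Left', 'Left', 'Left']
```

Here the policy stabilizes after two improvement steps, and evaluating the final policy recovers the same values value iteration converges to: (0.0, -1.0, -1.5, -1.75).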