Markov Decision Process (MDP) — lecture notes
[Figure: a Markov chain over states 1–5; the arrow from state 3 to state 4 is labeled with the transition probability P(4|3).]

Random walk example. A second-order Markov chain satisfies P(X_{t+1} | X_t, X_{t-1}, X_{t-2}, ...) = P(X_{t+1} | X_t, X_{t-1}). It can be turned into a first-order chain by augmenting the state: define S_t = (X_t, X_{t-1}), so that S_t ∈ {(s,s), (s,r), (r,s), (r,r)} when each X_t takes the two values s and r. The random walk is then Markov in the augmented state S_t.
Markov reward process (MRP)

MRP = Markov process + reward/utility functions.

[Figure: the five-state chain annotated with rewards (Reward 20, Reward 5, Reward 0, ...), utilities u(S=3) and u(S=4), and transition probabilities 0.1, 0.9, 0.2, 0.8, 1.0, 1.0.]

An MRP is a tuple (S, P, U):
- S: state space;
- P: state transition probabilities;
- U: reward function (together with a discount factor γ).

MRP example: [Figure: the same chain with rewards 20, 5, 0, 6, 2 and 9 at the states, and transition probabilities 0.1, 0.9, 0.2, 0.8, 1.0, 1.0.]
The reward u(S) is the immediate reward collected at state S. Define H(S) as the total expected reward accumulated "starting from here", i.e., from state S onward.

Backward induction. Start at the terminal states, where the total reward equals the immediate reward: H(S=4) = u(S=4) = 2 and H(S=5) = u(S=5) = 9.
One step back:
H(S=3) = u(S=3) + γ·[0.2·H(S=4) + 0.8·H(S=5)] = 6 + γ·(0.2·2 + 0.8·9),
where γ ∈ [0, 1) is the discount factor. Repeating the same computation backward yields H(S=2) and H(S=1).

In general, the value function of an MRP satisfies the recursion
H(S_t) = E[ u(S_t) + γ·H(S_{t+1}) ],
which, written out over successor states, becomes
H(S) = u(S) + γ · Σ_{S'∈S} P(S, S') · H(S').
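Since the recursion is linear, it can also be solved in closed form as H = (I − γP)^{-1}·u. A minimal numpy sketch on a 3-state slice of the example (states 3, 4, 5); treating states 4 and 5 as terminal (all-zero transition rows) and choosing γ = 0.5 are assumptions made for the demo:

```python
import numpy as np

# Closed-form solution of the MRP recursion H = u + gamma * P @ H,
# i.e., H = (I - gamma * P)^{-1} u.
# Assumed toy chain: states (3, 4, 5); states 4 and 5 are modeled as
# terminal, so their transition rows are all zeros.
u = np.array([6.0, 2.0, 9.0])          # rewards u(3), u(4), u(5)
P = np.array([[0.0, 0.2, 0.8],         # state 3 -> 4 (0.2), -> 5 (0.8)
              [0.0, 0.0, 0.0],         # state 4: terminal
              [0.0, 0.0, 0.0]])        # state 5: terminal
gamma = 0.5                            # demo discount factor

H = np.linalg.solve(np.eye(3) - gamma * P, u)
print(H)  # approximately [9.8, 2.0, 9.0], since H(3) = 6 + 0.5*(0.2*2 + 0.8*9)
```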
An absorbing state is a state that, once entered, is never left: its self-transition probability is 1.0. [Figure: a variant of the example chain in which the end states loop back to themselves with probability 1.0.]
Value iteration for an MRP. Initialize H(S) ← 0 for all S ∈ S, then repeatedly apply the update
H(S) ← u(S) + γ · Σ_{S'∈S} P(S, S') · H(S')
until H(S) converges.
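The update above can be run literally as a fixed-point iteration. A short numpy sketch, reusing the assumed 3-state slice (states 3, 4, 5, with 4 and 5 modeled as terminal) and an assumed γ = 0.5:

```python
import numpy as np

# Fixed-point iteration of the MRP update H <- u + gamma * P @ H,
# repeated until H stops changing.

def mrp_value_iteration(u, P, gamma, tol=1e-10):
    """Apply the MRP Bellman update until convergence."""
    H = np.zeros_like(u)
    while True:
        H_new = u + gamma * P @ H
        if np.max(np.abs(H_new - H)) < tol:
            return H_new
        H = H_new

# Assumed toy chain: states (3, 4, 5); 4 and 5 terminal (all-zero rows).
u = np.array([6.0, 2.0, 9.0])
P = np.array([[0.0, 0.2, 0.8],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
H = mrp_value_iteration(u, P, gamma=0.5)
print(H)   # approximately [9.8, 2.0, 9.0]
```

Here the iteration converges after two sweeps because the chain terminates after one transition; in general convergence is geometric at rate γ.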
Markov decision process (MDP)

MDP = Markov process + actions + reward functions.

[Figure: from the current state 1, action A1 earns Reward 20 and action A2 earns Reward 5; each action leads to the possible future states 2 and 3, e.g., with probabilities 0.1 and 0.9.]

An MDP is a tuple (S, A, P, U):
- S: state space;
- A: action space;
- P: transition probabilities, now conditioned on the chosen action: P(S, A, S');
- U: reward function u(S, A), also depending on the action.

Well-known variants include the constrained MDP (CMDP) and the partially observable MDP (POMDP). At every step the agent takes an action/decision, collects a reward, and the state evolves according to P(S, A, S').
Policy. A policy is a mapping
π : S → A,
which assigns to every state S ∈ S an action A = π(S) ∈ A.
Bellman equation. For an MRP, H(S) = u(S) + γ·Σ_{S'∈S} P(S, S')·H(S'). For an MDP, the action value and the optimal state value satisfy
H(S, A) = u(S, A) + γ · Σ_{S'∈S} P(S, A, S') · U(S'),
U(S) = max_{A∈A} H(S, A),
and the optimal policy is
π(S) = argmax_{A∈A} H(S, A).
These are the Bellman equations; they can be solved by backward induction.

Value iteration algorithm. Initialize U_0(S) ← 0 for all S ∈ S and iterate the Bellman equations:
H_{n+1}(S, A) = u(S, A) + γ · Σ_{S'∈S} P(S, A, S') · U_n(S'),
U_{n+1}(S) = max_{A∈A} H_{n+1}(S, A),
until U_n converges to U(S). In pseudocode:

For each state S: U_0(S) ← 0
Repeat until convergence:
  For each state S:
    For each action A:
      Compute H_{n+1}(S, A) = u(S, A) + γ · Σ_{S'∈S} P(S, A, S') · U_n(S')
    Compute and store U_{n+1}(S) = max_{A∈A} H_{n+1}(S, A)
    Compute and store π_{n+1}(S) = argmax_A H_{n+1}(S, A)
Return π(S), U(S) for all S ∈ S
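The pseudocode above transcribes directly into numpy. In this sketch the array shapes, the toy two-state MDP at the bottom, and γ = 0.9 are illustrative choices, not taken from the slides:

```python
import numpy as np

def value_iteration(P, u, gamma, tol=1e-9):
    """P: (|A|, |S|, |S|) transition tensor, u: (|S|, |A|) rewards."""
    U = np.zeros(P.shape[1])                  # U_0(S) <- 0 for all S
    while True:
        # H_{n+1}(S,A) = u(S,A) + gamma * sum_{S'} P(S,A,S') U_n(S')
        H = u + gamma * (P @ U).T
        U_new = H.max(axis=1)                 # U_{n+1}(S) = max_A H_{n+1}(S,A)
        if np.max(np.abs(U_new - U)) < tol:
            return U_new, H.argmax(axis=1)    # U(S) and pi(S) = argmax_A H(S,A)
        U = U_new

# Illustrative 2-state MDP: action 0 = stay, action 1 = switch states;
# staying earns 1 in state 0 and 2 in state 1, switching earns 0.
P = np.array([[[1., 0.], [0., 1.]],           # transitions under "stay"
              [[0., 1.], [1., 0.]]])          # transitions under "switch"
u = np.array([[1., 0.],
              [2., 0.]])
U, pi = value_iteration(P, u, gamma=0.9)
print(U, pi)   # U close to [18, 20]; pi = [1, 0] (switch in 0, stay in 1)
```

The optimal values can be checked by hand: staying forever in state 1 earns 2/(1 − 0.9) = 20, and switching from state 0 earns 0 + 0.9·20 = 18.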
Policy iteration algorithm. An alternative to value iteration: start from an arbitrary initial policy π_0(S), ∀S ∈ S, then alternate policy evaluation (compute the value of the current policy) and policy improvement (set π_{n+1}(S) to the best action A, ∀S ∈ S) until the policy no longer changes.

Both algorithms rely on Bellman's principle of optimality. One sweep of value iteration costs O(|A|·|S|²) operations, since for each of the |S| states and |A| actions the update sums over |S| successor states; the optimal value function is the fixed point of the Bellman update, i.e., a solution of f(x) = x.
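A minimal policy-iteration sketch under the same array conventions (P of shape (|A|, |S|, |S|), u of shape (|S|, |A|)); the two-state demo MDP and γ = 0.9 are again illustrative, not from the slides:

```python
import numpy as np

def policy_iteration(P, u, gamma):
    """Alternate exact policy evaluation and greedy improvement."""
    n_states = P.shape[1]
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy pi_0
    while True:
        # Policy evaluation: solve U = u_pi + gamma * P_pi @ U exactly.
        P_pi = P[pi, np.arange(n_states)]       # row for state s under pi(s)
        u_pi = u[np.arange(n_states), pi]
        U = np.linalg.solve(np.eye(n_states) - gamma * P_pi, u_pi)
        # Policy improvement: pi_{n+1}(S) = argmax_A H(S, A).
        H = u + gamma * (P @ U).T
        pi_new = H.argmax(axis=1)
        if np.array_equal(pi_new, pi):          # policy stable -> optimal
            return U, pi
        pi = pi_new

# Illustrative 2-state MDP: action 0 = stay, action 1 = switch states.
P = np.array([[[1., 0.], [0., 1.]],
              [[0., 1.], [1., 0.]]])
u = np.array([[1., 0.],
              [2., 0.]])
U, pi = policy_iteration(P, u, gamma=0.9)
print(U, pi)   # U = [18, 20] (exact); pi = [1, 0]
```

Unlike value iteration, each evaluation step here is exact (a linear solve), so the loop terminates after a finite number of policy changes rather than converging geometrically.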
Example. States S = {0, 1, 2, 3} (State 0 through State 3 in a row); actions A = {Left, Right}. Reward: -1 for every step moved. Discount factor: 0.5.

Transition matrices:

P(A=Left) =
[ 1 0 0 0
  1 0 0 0
  0 1 0 0
  0 0 1 0 ]

P(A=Right) =
[ 1 0 0 0
  0 0 1 0
  0 0 0 1
  0 0 0 1 ]

Value iteration on this MDP:
Period 1: H = 0.0, 0.0, 0.0, 0.0
Period 2: H = 0.0, -1.0, -1.0, -1.0 (Left/Right tie)
Period 3: H = 0.0, -1.0, -1.5, -1.5
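The three periods can be reproduced in a few lines of numpy. Treating state 0 as costless (reward 0, since no move is made there) is an assumption, made so that H(0) stays at 0.0 as in the slides:

```python
import numpy as np

# Value iteration on the 4-state Left/Right chain: reward -1 per step
# moved (assumed 0 in state 0, where both actions stay put), gamma = 0.5.
P = {                                   # P[a][s, s'] = transition probability
    "Left":  np.array([[1, 0, 0, 0],
                       [1, 0, 0, 0],
                       [0, 1, 0, 0],
                       [0, 0, 1, 0]], dtype=float),
    "Right": np.array([[1, 0, 0, 0],
                       [0, 0, 1, 0],
                       [0, 0, 0, 1],
                       [0, 0, 0, 1]], dtype=float),
}
u = np.array([0.0, -1.0, -1.0, -1.0])   # u(s): same cost for both actions
gamma = 0.5

U = np.zeros(4)                         # Period 1: the initialization
print("Period 1: U =", U)
for period in (2, 3):
    H = {a: u + gamma * P[a] @ U for a in P}   # H(s, a)
    U = np.maximum(H["Left"], H["Right"])      # U(s) = max_a H(s, a)
    print(f"Period {period}: U =", U)
# The final line matches the slides: 0.0, -1.0, -1.5, -1.5
```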
[Image copyright: Forbes]
Application: wireless RF energy transfer. Energy is radiated from a Tx to an Rx (received power follows the Friis formula; beamforming can focus it), e.g., the Powercaster Tx and Rx, or a charging station.

The players:
- Electricity chargers: at different fixed locations, e.g., power outlets, base stations.
- End users of energy: those who need energy but are not covered by chargers.
- Mobile energy gateway: moves around and charges/transfers energy (wirelessly).

Buying and selling energy:
- The energy gateway buys from chargers (charging); each charger asks a certain price when charging.
- The energy gateway sells to end users (transferring); more users bring more payments, and a near user gets more energy, thus higher payments.

The mobile energy gateway thus mediates over the RF link between the chargers and the end users of energy.
MDP formulation. The state is S = (L, E, N, P): the gateway's location L, its energy level E, the number of end users N (which decides the end-user payment), and the price P. The action set is A = {0, 1, 2}.

The distance l from the gateway to the n-th nearest of the N end users follows an order-statistics density f(n, l | N) over the coverage radius R, expressed through Beta functions B(·, ·). The expected payment R(n, E_S) from the n-th user is obtained by integrating f(n, l | N) against the payment for the energy that user receives, which decays with distance as l^{-2} (Friis); summing over n gives the overall payment.

State transitions are specified per action; e.g., P(A=1) contains rows with the probabilities (…, 0.3, 0.7, …), while P(A=0) contains rows with (…, 1.0, 0.0, …).
The formulated MDP is solved with the value iteration algorithm, implemented with pymdptoolbox (the Python port of the Matlab MDP toolbox, mdptoolbox).

Baseline schemes for comparison:
- Greedy scheme (GRDY): maximizing the immediate utility;
- Random scheme (RND): randomly taking any action (i.e., 0, 1, or 2) from the action set;
- Location-aware scheme (LOCA): charging at a charger, transferring