Markov Decision Process (MDP)


Markov process: example
(Figure: a chain over states 0-5 with transition probabilities, e.g. P(4 | 3), the probability of moving from state 3 to state 4.)

Random walk: another standard example of a Markov process.

Higher-order chains reduce to first-order ones by augmenting the state. If the process is second-order Markov, i.e. P(X_{t+1} | X_t, X_{t-1}, ...) = P(X_{t+1} | X_t, X_{t-1}), define S_t = (X_t, X_{t-1}). For a two-symbol process (e.g. sunny s / rainy r), S_t ∈ {(s,s), (s,r), (r,s), (r,r)}, and S_t is an ordinary (first-order) Markov chain.
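A minimal sketch of this state-augmentation trick in Python; the sunny/rainy labels come from the slide's example, but the particular second-order probabilities below are illustrative assumptions:

```python
import itertools

# Assumed second-order conditionals P(X_{t+1} = 's' | X_t, X_{t-1}) for a
# two-symbol weather process, 's' = sunny, 'r' = rainy (values illustrative).
p_sunny_next = {('s', 's'): 0.8, ('s', 'r'): 0.6, ('r', 's'): 0.4, ('r', 'r'): 0.2}

# Augmented state S_t = (X_t, X_{t-1}): the chain over S_t is first-order.
states = list(itertools.product('sr', repeat=2))

# Build the first-order transition probabilities over augmented states.
# Only window-shifting moves (x_t, x_{t-1}) -> (x_{t+1}, x_t) are possible.
P = {}
for xt, xprev in states:
    p_s = p_sunny_next[(xt, xprev)]
    for xnext, prob in (('s', p_s), ('r', 1.0 - p_s)):
        P[((xt, xprev), (xnext, xt))] = prob

for (src, dst), prob in sorted(P.items()):
    print(f"P({dst} | {src}) = {prob:.1f}")
```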

Markov reward process (MRP)
MRP = Markov process + reward/utility functions.
(Figure: the six-state chain, now annotated with rewards such as Reward 20, Reward 5, Reward 0 and utilities u(S=3), u(S=4), with transition probabilities 0.1, 0.9, 0.2, 0.8, 1.0, 1.0.)

An MRP is defined by the tuple (S, P, U, γ):
- S: state space
- P: state transition probabilities
- U: reward (utility) function
- γ: discount factor
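In code, the tuple (S, P, U, γ) is just a small container; a minimal sketch (the numpy array conventions are an assumption, not from the slides):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MRP:
    """Markov reward process (S, P, U, gamma); states are indices 0..n-1."""
    P: np.ndarray     # (n, n) transition matrix, row S holds P(S, .)
    u: np.ndarray     # (n,) immediate reward u(S) per state
    gamma: float      # discount factor, gamma in [0, 1)
```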

MRP: total reward
(Figure: the chain with rewards 20, 5, 0, 6, 2, 9 and transition probabilities 0.1, 0.9, 0.2, 0.8, 1.0, 1.0.)

The reward u(S) is immediate: it is collected upon visiting state S. Define H(S) as the expected total reward accumulated when starting from state S ("start from here").

Backward induction
Start from the end states, where the total reward is just the immediate reward:
H(S=4) = u(S=4) = 2,  H(S=5) = u(S=5) = 9.

Then step backward through the chain:
H(S=3) = u(S=3) + γ·(0.2·H(S=4) + 0.8·H(S=5)) = 6 + γ·(0.2·2 + 0.8·9),
where γ ∈ [0, 1) is the discount factor.
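Plugging in a concrete discount factor makes the backward step tangible (the slides leave γ unspecified; γ = 0.5 below is an assumed value):

H(S=3) = 6 + 0.5·(0.2·2 + 0.8·9) = 6 + 0.5·7.6 = 9.8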

Continuing backward, H(S=2) and H(S=1) are computed the same way from H(S=3) = u(S=3) + γ·(0.2·2 + 0.8·9).

In general the expected total reward satisfies the recursion
H(S_t) = E[ u(S_t) + γ·H(S_{t+1}) ],
or equivalently, written out over successor states,
H(S) = u(S) + γ · Σ_{S'∈S} P(S, S')·H(S').
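Because the recursion is linear in H, it can also be solved in closed form, H = (I - γP)⁻¹·u. A sketch on the sub-chain the slides fully specify (states 3, 4, 5; γ = 0.5 is an assumed value):

```python
import numpy as np

gamma = 0.5                      # assumed; the slides leave gamma unspecified
# States [3, 4, 5]: state 3 -> 4 w.p. 0.2 and 3 -> 5 w.p. 0.8;
# 4 and 5 are end states, encoded as all-zero rows (no continuation).
P = np.array([[0.0, 0.2, 0.8],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
u = np.array([6.0, 2.0, 9.0])    # u(3), u(4), u(5)

# H = u + gamma * P @ H  <=>  (I - gamma * P) H = u
H = np.linalg.solve(np.eye(3) - gamma * P, u)
print(H)  # [9.8, 2.0, 9.0]: H(3) = 6 + 0.5 * (0.2*2 + 0.8*9)
```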

What if the chain has no end state to start the backward pass from? A useful device is the absorbing state: a state that transitions back to itself with probability 1.0, so the process effectively terminates there.
(Figure: states 0-3 with rewards 20, 5, 0, 6, each ending in a self-loop of probability 1.0.)

MRP: value iteration
Initialize H(S) ← 0 for all S ∈ S, then repeatedly sweep
H(S) ← u(S) + γ · Σ_{S'∈S} P(S, S')·H(S')
over all states until the values converge.
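The same sweep as a function (the convergence tolerance is an assumed detail):

```python
import numpy as np

def mrp_value_iteration(P, u, gamma, tol=1e-8):
    """Sweep H <- u + gamma * P @ H until the values stop changing."""
    H = np.zeros(len(u))
    while True:
        H_next = u + gamma * P @ H
        if np.max(np.abs(H_next - H)) < tol:
            return H_next
        H = H_next
```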

Markov decision process (MDP)
MDP = Markov process + actions + reward functions.
(Figure: from the current state 1, action A1 yields Reward 20 and action A2 yields Reward 5; each action leads to the possible future states 2 and 3 with probabilities 0.1 and 0.9.)

An MDP is defined by the tuple (S, A, P, U, γ):
- S: state space
- A: action space
- P: transition probabilities P(S, A, S'), which now depend on the chosen action
- U: reward function u(S, A), which depends on both the state and the action
- γ: discount factor

Variants include the constrained MDP (CMDP) and the partially observable MDP (POMDP). At every step the agent takes an action (makes a decision); the reward and the transition probabilities are determined by the state-action pair.

Policy
A policy π : S → A maps each state S ∈ S to an action A ∈ A.

Bellman equation
The backward recursion of the MRP, H(S) = u(S) + γ·Σ_{S'∈S} P(S, S')·H(S'), generalizes to state-action values:
H(S, A) = u(S, A) + γ · Σ_{S'∈S} P(S, A, S')·U(S')
U(S) = max_{A∈A} H(S, A)
π(S) = argmax_{A∈A} H(S, A)
This is the Bellman equation; it is solved by backward induction.

Value iteration algorithm
Starting from zero values, apply the Bellman equation repeatedly:
H_{n+1}(S, A) = u(S, A) + γ · Σ_{S'∈S} P(S, A, S')·U_n(S')
U_{n+1}(S) = max_{A∈A} H_{n+1}(S, A)

As pseudocode:
For each state S ∈ S: U_0(S) ← 0
Repeat until convergence:
  For each state S ∈ S:
    For each action A ∈ A:
      compute H_{n+1}(S, A) = u(S, A) + γ · Σ_{S'∈S} P(S, A, S')·U_n(S')
    compute and store U_{n+1}(S) = max_{A∈A} H_{n+1}(S, A)
    compute and store π_{n+1}(S) = argmax_A H_{n+1}(S, A)
Return π(S) and U(S) for all S ∈ S.
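A runnable transcription of this pseudocode; the (|A|, |S|, |S|) / (|S|, |A|) array layout is an assumed convention, not from the slides:

```python
import numpy as np

def value_iteration(P, u, gamma, tol=1e-8):
    """Value iteration for an MDP.

    P: (|A|, |S|, |S|) tensor with P[A, S, S'] = P(S, A, S')
    u: (|S|, |A|) reward table with u[S, A]
    Returns the optimal values U(S) and a greedy policy pi(S).
    """
    U = np.zeros(P.shape[1])
    while True:
        # H[S, A] = u(S, A) + gamma * sum_{S'} P(S, A, S') * U(S')
        H = u + gamma * np.einsum('aij,j->ia', P, U)
        U_next = H.max(axis=1)               # U_{n+1}(S) = max_A H_{n+1}(S, A)
        if np.max(np.abs(U_next - U)) < tol:
            return U_next, H.argmax(axis=1)  # pi(S) = argmax_A H(S, A)
        U = U_next
```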

Policy iteration algorithm
Unlike value iteration, which iterates on the values, policy iteration iterates on the policy itself:
Initialize a policy π_0(S) for all S ∈ S.
Repeat: evaluate the current policy, then improve it to obtain π_{n+1}(S) ∈ A for all S ∈ S, until the policy no longer changes.

Both algorithms rest on the principle of optimality. One sweep of value iteration costs O(|A|·|S|²) operations: for each of the |A|·|S| state-action pairs, a sum over |S| successor states is evaluated. Convergence is to a fixed point, i.e. a solution of f(x) = x where f is the Bellman update.
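A matching sketch of policy iteration, using exact policy evaluation via a linear solve (one of several valid evaluation choices; array conventions as above):

```python
import numpy as np

def policy_iteration(P, u, gamma):
    """Policy iteration: evaluate the current policy, then improve it."""
    n_states = P.shape[1]
    pi = np.zeros(n_states, dtype=int)        # pi_0: arbitrary initial policy
    while True:
        # Policy evaluation: solve U = u_pi + gamma * P_pi @ U exactly.
        P_pi = P[pi, np.arange(n_states), :]
        u_pi = u[np.arange(n_states), pi]
        U = np.linalg.solve(np.eye(n_states) - gamma * P_pi, u_pi)
        # Policy improvement: act greedily with respect to U.
        H = u + gamma * np.einsum('aij,j->ia', P, U)
        pi_next = H.argmax(axis=1)
        if np.array_equal(pi_next, pi):       # policy stable: done
            return U, pi
        pi = pi_next
```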

Example MDP
States: S = {0, 1, 2, 3} (State 0 through State 3, in a line). Actions: A = {Left, Right}.
Reward: -1 for every step moved. Discount factor: 0.5.

Transition matrices (row = current state, column = next state):

P(A = Left) =
  [ 1 0 0 0 ]
  [ 1 0 0 0 ]
  [ 0 1 0 0 ]
  [ 0 0 1 0 ]

P(A = Right) =
  [ 1 0 0 0 ]
  [ 0 0 1 0 ]
  [ 0 0 0 1 ]
  [ 0 0 0 1 ]

Value iteration on this MDP:
Period 1: H = 0.0, 0.0, 0.0, 0.0
Period 2: H = 0.0, -1.0, -1.0, -1.0
Period 3: H = 0.0, -1.0, -1.5, -1.5
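The printed trace can be reproduced directly; in the sketch below the reward table (0 in the absorbing state 0, -1 elsewhere under either action) is inferred from the trace rather than stated on the slides:

```python
import numpy as np

# Action 0 = Left, action 1 = Right; shape (|A|, |S|, |S|).
P = np.array([
    [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],  # Left
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]],  # Right
], dtype=float)
u = np.array([[0, 0], [-1, -1], [-1, -1], [-1, -1]], dtype=float)
gamma = 0.5

U = np.zeros(4)
print("Period 1: H =", U)                    # initialization
for period in (2, 3):
    H = u + gamma * np.einsum('aij,j->ia', P, U)
    U = H.max(axis=1)
    print(f"Period {period}: H =", U)
# Period 1: H = [0. 0. 0. 0.]
# Period 2: H = [ 0. -1. -1. -1.]
# Period 3: H = [ 0.  -1.  -1.5 -1.5]
```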

Application: RF energy trading with a mobile energy gateway
(Slide image credit: Forbes.)

RF energy transfer uses a transmitter (Tx) and receiver (Rx); the received power is governed by the Friis formula, and beamforming can concentrate the transmitted energy.

System model
- Charging stations (electricity chargers): at different fixed locations, e.g. power outlets, base stations (Powercaster Tx and Rx hardware).
- End users of energy: those who need energy but are not covered by chargers.
- Mobile energy gateway: moves around, charging from chargers and transferring energy (wirelessly) to end users.

Buying and selling energy:
- The energy gateway buys from chargers (charging); each charger asks a certain price when charging.
- The energy gateway sells to end users (transferring); more users bring more payments, and a nearer user gets more energy and thus pays more.

(Figure, slide 45: the mobile energy gateway trading RF energy with the end users of energy.)

MDP formulation: state and action
- State: S = (L, E, N, P), where L is the gateway's location, E its energy level, N the number of end users, and P the price asked by the charger; L, E, and N decide the end-user payment.
- Action: A = {0, 1, 2}.

MDP formulation: reward
The distance l between the gateway and the n-th nearest of the N end users has a Beta-type density f(n, l | N), built from the Beta functions B(n + 2/3, N - n + 1) and B(N - n + 1, n) and powers of l³/R³. The expected payment collected from the n-th user is

R(n, E_S) = ∫_0^R f(n, l | N)·r(e_D^n) dl + ∫_R^R̄ f(n, l | N)·r(g·E_S / l²) dl,

and summing over n = 1, ..., N gives the overall payment.

MDP formulation: state transitions
The action-dependent transition matrices have the same structure as in the earlier example; e.g. P(A = 1) contains stochastic rows with entries 0.3 and 0.7, while P(A = 0) contains deterministic rows with entries 1.0 and 0.0.

The MDP is solved with the value iteration algorithm, using pymdptoolbox (the Python MDP toolbox) or mdptoolbox for Matlab.
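A minimal pymdptoolbox usage sketch, applied here to the small Left/Right example from earlier rather than to the energy-trading MDP itself (whose full matrices the slides do not reproduce):

```python
import numpy as np
import mdptoolbox.mdp

# Transitions (|A|, |S|, |S|) and rewards (|S|, |A|) for the 4-state example.
P = np.array([
    [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],  # Left
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]],  # Right
], dtype=float)
R = np.array([[0, 0], [-1, -1], [-1, -1], [-1, -1]], dtype=float)

vi = mdptoolbox.mdp.ValueIteration(P, R, 0.5)  # discount factor 0.5
vi.run()
print(vi.V)       # converged value per state
print(vi.policy)  # greedy action index per state
```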

Benchmark schemes:
- Greedy scheme (GRDY): maximizes the immediate utility.
- Random scheme (RND): randomly takes any action (i.e., 0, 1, or 2) from the action set.
- Location-aware scheme (LOCA): charging at a charger, transferring …
