Markov Decision Process (MDP)



Outline

- Markov process
- Markov reward process (MRP)
- Solving an MRP
- Markov decision process (MDP)
- The MDP model
- Solving an MDP
- An MDP example
- An MDP application

Markov Process

A stochastic process (X_t, t ∈ I) is a Markov process if it has the Markov property:

    P(X_{t+1} | X_t, X_{t-1}, X_{t-2}, ...) = P(X_{t+1} | X_t),

i.e., the next state X_{t+1} depends only on the current state X_t; the earlier history X_{t-1}, X_{t-2}, ... adds nothing.

[Figure: a five-state chain (states 1-5), with P(4|3) marking the probability of moving from state 3 to state 4.]

Example: a random walk is a Markov process — the distribution of the next position depends only on the current position.

A process with second-order dependence,

    P(X_{t+1} | X_t, X_{t-1}, X_{t-2}, ...) = P(X_{t+1} | X_t, X_{t-1}),

is not Markov in X_t alone, but becomes Markov after augmenting the state: define S_t = (X_t, X_{t-1}). With two underlying values s and r (presumably sunny/rainy weather in the original example), the augmented state takes values S_t ∈ {(s,s), (s,r), (r,s), (r,r)}, and (S_t) is again a Markov process.
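A minimal Python sketch of this state-augmentation trick; the two-value weather process and its transition probabilities below are illustrative assumptions, not values from the slides.

```python
import random

# Hypothetical second-order weather process: P(X_{t+1} = 's') depends on
# the pair (X_t, X_{t-1}), so the raw process is not Markov in X_t alone.
P_SUNNY = {('s', 's'): 0.8, ('s', 'r'): 0.6, ('r', 's'): 0.4, ('r', 'r'): 0.2}

def step(x_t, x_prev):
    """Sample X_{t+1} given the last two values."""
    return 's' if random.random() < P_SUNNY[(x_t, x_prev)] else 'r'

def simulate(n_steps, x0='s', x1='s'):
    """Simulate X_t, then return the augmented states S_t = (X_t, X_{t-1}),
    which form a first-order Markov chain on {(s,s), (s,r), (r,s), (r,r)}."""
    xs = [x0, x1]
    for _ in range(n_steps):
        xs.append(step(xs[-1], xs[-2]))
    return list(zip(xs[1:], xs[:-1]))

print(simulate(5))  # e.g. [('s', 's'), ('r', 's'), ('s', 'r'), ...]
```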

Markov Reward Process (MRP)

Markov reward process (MRP)

MRP = Markov process + reward/utility functions: on top of the transition structure, each state S yields a reward (utility) u(S).

[Figure: the example chain, with per-state rewards — Reward 20, Reward 5, Reward 0, plus two states labeled u(S=3) and u(S=4) — and transition probabilities 0.1/0.9, 0.2/0.8, 1.0, 1.0.]

An MRP is defined by the tuple (S, P, U, γ):
- S: the state space;
- P: the state transition probabilities;
- U: the reward/utility function;
- γ: the discount factor.

Evaluating an MRP. The value H(S) of a state is not just its immediate reward u(S): it is the total (discounted) reward expected when "starting from here", i.e., from S onward.

[Figure: the example chain with all rewards filled in — Reward 20, Reward 5, Reward 0, Reward 6 (state 3), Reward 2 (state 4), Reward 9 (state 5) — and transition probabilities 0.1/0.9, 0.2/0.8, 1.0, 1.0.]

Solving an MRP by backward induction: begin at the final states, whose value is simply their reward:

    H(S=4) = u(S=4) = 2,    H(S=5) = u(S=5) = 9.

Then step backward: the value of state 3 is its immediate reward plus the discounted expected value of its successors,

    H(S=3) = u(S=3) + γ [0.2 H(S=4) + 0.8 H(S=5)] = 6 + γ (0.2×2 + 0.8×9) = 6 + 7.6 γ,

where γ ∈ [0, 1) is the discount factor.

Continuing backward in the same way gives H(S=2) and then H(S=1). In general the value satisfies the recursion

    H(S_t) = E[ u(S_t) + γ H(S_{t+1}) ],

or, summing explicitly over successor states,

    H(S) = u(S) + γ Σ_{S' ∈ S} P(S, S') H(S').

Backward induction needs an end to start from. Chains that terminate do so in absorbing states: states that transition only to themselves, with self-transition probability 1.0.

[Figure: a variant of the example chain whose end states are absorbing, with self-loop probability 1.0.]

An MRP can also be solved by value iteration: initialize H(S) ← 0 for all S ∈ S, then repeatedly sweep the update

    H(S) ← u(S) + γ Σ_{S' ∈ S} P(S, S') H(S')

over all states until H(S) converges.
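A minimal NumPy sketch of this sweep, run on a three-state fragment mirroring the S=3/4/5 example above (rewards 6, 2, 9; branch probabilities 0.2/0.8). The terminal states are given empty transition rows so that their value equals their immediate reward, matching the backward-induction result; γ = 0.9 is an assumed value.

```python
import numpy as np

def mrp_value_iteration(P, u, gamma, tol=1e-8):
    """Iterate H <- u + gamma * P @ H until the values stop changing."""
    H = np.zeros(len(u))                 # H(S) <- 0 for all S
    while True:
        H_new = u + gamma * P @ H        # Bellman update, all states at once
        if np.max(np.abs(H_new - H)) < tol:
            return H_new
        H = H_new

# State 3 branches 0.2/0.8 to the terminal states 4 and 5; the terminal
# rows are all-zero, so those states collect their reward exactly once.
P = np.array([[0.0, 0.2, 0.8],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
u = np.array([6.0, 2.0, 9.0])
print(mrp_value_iteration(P, u, gamma=0.9))  # [12.84  2.  9.], i.e. 6 + 7.6*0.9
```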

Markov Decision Process (MDP)

Markov decision process (MDP)

MDP = Markov process + actions + reward functions: at each step the decision maker chooses an action, and the chosen action determines both the reward and the transition probabilities.

[Figure: from the current state 1, action A1 (Reward 20) and action A2 (Reward 5) each lead to the possible future states 2 and 3, with probabilities 0.1 and 0.9.]

An MDP is defined by the tuple (S, A, P, U, γ):
- S: the state space;
- A: the action set;
- P: the state transition probabilities P(S, A, S');
- U: the reward function;
- γ: the discount factor.

Common variants include the constrained MDP (CMDP) and the partially observable MDP (POMDP).

What is new relative to an MRP is the action (decision): at each step the decision maker observes the state and picks an action, and both the reward and the transition probabilities P now depend on the pair (S, A) rather than on S alone.

Solving an MDP

Solving an MDP means finding the best "rule of behavior": a policy

    π : S ↦ A,

i.e., a mapping from states S to actions A. The target is the optimal policy π* : S ↦ A.

The Bellman equation. For an MRP we had

    H(S) = u(S) + γ Σ_{S' ∈ S} P(S, S') H(S').

For an MDP, the value of taking action A in state S is

    H(S, A) = u(S, A) + γ Σ_{S' ∈ S} P(S, A, S') U(S'),

and the state value and greedy policy follow as

    U(S) = max_{A ∈ A} H(S, A),    π(S) = argmax_A H(S, A).

As with an MRP, the Bellman equation can be solved by backward induction when the process terminates in absorbing states. For the general "looping" case it is solved iteratively, e.g., with the value iteration algorithm: start from all-zero values and apply the Bellman equation repeatedly until convergence.

[Figure: the two-action example again (Action A1, Reward 20; Action A2, Reward 5; probabilities 0.1/0.9).]

Value iteration algorithm:

    For each state S ∈ S:  U_0(S) ← 0
    Repeat until convergence:
        For each state S ∈ S:
            For each action A ∈ A:
                compute  H_{n+1}(S, A) = u(S, A) + γ Σ_{S' ∈ S} P(S, A, S') U_n(S')
            compute and store  U_{n+1}(S) = max_{A ∈ A} H_{n+1}(S, A)
            compute and store  π_{n+1}(S) = argmax_A H_{n+1}(S, A)
    Return π(S), U(S) for all S ∈ S
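A direct NumPy rendering of this loop (a sketch: the array shapes and the stopping tolerance are my choices, not the slides').

```python
import numpy as np

def value_iteration(P, u, gamma, tol=1e-8):
    """P: transition tensor of shape (n_actions, n_states, n_states), P[A, S, S'].
    u: rewards u(S, A), shape (n_states, n_actions).
    Returns the converged values U(S) and the greedy policy pi(S)."""
    U = np.zeros(u.shape[0])               # U_0(S) <- 0 for all S
    while True:
        # H_{n+1}(S, A) = u(S, A) + gamma * sum_{S'} P(S, A, S') U_n(S')
        H = u + gamma * (P @ U).T           # (P @ U) has shape (A, S); transpose to (S, A)
        U_new = H.max(axis=1)               # U_{n+1}(S) = max_A H_{n+1}(S, A)
        if np.max(np.abs(U_new - U)) < tol:
            return U_new, H.argmax(axis=1)  # pi(S) = argmax_A H(S, A)
        U = U_new
```

The Left/Right example later in the deck uses exactly this update.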

The Bellman equation can also be solved with the policy iteration algorithm. Where value iteration iterates on the values, policy iteration iterates on the policy itself: initialize some policy π_0(S) for all S ∈ S, then alternately evaluate the current policy through the Bellman equation and improve it greedily, producing π_{n+1}(S) : S ↦ A for all S ∈ S, until the policy stops changing.
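A compact sketch of that alternation, with policy evaluation done by solving the linear Bellman system exactly (the array shapes follow the value-iteration sketch above and are likewise my assumptions).

```python
import numpy as np

def policy_iteration(P, u, gamma):
    """P: (n_actions, n_states, n_states); u: (n_states, n_actions)."""
    n_states = u.shape[0]
    pi = np.zeros(n_states, dtype=int)           # pi_0: arbitrary initial policy
    while True:
        # Policy evaluation: solve H = u_pi + gamma * P_pi H as a linear system.
        P_pi = P[pi, np.arange(n_states), :]     # P_pi[S, S'] = P(S, pi(S), S')
        u_pi = u[np.arange(n_states), pi]
        H = np.linalg.solve(np.eye(n_states) - gamma * P_pi, u_pi)
        # Policy improvement: act greedily with respect to H.
        pi_new = (u + gamma * (P @ H).T).argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, H                         # policy stopped changing
        pi = pi_new
```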

Both algorithms rest on the Bellman equation's principle of optimality: an optimal policy makes an optimal decision at every state it can reach. Each value-iteration sweep costs O(|A||S|²), which is cheap compared with the |A|^{|S|} candidate policies a naive enumeration would face. The Bellman update is a fixed-point equation of the form f(x) = x, and the iteration converges to that fixed point.
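Why the iteration converges to the fixed point — a standard contraction argument, sketched here in the deck's notation (this step is not spelled out on the slides).

```latex
\[
  (TU)(S) \;=\; \max_{A \in \mathcal{A}} \Big[\, u(S,A)
      + \gamma \sum_{S' \in \mathcal{S}} P(S,A,S')\, U(S') \,\Big]
\]
For any two value functions $U, V$ and every state $S$,
\[
  \big| (TU)(S) - (TV)(S) \big|
      \;\le\; \gamma \max_{A} \sum_{S'} P(S,A,S')\, \big| U(S') - V(S') \big|
      \;\le\; \gamma \, \lVert U - V \rVert_\infty ,
\]
so $\lVert TU - TV \rVert_\infty \le \gamma \lVert U - V \rVert_\infty$. With
$\gamma < 1$ the Bellman operator $T$ is a contraction: $TU = U$ (the
``$f(x)=x$'' above) has a unique solution, and $U_{n+1} = TU_n$ converges to it.
```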

An MDP Example

A gridworld on a line: four states S = {0, 1, 2, 3} and two actions A = {Left, Right}. Reward: -1 for every step moved. Discount factor: γ = 0.5.

[Figure: states 0-3 in a row; Left moves one cell left, Right one cell right.]

The transition matrices (rows: current state 0-3; columns: next state 0-3) are

    P(A = Left)  =  [ 1 0 0 0 ]        P(A = Right) =  [ 1 0 0 0 ]
                    [ 1 0 0 0 ]                        [ 0 0 1 0 ]
                    [ 0 1 0 0 ]                        [ 0 0 0 1 ]
                    [ 0 0 1 0 ]                        [ 0 0 0 1 ]

so state 0 is absorbing under both actions (and, since the trace below keeps its value at 0, it accrues no further reward).

Value iteration trace:

    Period 1:  H = 0.0   0.0   0.0   0.0
    Period 2:  H = 0.0  -1.0  -1.0  -1.0    action: Left/Right (tie)
    Period 3:  H = 0.0  -1.0  -1.5  -1.5
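The whole example fits in a few lines of NumPy. This sketch reproduces the trace above; the one assumption (read off the trace) is that the absorbing state 0 yields reward 0 while every other state-action pair yields -1.

```python
import numpy as np

# Transition tensor P[a, s, s']: action 0 = Left, action 1 = Right (from the slides).
P = np.array([
    [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],  # Left
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]],  # Right
], dtype=float)

u = np.full((4, 2), -1.0)  # u(S, A) = -1 for every step moved ...
u[0, :] = 0.0              # ... but the absorbing state 0 yields 0
gamma = 0.5

U = np.zeros(4)                    # Period 1: H = 0.0  0.0  0.0  0.0
for period in (2, 3):
    H = u + gamma * (P @ U).T      # H(S, A) = u(S, A) + gamma * E[U(S')]
    U = H.max(axis=1)
    print(f"Period {period}: H = {U}")
# Period 2: H = [ 0.  -1.  -1.  -1. ]
# Period 3: H = [ 0.  -1.  -1.5 -1.5]
```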

An MDP Application: Mobile Energy Gateway

[Figure: motivating illustration; copyright: Forbes.]

Background: RF energy transmission and reception (Tx/Rx), the Friis transmission formula, and beamforming; e.g., the Powercaster Tx and Rx. [Figure: charging station.]

The system has three kinds of nodes:
- Electricity chargers: at different fixed locations, e.g., power outlets, base stations.
- End users of energy: those who need energy, but are not covered by chargers.
- Mobile energy gateway: moving and charging/transferring energy (wirelessly).

The gateway buys and sells energy:
- Buying (charging): the energy gateway buys from chargers; each charger asks a certain price when charging.
- Selling (transferring): the energy gateway sells to end users; more users means more payments, and a near user gets more energy and thus pays more.

The gateway therefore has to decide, as it moves, when to "buy" and when to "sell". [Figure: mobile energy gateway exchanging RF energy with end users.]

The problem is formulated as an MDP:
- State: S = (L, E, N, P), with L the gateway's location, E its energy level, N the number of end users (N decides the end-user payment), and P the charging price.
- Action: A = {0, 1, 2} — three actions which, judging from the baseline schemes below, correspond to staying idle, charging, and transferring.

Payment model. Let f(n, l | N) be the probability density of the distance l between the gateway and the n-th nearest of the N end users; the slide gives it in closed form through Beta functions, with terms B(n + 2/3, N − n + 1) and B(N − n + 1, n), a 3/R prefactor, and the normalized distance l³/R³. The expected payment from the n-th user is then

    R(n, E_S) = ∫₀^R̄ f(n, l | N) r(e_D^n) dl + ∫_{R̄}^R f(n, l | N) r(g E_S / l²) dl,

i.e., within a threshold distance R̄ the delivered energy is the fixed amount e_D^n, while beyond it the received energy decays with distance as g E_S / l² (Friis-style path loss). Summing the n-th terms over all n gives the overall payment.

[Slide: the state-transition probability matrices of the formulated MDP, shown abbreviated; e.g., P(A = 1) contains the entries 0.3 and 0.7, while P(A = 0) contains 1.0 and 0.0 — under action 0 the relevant transition is deterministic, under action 1 it splits 0.3/0.7.]

The formulated MDP is then solved via the Bellman equation with the value iteration algorithm. In practice a solver library can do this directly: pymdptoolbox for Python, or mdptoolbox for Matlab.
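A minimal pymdptoolbox sketch; since the energy-gateway matrices are only shown abbreviated, it is fed here with the Left/Right toy MDP from the example section (the package installs as pymdptoolbox and imports as mdptoolbox).

```python
import numpy as np
import mdptoolbox  # pip install pymdptoolbox

# Transitions P[A, S, S'] and rewards R[S, A] for the 4-state Left/Right example.
P = np.array([
    [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],  # action 0: Left
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]],  # action 1: Right
], dtype=float)
R = np.full((4, 2), -1.0)  # -1 for every step moved ...
R[0, :] = 0.0              # ... and 0 in the absorbing goal state

vi = mdptoolbox.mdp.ValueIteration(P, R, discount=0.5)
vi.run()
print(vi.V)       # optimal values U(S)
print(vi.policy)  # optimal action index for each state
```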

Performance is compared against three simpler baselines:
- Greedy scheme (GRDY): maximizing immediate utility.
- Random scheme (RND): randomly taking any action (i.e., 0, 1, 2) from the action set.
- Location-aware scheme (LOCA): charging at a charger, transferring …
