
SHANGHAI JIAO TONG UNIVERSITY

Project Title:

Playing the Game of Flappy Bird with Deep Reinforcement Learning

Group Number:

G-07

Group Members:

Wang Wenqing 116032910080

Gao Xiaoning 116032910032

Qian Chen 11603

 

Playing the Game of Flappy Bird with Deep Reinforcement Learning

Abstract

Letting machines play games has been one of the popular topics in AI today. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we utilize a convolutional neural network to represent the environment of the game and update its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning, or Deep Q-learning Network (DQN). Moreover, we only use the raw images of the game Flappy Bird as the input to the DQN, which guarantees scalability to other games. After training with some tricks, the DQN can greatly outperform human players.

1 Introduction

Flappy Bird has been a popular game around the world in recent years. The goal of the player is to guide the bird on screen through the gap between two pipes by tapping the screen. If the player taps the screen, the bird jumps up; if the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, while the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: Figure 1(a) represents the normal flight state, (b) the crash state, and (c) the passing state.

Figure 1: (a) normal flight state, (b) crash state, (c) passing state.

Our goal in this paper is to design an agent that plays Flappy Bird automatically with the same input as a human player, which means that we use raw images and rewards to teach our agent how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.

In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images. It is therefore natural to ask whether deep learning can be used in reinforcement learning. However, there are four challenges in doing so. Firstly, most successful deep learning applications to date have required large amounts of hand-labelled training data. RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. Secondly, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning. Thirdly, most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution.

This paper will demonstrate that a Convolutional Neural Network (CNN) can overcome the challenges mentioned above and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. Using the Deep Q-learning Network (DQN), we construct an agent that makes the right decisions in Flappy Bird based solely on consecutive raw images.

2 Deep Q-learning Network

Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network, which operates directly on raw images and efficiently updates its parameters using stochastic gradient descent.

In the following section, we describe the Deep Q-learning Network (DQN) algorithm and how its model is parameterized.

2.1 Q-learning

2.1.1 Reinforcement Learning Problem

Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time $t$, the agent receives a state $s_t$ and a reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$ is determined [4].

Figure 2: Traditional Reinforcement Learning scenario.

The goal of an agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. Note that in order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize the future income), even though the immediate reward associated with them might be negative [5].

2.1.2 Q-learning Formulation [6]

In the Q-learning problem, the set of states and actions, together with rules for transitioning from one state to another, makes up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:

$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$

Here $s_t$ represents the state, $a_t$ is the action and $r_{t+1}$ is the reward after performing the action $a_t$. The episode ends with the terminal state $s_n$. To perform well in the long term, we need to take into account not only the immediate rewards, but also the future rewards we are going to get. Define the total future reward from time point $t$ onward as:

$$R_t = r_t + r_{t+1} + r_{t+2} + \ldots + r_n$$

In order to ensure convergence and to balance the immediate reward against future rewards, the total reward must use a discounted future reward:

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-t} r_n$$

Here $\gamma$ is the discount factor between 0 and 1: the further into the future a reward lies, the less we take it into consideration. Rewriting the equation gives the recursion:

$$R_t = r_t + \gamma R_{t+1}$$
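As a concrete illustration (the numbers are chosen only for this example), with $\gamma = 0.9$ and rewards $r_t = r_{t+1} = r_{t+2} = 1$ followed by nothing, the discounted return is $R_t = 1 + 0.9 + 0.81 = 2.71$, which indeed satisfies the recursion $R_t = r_t + \gamma R_{t+1} = 1 + 0.9 \times 1.9 = 2.71$.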

In Q-learning, define a function $Q(s_t, a_t)$ representing the maximum discounted future reward when we perform action $a_t$ in state $s_t$:

$$Q(s_t, a_t) = \max R_{t+1}$$

It is called the Q-function because it represents the "quality" of a certain action in a given state. A good strategy for an agent would be to always choose the action that maximizes the discounted future reward:

$$\pi(s) = \arg\max_a Q(s, a)$$

Here $\pi$ represents the policy, i.e. the rule by which we choose an action in each state. Given a transition $(s, a, r, s')$, this yields the following Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function $Q(s, a)$, which is represented as a table. Here is the overall Algorithm 1:

Algorithm 1 Q-learning

Initialize Q[num_states, num_actions] arbitrarily
Observe initial state s0
Repeat
    Select and carry out an action a
    Observe reward r and new state s'
    Q[s, a] = Q[s, a] + α (r + γ max_a' Q[s', a'] − Q[s, a])
    s = s'
Until terminated
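As an illustration of Algorithm 1, the following is a minimal sketch of tabular Q-learning in Python. The environment interface (env.reset(), env.step()) and the hyper-parameters are hypothetical placeholders, not taken from the paper.

```python
import numpy as np

def q_learning(env, num_states, num_actions,
               alpha=0.1, gamma=0.9, epsilon=0.1, episodes=1000):
    """Tabular Q-learning; env is assumed to expose reset() -> state
    and step(action) -> (next_state, reward, done)."""
    Q = np.zeros((num_states, num_actions))   # initialize Q arbitrarily (here: zeros)
    for _ in range(episodes):
        s = env.reset()                       # observe initial state
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)     # observe reward and new state
            # Q-table update from the Bellman equation
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```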

2.2 Deep Q-learning Network

In Q-learning, the state space is often too big to fit into main memory. Even a small binary game frame of $N \times N$ pixels has $2^{N \times N}$ possible states, which is impossible to represent with a Q-table. What is more, during training, when encountering an unseen state, Q-learning can only perform a random action, since the table has no way to generalize from similar states it has already seen. In order to overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called the Deep Q-learning Network (DQN) [9][10].

After training the DQN, a multilayer neural network can approximate the traditional optimal Q-table as follows:

$$Q(s, a; \theta) \approx Q^*(s, a)$$

For playing Flappy Bird, the screenshot $s_t$ is fed into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3:

Figure 3: In DQN, the CNN's input is the raw game image while its outputs are the Q-values $Q(s, a)$, with one output neuron corresponding to one action's Q-value.
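The excerpt above does not list the exact layer configuration, so the following PyTorch sketch only illustrates the idea of Figure 3: a stack of preprocessed game frames goes in, and one Q-value per action comes out. The filter sizes, strides and layer widths are assumptions chosen for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """CNN mapping a stack of 4 preprocessed frames to one Q-value per action."""
    def __init__(self, num_actions=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked frames in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),      # LazyLinear infers the flattened size
            nn.Linear(512, num_actions),        # one output neuron per action's Q-value
        )

    def forward(self, x):                       # x: (batch, 4, H, W) float tensor
        return self.fc(self.conv(x))
```

Flappy Bird has two actions (flap or do nothing), hence num_actions defaults to 2 here.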

In order to update the CNN's weights, define the cost function and gradient update as [9][10]:

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$$

$$\theta \leftarrow \theta + \alpha \left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right) \nabla_\theta Q(s, a; \theta)$$

Here, $\theta$ are the DQN parameters that get trained and $\theta^-$ are the non-updated parameters (the target network) for the Q-value function. During training, this update rule is used to adjust the weights of the CNN.
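A hedged sketch of how this loss and update could be computed with PyTorch; the variable names, the optimizer and the minibatch format are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on L(theta) = E[(r + gamma * max_a' Q(s',a'; theta^-) - Q(s,a; theta))^2]."""
    states, actions, rewards, next_states, dones = batch    # tensors for one minibatch
    # Q(s, a; theta) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # theta^- is held fixed here
        target = rewards + gamma * target_net(next_states).max(dim=1).values * (1 - dones)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```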

Meanwhile, obtaining the optimal reward in every episode requires a balance between exploring the environment and exploiting past experience. The $\epsilon$-greedy approach achieves this: during training, select a random action with probability $\epsilon$, or otherwise choose the optimal action $a = \arg\max_{a'} Q(s, a')$. The value of $\epsilon$ anneals linearly to zero as the number of updates increases.
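A small sketch of $\epsilon$-greedy action selection with linear annealing, under the same assumptions as the previous snippets (the starting value and annealing horizon are placeholders):

```python
import random
import torch

def select_action(q_net, state, step, num_actions=2,
                  eps_start=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon toward zero, then act greedily w.r.t. the Q-network."""
    epsilon = max(0.0, eps_start * (1 - step / anneal_steps))
    if random.random() < epsilon:
        return random.randrange(num_actions)                          # explore
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())    # exploit
```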

2.3 Input Pre-processing

Working directly with raw game frames, which are full-resolution RGB images, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality.

Figure 4: Pre-processing of game frames. First convert the frames to gray-scale images, then down-sample them to a fixed size. Afterwards, convert them to binary images, and finally stack up the last 4 frames as a state.

In order to improve the accuracy of the convolutional network, the background of the game is removed and substituted with pure black to reduce noise. As Figure 4 shows, the raw game frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to a smaller fixed size. Then the gray image is converted to a binary image. In addition, the last 4 game frames are stacked up to form a state for the CNN. The current frame is overlapped with the previous frames with slightly reduced intensities, and the intensity decreases as we move farther away from the most recent frame. Thus, the input image carries good information about the trajectory the bird is currently on.
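The following sketch mirrors the pipeline of Figure 4 using OpenCV. The 80×80 target size and the threshold value are assumptions for illustration, since the exact numbers are not given in the excerpt.

```python
import cv2
import numpy as np

def preprocess(frame, size=80, threshold=1):
    """Convert an RGB game frame to a binary image of shape (size, size)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                        # RGB -> gray
    small = cv2.resize(gray, (size, size))                                # down-sample
    _, binary = cv2.threshold(small, threshold, 255, cv2.THRESH_BINARY)   # gray -> binary
    return binary.astype(np.float32) / 255.0

def make_state(last_frames):
    """Stack the last 4 preprocessed frames into one (4, size, size) state."""
    return np.stack(last_frames[-4:], axis=0)
```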

2.4 Experience Replay and Stability

By now we can estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network.
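Experience replay, which this section's heading refers to, is typically implemented by storing transitions in a fixed-size buffer and sampling random minibatches from it, which breaks the correlation between consecutive states. A minimal sketch, with capacity and batch size chosen purely as illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions, sampled uniformly."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # tuple of lists: states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)
```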
