SHANGHAI JIAO TONG UNIVERSITY

Project Title:
Playing the Game of Flappy Bird with Deep Reinforcement Learning

Group Number:
G-07

Group Members:
Wang Wenqing 116032910080
Gao Xiaoning 116032910032
Qian Chen 116032910073
Playing the Game of Flappy Bird with Deep Reinforcement Learning
Abstract
Letting machines play games has been one of the popular topics in AI today. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we utilize a convolutional neural network to represent the environment of games, updating its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning, or Deep Q-learning Network (DQN). Moreover, we use only the raw images of the game of Flappy Bird as the input of DQN, which guarantees scalability to other games. After training with some tricks, DQN can greatly outperform human beings.
1 Introduction
Flappy Bird has been a popular game around the world in recent years. The player's goal is to guide the bird on screen through the gap between two pipes by tapping the screen. If the player taps the screen, the bird jumps up; if the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, while the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: (a) represents the normal flight state, (b) the crash state, and (c) the passing state.
Figure 1: (a) normal flight state (b) crash state (c) passing state
Our goal in this paper is to design an agent that plays Flappy Bird automatically with the same input available to a human player, which means that we use raw images and rewards to teach our agent how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.
In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images. Therefore, it is natural to ask whether deep learning can be used in reinforcement learning. However, there are four challenges in doing so. Firstly, most successful deep learning applications to date have required large amounts of hand-labelled training data; RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. Secondly, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning. The third issue is that most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution.
This paper will demonstrate that a Convolutional Neural Network (CNN) can overcome the challenges mentioned above and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. Using the Deep Q-learning Network (DQN), we construct an agent that makes the right decisions in Flappy Bird based solely on consecutive raw images.
2 Deep Q-learning Network
Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network, which operates directly on raw images and efficiently updates its parameters using stochastic gradient descent.
In the following section, we describe the Deep Q-learning Network (DQN) algorithm and how its model is parameterized.
2.1 Q-learning
2.1.1 Reinforcement Learning Problem
Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time $t$, the agent receives a state $s_t$ and a reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$ is determined [4].
Figure 2: Traditional Reinforcement Learning scenario
The goal of an agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. Note that in order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize the future income), although the immediate reward associated with this might be negative [5].
2.1.2 Q-learning Formulation [6]
In the Q-learning problem, the set of states and actions, together with the rules for transitioning from one state to another, makes up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:

$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$

Here $s_i$ represents the state, $a_i$ is the action and $r_{i+1}$ is the reward after performing the action $a_i$. The episode ends with the terminal state $s_n$. To perform well in the long term, we need to take into account not only the immediate rewards, but also the future rewards we are going to get. Define the total future reward from time point $t$ onward as:

$$R_t = r_t + r_{t+1} + r_{t+2} + \ldots + r_n$$
In order to ensure convergence and balance the immediate reward against future rewards, the total reward must use the discounted future reward:

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-t} r_n$$

Here $\gamma$ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. Transforming this equation gives the recurrence:

$$R_t = r_t + \gamma R_{t+1}$$
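The recurrence above can be sketched in a few lines of code; the reward values and discount factor below are illustrative, not from the report:

```python
# Discounted return: a minimal sketch of the recurrence R_t = r_t + gamma * R_{t+1}.
def discounted_returns(rewards, gamma=0.9):
    """Compute R_t for every time step of one episode, iterating backwards."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future  # R_t = r_t + gamma * R_{t+1}
        returns[t] = future
    return returns

# Example: three steps with rewards 1, 1, 1 and gamma = 0.5
# R_2 = 1, R_1 = 1 + 0.5 * 1 = 1.5, R_0 = 1 + 0.5 * 1.5 = 1.75
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # → [1.75, 1.5, 1.0]
```

Iterating backwards from the terminal step means each $R_t$ is computed in constant time from the already-known $R_{t+1}$.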
In Q-learning, define a function $Q(s, a)$ representing the maximum discounted future reward when we perform action $a$ in state $s$:

$$Q(s_t, a_t) = \max R_{t+1}$$

It is called the Q-function, because it represents the "quality" of a certain action in a given state. A good strategy for an agent would be to always choose the action that maximizes the discounted future reward:

$$\pi(s) = \arg\max_a Q(s, a)$$
Here $\pi$ represents the policy, the rule for how we choose an action in each state. Given a transition $(s, a, r, s')$, the definition above yields the following Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$
The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function $Q(s, a)$, which is stored in a table. Here is the overall Algorithm 1:
Algorithm 1 Q-learning
Initialize Q[num_states, num_actions] arbitrarily
Observe initial state s0
Repeat
    Select and carry out an action a
    Observe reward r and new state s'
    Q[s, a] = Q[s, a] + α (r + γ max_a' Q[s', a'] − Q[s, a])
    s = s'
Until terminated
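Algorithm 1 can be sketched in Python as follows; the tiny chain-walk environment inside the loop is a made-up illustration (action 1 moves one state to the right, action 0 stays put, and only reaching the last state pays a reward), not part of the report:

```python
import random

# A minimal tabular Q-learning sketch of Algorithm 1.
def q_learning(num_states=5, num_actions=2, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.1):
    # Initialize Q[num_states, num_actions] arbitrarily (here: zeros)
    Q = [[0.0] * num_actions for _ in range(num_states)]
    for _ in range(episodes):
        s = 0                                   # observe initial state s0
        while s < num_states - 1:               # state num_states-1 is terminal
            if random.random() < epsilon:       # explore
                a = random.randrange(num_actions)
            else:                               # exploit the current Q-table
                a = max(range(num_actions), key=lambda x: Q[s][x])
            # toy environment step: action 1 moves right, action 0 stays;
            # reward 1 only on reaching the terminal state
            s2 = s + 1 if a == 1 else s
            r = 1.0 if s2 == num_states - 1 else 0.0
            # Q-learning update: Q[s,a] += alpha * (r + gamma * max_a' Q[s',a'] - Q[s,a])
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

After a few hundred episodes the learned table prefers moving right in every state, matching the optimal policy of this toy environment.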
2.2 Deep Q-learning Network
In Q-learning, the state space is often too big to fit into main memory. A game frame of $80 \times 80$ binary pixels has $2^{80 \times 80}$ states, which is impossible to represent with a Q-table. What's more, during training, when it encounters a previously unseen state, Q-learning can only perform a random action, meaning that it does not generalize. To overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called Deep Q-learning Network (DQN) [9][10].
After training the DQN, the multilayer neural network approaches the traditional optimal Q-table as follows:

$$Q(s, a; \theta) \approx Q^*(s, a)$$

As for playing Flappy Bird, the screenshot $s_t$ is input into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3:
Figure 3: In DQN, the CNN's input is the raw game image, while its outputs are the Q-values $Q(s, a)$, one neuron corresponding to one action's Q-value.
In order to update the CNN's weights, we define the cost function and the gradient update as [9][10]:

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$$

$$\theta \leftarrow \theta + \alpha \left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right) \nabla_\theta Q(s, a; \theta)$$

Here, $\theta$ are the DQN parameters that get trained and $\theta^-$ are the non-updated (frozen) parameters used to compute the target Q-value. During training, this update rule is used for the weights of the CNN.
Meanwhile, obtaining the optimal reward in every episode requires a balance between exploring the environment and exploiting experience. The $\epsilon$-greedy approach can achieve this target. During training, we select a random action with probability $\epsilon$, or otherwise choose the optimal action $a^* = \arg\max_a Q(s, a)$. The $\epsilon$ anneals linearly to zero as the number of updates increases.
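The selection rule with linear annealing can be sketched as follows; the schedule constants (`eps_start`, `anneal_steps`) are illustrative assumptions, and `q_values` stands in for the network's output for the current state:

```python
import random

# A sketch of ε-greedy action selection with linearly annealed ε.
def epsilon_greedy(q_values, step, eps_start=1.0, eps_end=0.0, anneal_steps=10000):
    # linearly anneal epsilon from eps_start down to eps_end over anneal_steps updates
    frac = min(step / anneal_steps, 1.0)
    epsilon = eps_start + frac * (eps_end - eps_start)
    if random.random() < epsilon:                      # explore: random action
        return random.randrange(len(q_values))
    # exploit: action with the highest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Early in training (`step` near 0) almost every action is random; once `step` reaches `anneal_steps`, the agent acts purely greedily.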
2.3 Input Pre-processing
Working directly with raw game frames, which are full-resolution RGB images, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality.
Figure 4: Pre-processing of game frames. First convert the frames to gray images, then down-sample them to a specific size. Afterwards, convert them to binary images, and finally stack up the last 4 frames as a state.

In order to improve the accuracy of the convolutional network, the background of the game is removed and substituted with a pure black image to reduce noise. As Figure 4 shows, the raw game frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to an $80 \times 80$ image. Then the gray image is converted to a binary image. In addition, the last 4 game frames are stacked up as one state for the CNN. The current frame is overlapped with the previous frames at slightly reduced intensities, and the intensity decreases as we move farther away from the most recent frame. Thus, the input image gives good information about the trajectory the bird is currently on.
2.4 Experience Replay and Stability
By now, we can estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network.