
SHANGHAI JIAO TONG UNIVERSITY

Project Title:

Playing the Game of Flappy Bird with Deep Reinforcement Learning

Group Number:

G-07

Group Members:

Wang Wenqing 116032910080

Gao Xiaoning 116032910032

Qian Chen 11603

 

Playing the Game of Flappy Bird with Deep Reinforcement Learning

Abstract

Letting machines play games has been one of the popular topics in AI today. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we utilize a convolutional neural network to represent the environment of the game and update its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning, or Deep Q-learning Network (DQN). Moreover, we only use the raw images of the game Flappy Bird as the input to the DQN, which guarantees scalability to other games. After training with some tricks, the DQN can greatly outperform human players.

1 Introduction

Flappy Bird has been a popular game around the world in recent years. The goal of the player is to guide the bird on screen through the gap between two pipes by tapping the screen. If the player taps the screen, the bird jumps up; if the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, while the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: Figure 1(a) represents the normal flight state, (b) the crash state, and (c) the passing state.

Figure 1: (a) normal flight state, (b) crash state, (c) passing state.

Our goal in this paper is to design an agent that plays Flappy Bird automatically with the same input as a human player, which means that we use raw images and rewards to teach our agent how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.

In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images. It is therefore natural to ask whether deep learning can be used in reinforcement learning. However, there are four challenges in doing so. Firstly, most successful deep learning applications to date have required large amounts of hand-labelled training data. RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. Secondly, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning. Thirdly, most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution.

This paper will demonstrate that a Convolutional Neural Network (CNN) can overcome the challenges mentioned above and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. Using the Deep Q-learning Network (DQN), we construct an agent that makes the right decisions in Flappy Bird based solely on consecutive raw images.

2 Deep Q-learning Network

Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network, which operates directly on raw images and efficiently updates its parameters using stochastic gradient descent.

In the following section, we describe the Deep Q-learning Network (DQN) algorithm and how its model is parameterized.

2.1 Q-learning

2.1.1 Reinforcement Learning Problem

Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time $t$, the agent receives a state $s_t$ and a reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$ is determined [4].

Figure 2: Traditional Reinforcement Learning scenario.

The goal of an agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. Note that in order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize the future income), even though the immediate reward associated with them might be negative [5].

2.1.2 Q-learning Formulation [6]

In the Q-learning problem, the set of states and actions, together with rules for transitioning from one state to another, makes up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:

$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$

Here $s_t$ represents the state, $a_t$ is the action and $r_{t+1}$ is the reward after performing the action $a_t$. The episode ends with the terminal state $s_n$. To perform well in the long term, we need to take into account not only the immediate rewards, but also the future rewards we are going to get. Define the total future reward from time point $t$ onward as:

$$R_t = r_t + r_{t+1} + r_{t+2} + \ldots + r_n$$

In order to ensure convergence and to balance the immediate reward against future rewards, the total reward must use a discounted future reward:

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-t} r_n$$

Here $\gamma$ is the discount factor between 0 and 1: the further into the future a reward lies, the less we take it into consideration. Rewriting the equation gives the recursion:

$$R_t = r_t + \gamma R_{t+1}$$
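As a concrete illustration (the numbers are chosen only for this example), with $\gamma = 0.9$ and rewards $r_t = r_{t+1} = r_{t+2} = 1$ followed by nothing, the discounted return is $R_t = 1 + 0.9 + 0.81 = 2.71$, which indeed satisfies the recursion $R_t = r_t + \gamma R_{t+1} = 1 + 0.9 \times 1.9 = 2.71$.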

In Q-learning, define a function $Q(s_t, a_t)$ representing the maximum discounted future reward when we perform action $a_t$ in state $s_t$:

$$Q(s_t, a_t) = \max R_{t+1}$$

It is called the Q-function because it represents the "quality" of a certain action in a given state. A good strategy for an agent would be to always choose the action that maximizes the discounted future reward:

$$\pi(s) = \arg\max_a Q(s, a)$$

Here $\pi$ represents the policy, i.e. the rule by which we choose an action in each state. Given a transition $(s, a, r, s')$, this yields the following Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal function $Q(s, a)$, which is represented as a table. Here is the overall Algorithm 1:

Algorithm 1 Q-learning

Initialize Q[num_states, num_actions] arbitrarily
Observe initial state s0
Repeat
    Select and carry out an action a
    Observe reward r and new state s'
    Q[s, a] = Q[s, a] + α (r + γ max_a' Q[s', a'] − Q[s, a])
    s = s'
Until terminated
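As an illustration of Algorithm 1, the following is a minimal sketch of tabular Q-learning in Python. The environment interface (env.reset(), env.step()) and the hyper-parameters are hypothetical placeholders, not taken from the paper.

```python
import numpy as np

def q_learning(env, num_states, num_actions,
               alpha=0.1, gamma=0.9, epsilon=0.1, episodes=1000):
    """Tabular Q-learning; env is assumed to expose reset() -> state
    and step(action) -> (next_state, reward, done)."""
    Q = np.zeros((num_states, num_actions))   # initialize Q arbitrarily (here: zeros)
    for _ in range(episodes):
        s = env.reset()                       # observe initial state
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)     # observe reward and new state
            # Q-table update from the Bellman equation
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```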

2.2 Deep Q-learning Network

In Q-learning, the state space is often too big to fit into main memory. Even a small binary game frame of $N \times N$ pixels has $2^{N \times N}$ possible states, which is impossible to represent with a Q-table. What is more, during training, when encountering an unseen state, Q-learning can only perform a random action, since the table has no way to generalize from similar states it has already seen. In order to overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called the Deep Q-learning Network (DQN) [9][10].

After training the DQN, a multilayer neural network can approximate the traditional optimal Q-table as follows:

$$Q(s, a; \theta) \approx Q^*(s, a)$$

For playing Flappy Bird, the screenshot $s_t$ is fed into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3:

Figure 3: In DQN, the CNN's input is the raw game image while its outputs are the Q-values $Q(s, a)$, with one output neuron corresponding to one action's Q-value.
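The excerpt above does not list the exact layer configuration, so the following PyTorch sketch only illustrates the idea of Figure 3: a stack of preprocessed game frames goes in, and one Q-value per action comes out. The filter sizes, strides and layer widths are assumptions chosen for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """CNN mapping a stack of 4 preprocessed frames to one Q-value per action."""
    def __init__(self, num_actions=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked frames in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),      # LazyLinear infers the flattened size
            nn.Linear(512, num_actions),        # one output neuron per action's Q-value
        )

    def forward(self, x):                       # x: (batch, 4, H, W) float tensor
        return self.fc(self.conv(x))
```

Flappy Bird has two actions (flap or do nothing), hence num_actions defaults to 2 here.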

In order to update the CNN's weights, define the cost function and gradient update as [9][10]:

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$$

$$\theta \leftarrow \theta + \alpha \left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right) \nabla_\theta Q(s, a; \theta)$$

Here, $\theta$ are the DQN parameters that get trained and $\theta^-$ are the non-updated parameters (the target network) for the Q-value function. During training, this update rule is used to adjust the weights of the CNN.
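A hedged sketch of how this loss and update could be computed with PyTorch; the variable names, the optimizer and the minibatch format are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on L(theta) = E[(r + gamma * max_a' Q(s',a'; theta^-) - Q(s,a; theta))^2]."""
    states, actions, rewards, next_states, dones = batch    # tensors for one minibatch
    # Q(s, a; theta) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # theta^- is held fixed here
        target = rewards + gamma * target_net(next_states).max(dim=1).values * (1 - dones)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```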

Meanwhile, obtaining the optimal reward in every episode requires a balance between exploring the environment and exploiting past experience. The $\epsilon$-greedy approach achieves this: during training, select a random action with probability $\epsilon$, or otherwise choose the optimal action $a = \arg\max_{a'} Q(s, a')$. The value of $\epsilon$ anneals linearly to zero as the number of updates increases.
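A small sketch of $\epsilon$-greedy action selection with linear annealing, under the same assumptions as the previous snippets (the starting value and annealing horizon are placeholders):

```python
import random
import torch

def select_action(q_net, state, step, num_actions=2,
                  eps_start=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon toward zero, then act greedily w.r.t. the Q-network."""
    epsilon = max(0.0, eps_start * (1 - step / anneal_steps))
    if random.random() < epsilon:
        return random.randrange(num_actions)                          # explore
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())    # exploit
```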

2.3 Input Pre-processing

Working directly with raw game frames, which are full-resolution RGB images, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality.

Figure 4: Pre-processing of game frames. First convert the frames to gray-scale images, then down-sample them to a fixed size. Afterwards, convert them to binary images, and finally stack up the last 4 frames as a state.

In order to improve the accuracy of the convolutional network, the background of the game is removed and substituted with pure black to reduce noise. As Figure 4 shows, the raw game frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to a smaller fixed size. Then the gray image is converted to a binary image. In addition, the last 4 game frames are stacked up to form a state for the CNN. The current frame is overlapped with the previous frames with slightly reduced intensities, and the intensity decreases as we move farther away from the most recent frame. Thus, the input image carries good information about the trajectory the bird is currently on.
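The following sketch mirrors the pipeline of Figure 4 using OpenCV. The 80×80 target size and the threshold value are assumptions for illustration, since the exact numbers are not given in the excerpt.

```python
import cv2
import numpy as np

def preprocess(frame, size=80, threshold=1):
    """Convert an RGB game frame to a binary image of shape (size, size)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)                        # RGB -> gray
    small = cv2.resize(gray, (size, size))                                # down-sample
    _, binary = cv2.threshold(small, threshold, 255, cv2.THRESH_BINARY)   # gray -> binary
    return binary.astype(np.float32) / 255.0

def make_state(last_frames):
    """Stack the last 4 preprocessed frames into one (4, size, size) state."""
    return np.stack(last_frames[-4:], axis=0)
```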

2.4 Experience Replay and Stability

By now we can estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network.
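Experience replay, which this section's heading refers to, is typically implemented by storing transitions in a fixed-size buffer and sampling random minibatches from it, which breaks the correlation between consecutive states. A minimal sketch, with capacity and batch size chosen purely as illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions, sampled uniformly."""
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # tuple of lists: states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)
```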
