SHANGHAI JIAO TONG UNIVERSITY
Project Title:
Playing the Game of Flappy Bird with Deep Reinforcement Learning
Group Number:
G-07
Group Members:
Wang Wenqing  116032910080
Gao Xiaoning  116032910032
Qian Chen  116032910073
Playing the Game of Flappy Bird with Deep Reinforcement Learning
Abstract
Letting machines play games has been one of the popular topics in AI today. Using game theory and search algorithms to play games requires specific domain knowledge and lacks scalability. In this project, we use a convolutional neural network to represent the environment of the game, updating its parameters with Q-learning, a reinforcement learning algorithm. We call this overall algorithm deep reinforcement learning, or Deep Q-learning Network (DQN). Moreover, we use only the raw images of the game Flappy Bird as the input of DQN, which guarantees scalability to other games. After training with some tricks, DQN can greatly outperform human players.
1 Introduction
Flappy Bird has been a popular game around the world in recent years. The player's goal is to guide the bird on screen through the gap between two pipes by tapping the screen. If the player taps the screen, the bird jumps up; if the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, while the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: Figure 1(a) shows the normal flight state, (b) the crash state, and (c) the passing state.
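The dynamics described above can be sketched in a few lines. All constants and the collision geometry below are illustrative assumptions, not values from the actual game:

```python
# A minimal one-dimensional sketch of the game dynamics described above.
# JUMP_VELOCITY, FALL_RATE, PIPE_SPEED and the collision test are hypothetical.

JUMP_VELOCITY = -4   # tapping moves the bird up (negative = upward on screen)
FALL_RATE = 1        # with no tap, the bird falls at a constant rate
PIPE_SPEED = 3       # pipes scroll toward the bird each frame

def step(bird_y, tapped, pipe_x, gap_top, gap_bottom, ground_y=100, bird_x=20):
    """Advance one frame; return (new_y, new_pipe_x, score_gain, game_over)."""
    new_y = bird_y + (JUMP_VELOCITY if tapped else FALL_RATE)
    new_pipe_x = pipe_x - PIPE_SPEED
    # game over when the bird hits the ground, or overlaps a pipe outside the gap
    crashed = new_y >= ground_y or (
        abs(new_pipe_x - bird_x) < 2 and not (gap_top <= new_y <= gap_bottom))
    # the score increases by one when the pipe pair has just moved past the bird
    passed = new_pipe_x < bird_x <= pipe_x
    score_gain = 1 if (passed and not crashed) else 0
    return new_y, new_pipe_x, score_gain, crashed
```

The three outcomes correspond to the three states in Figure 1: normal flight (no crash, no score), crash (game over), and passing (score gain).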
Figure 1: (a) normal flight state; (b) crash state; (c) passing state
Our goal in this paper is to design an agent that plays Flappy Bird automatically with the same input available to a human player; that is, we use raw images and rewards to teach our agent how to play this game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.
In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images, so it is natural to ask whether deep learning can also be used in reinforcement learning. However, there are four challenges in using deep learning. First, most successful deep learning applications to date have required large amounts of hand-labelled training data. RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy, and delayed. Second, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting when compared with the direct association between inputs and targets found in supervised learning. Third, most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution.
This paper will demonstrate that a Convolutional Neural Network (CNN) can overcome the challenges mentioned above and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. Using the Deep Q-learning Network (DQN), we build an agent that makes the right decisions in Flappy Bird based solely on consecutive raw images.
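The decision step just described, choosing an action from raw frames via estimated Q-values, can be sketched as follows. The linear `q_values` function is a stand-in for the convolutional network, and the epsilon-greedy rule is a common exploration scheme; both are illustrative assumptions here:

```python
import numpy as np

ACTIONS = [0, 1]  # 0 = do nothing, 1 = flap (hypothetical encoding)

def q_values(state, weights):
    # stand-in for the CNN: a linear map from flattened pixels to one
    # Q-value estimate per action
    return state.ravel() @ weights

def select_action(state, weights, epsilon, rng):
    # epsilon-greedy: explore with probability epsilon, otherwise pick the
    # action with the highest estimated Q-value
    if rng.random() < epsilon:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(q_values(state, weights)))
```

With `epsilon = 0` the agent is purely greedy; during training, epsilon is typically decayed from a large value toward a small one.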
2 Deep Q-learning Network
Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network, which operates directly on raw images and efficiently updates its parameters using stochastic gradient descent.
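A stochastic-gradient parameter update of the kind mentioned above can be sketched in a few lines; the quadratic least-squares loss and the data here are purely illustrative, not the network's actual objective:

```python
import numpy as np

# One SGD step: move the parameters against the gradient of the loss.
def sgd_step(weights, grad, lr=0.01):
    return weights - lr * grad

# Example: one update on a least-squares loss L(w) = 0.5 * ||X w - y||^2
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, -1.0])
w = np.zeros(2)
grad = X.T @ (X @ w - y)   # gradient of the loss at the current w
w = sgd_step(w, grad, lr=0.5)
```

In DQN the same update rule is applied to the network weights, with the gradient of the Q-learning loss computed on a minibatch of transitions.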
In the following section, we describe the Deep Q-learning Network (DQN) algorithm and how its model is parameterized.
2.1 Q-learning
2.1.1 Reinforcement Learning Problem
Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time t, the agent receives a state s_t and a reward r_t. It then chooses an action a_t from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state s_{t+1}, and the reward r_{t+1} associated with the transition (s_t, a_t, s_{t+1}) is determined [4].
Figure 2: Traditional reinforcement learning scenario
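The interaction loop in Figure 2 can be sketched as follows; `env` and `policy` are hypothetical stand-ins for the real game and the learned agent:

```python
# Sketch of the agent-environment interaction over discrete time steps.
# env.reset() returns an initial state; env.step(action) returns
# (next_state, reward, done) — an assumed interface, not a fixed API.

def run_episode(env, policy):
    """Run one episode; return the total reward collected."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                  # agent chooses a_t from s_t
        state, reward, done = env.step(action)  # environment returns s_{t+1}, r_{t+1}
        total_reward += reward
    return total_reward
```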
The goal of an agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. Note that in order to