SHANGHAI JIAO TONG UNIVERSITY
Project Title: Playing the Game of Flappy Bird with Deep Reinforcement Learning
Group Number: G-07
Group Members: Wang Wenqing 116032910080, Gao Xiaoning 116032910032, Qian Chen 11603

Playing the Game of Flappy Bird with Deep Reinforcement Learning

Abstract
Letting machines play games has become one of the popular topics in AI today. Playing games with game theory and search algorithms requires specific domain knowledge and lacks scalability. In this project, we use a convolutional neural network to represent the environment of the game and update its parameters with Q-learning, a reinforcement learning algorithm. We call the overall algorithm deep reinforcement learning, or Deep Q-learning Network (DQN). Moreover, we use only the raw images of the game of Flappy Bird as the input of the DQN, which preserves scalability to other games. After training with some tricks, the DQN can greatly outperform human players.

1 Introduction
Flappy Bird has been a popular game around the world in recent years. The player's goal is to guide the bird on screen through the gap between two pipes by tapping the screen. Each tap makes the bird jump up; if the player does nothing, the bird falls at a constant rate. The game is over when the bird crashes into a pipe or the ground, and the score increases by one each time the bird passes through a gap. Figure 1 shows three different states of the bird: (a) the normal flight state, (b) the crash state, and (c) the passing state.

Figure 1: (a) normal flight state (b) crash state (c) passing state

Our goal in this paper is to design an agent that plays Flappy Bird automatically from the same input a human player sees, i.e., we use raw images and rewards to teach the agent how to play the game. Inspired by [1], we propose a deep reinforcement learning architecture to learn and play this game.

In recent years, a huge amount of work has been done on deep learning in computer vision [6]. Deep learning extracts high-dimensional features from raw images, so it is natural to ask whether deep learning can also be used in reinforcement learning. However, there are four challenges in doing so. First, most successful deep learning applications to date have required large amounts of hand-labelled training data. RL
algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. Second, the delay between actions and resulting rewards, which can be thousands of time steps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning. Third, most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviors, which can be problematic for deep learning methods that assume a fixed underlying distribution. This paper demonstrates that a Convolutional Neural Network (CNN) can overcome the challenges mentioned above and learn successful control policies from raw image data in the game Flappy Bird. The network is trained with a variant of the Q-learning algorithm [6]. Using this Deep Q-learning Network (DQN), we construct an agent that makes the right decisions in Flappy Bird based solely on consecutive raw images.

2 Deep Q-learning Network
Recent breakthroughs in computer vision have relied on efficiently training deep neural networks on very large training sets. By feeding sufficient data into deep neural networks, it is often possible to learn better representations than handcrafted features [2][3]. These successes motivate us to connect a reinforcement learning algorithm to a deep neural network that operates directly on raw images and efficiently updates its parameters by stochastic gradient descent. In the following sections, we describe the Deep Q-learning Network algorithm (DQN) and how its model is parameterized.

2.1 Q-learning
2.1.1 Reinforcement Learning Problem
Q-learning is a specific algorithm of reinforcement learning (RL). As Figure 2 shows, an agent interacts with its environment in discrete time steps. At each time step t, the agent receives a state $s_t$ and a reward $r_t$. It then chooses an action $a_t$ from the set of available actions, which is
subsequently sent to the environment. The environment moves to a new state $s_{t+1}$, and the reward $r_{t+1}$ associated with the transition is determined [4].

Figure 2: Traditional Reinforcement Learning scenario

The goal of the agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. Note that in order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize future income), even though the immediate reward associated with an action may be negative [5].
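The interaction loop of Figure 2 can be sketched in a few lines of Python. `ToyEnv` below is a made-up stand-in, not the project's actual game interface; its `step()` returns the next state, the reward, and a terminal flag:

```python
import random

class ToyEnv:
    """Made-up stand-in environment (not the actual game): the episode
    lasts 10 steps and every step pays a reward of +1."""
    def reset(self):
        self.t = 0
        return self.t                          # the state here is just the step count

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 10       # next state, reward, terminal flag

def run_episode(env, policy):
    """One pass of the Figure-2 loop: observe the state, choose an action,
    receive the next state and reward, until the terminal state."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        action = policy(state)                 # the agent's decision rule
        state, reward, done = env.step(action)
        total += reward
    return total

total = run_episode(ToyEnv(), lambda s: random.choice([0, 1]))
```

The agent only sees states and rewards through this interface; everything that follows is about choosing the action inside that loop.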
2.1.2 Q-learning Formulation [6]
In Q-learning, the set of states and actions, together with the rules for transitioning from one state to another, make up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:

$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$

Here $s_i$ represents the state, $a_i$ is the action and $r_{i+1}$ is the reward received after performing the action. The episode ends with the terminal state $s_n$. To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. Define the total future reward from time point t onward as:

$R_t = r_t + r_{t+1} + r_{t+2} + \cdots + r_n$

In order to ensure convergence and balance the immediate reward against future rewards, the total reward must use a discounted future reward:

$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n$

Here $\gamma$ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. Transforming this equation gives the recursive form:

$R_t = r_t + \gamma R_{t+1}$
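The discounted future reward can be evaluated for every time step of an episode with a single backward pass over the reward sequence; a minimal sketch:

```python
def discounted_return(rewards, gamma):
    """Evaluate R_t = r_t + gamma * R_{t+1} for every t with one backward pass."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future   # the recursive definition
        returns[t] = future
    return returns

returns = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

For the three-step episode above, the last step keeps its full reward while earlier steps see the future shrunk by powers of gamma.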
In Q-learning, we define a function $Q(s_t, a_t)$ representing the maximum discounted future reward when we perform action $a_t$ in state $s_t$:

$Q(s_t, a_t) = \max R_{t+1}$

It is called the Q-function because it represents the "quality" of a certain action in a given state. A good strategy for an agent is therefore to always choose the action that maximizes the discounted future reward:

$\pi(s) = \arg\max_a Q(s, a)$

Here $\pi$ represents the policy, the rule by which we choose an action in each state. Given a transition $(s, a, r, s')$, the equation above yields the following Bellman equation: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state:

$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$

The only way to collect information about the environment is by interacting with it. Q-learning is the process of learning the optimal Q-function, which is represented as a table. The overall algorithm is [1]:

Algorithm 1: Q-learning
  Initialize Q[num_states, num_actions] arbitrarily
  Observe initial state s0
  Repeat
    Select and carry out an action a
    Observe reward r and new state s'
    Q[s, a] = Q[s, a] + α (r + γ max_a' Q[s', a'] − Q[s, a])
    s = s'
  Until terminated
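Algorithm 1 translates almost line for line into code. The sketch below uses a dictionary as the Q-table and an ε-greedy action choice; `TwoArmEnv` is a made-up one-step environment for illustration, not part of the project:

```python
import random
from collections import defaultdict

class TwoArmEnv:
    """Made-up one-step task: action 1 pays +1, action 0 pays 0."""
    def reset(self):
        return 0

    def step(self, action):
        return 0, float(action == 1), True     # next state, reward, terminal flag

def q_learning(env, num_actions, episodes, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)                     # "initialize Q arbitrarily" (zeros here)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:          # explore
                a = random.randrange(num_actions)
            else:                              # exploit the current table
                a = max(range(num_actions), key=lambda i: Q[(s, i)])
            s2, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s2, b)] for b in range(num_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # the update rule of Algorithm 1
            s = s2
    return Q
```

After enough episodes the table assigns the rewarding action a clearly higher Q-value, which is exactly the information the policy π exploits.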
2.2 Deep Q-learning Network
In Q-learning, the state space is often too big to fit into main memory: a single binary game frame already induces far more states than any Q-table could hold. What's more, when Q-learning encounters a previously unseen state during training, it can only perform a random action, meaning that it does not generalize. To overcome these two problems, we approximate the Q-table with a convolutional neural network (CNN) [7][8]. This variation of Q-learning is called the Deep Q-learning Network (DQN) [9][10]. After training, the multilayer neural network approaches the traditional optimal Q-table:

$Q(s, a; \theta) \approx Q^*(s, a)$

For playing Flappy Bird, the screenshot $s_t$ is fed into the CNN, and the outputs are the Q-values of the actions, as shown in Figure 3.

Figure 3: In DQN, the CNN's input is the raw game image while its outputs are the Q-values Q(s, a), one output neuron corresponding to one action's Q-value.

To update the CNN's weights, we define the cost function and gradient update as [9][10]:

$L = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$

$\theta \leftarrow \theta - \alpha \nabla_\theta L$

Here $\theta$ are the DQN parameters that get trained, and $\theta^-$ are the non-updated parameters used to compute the target Q-value. During training, this update rule is used to adjust the weights of the CNN.

Meanwhile, obtaining optimal reward in every episode requires balancing exploration of the environment against exploitation of experience. The $\epsilon$-greedy approach achieves this: during training, we select a random action with probability $\epsilon$, and otherwise choose the optimal action $a = \arg\max_{a'} Q(s, a'; \theta)$. The value of $\epsilon$ anneals linearly to zero as the number of updates increases.
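The cost function can be sketched with NumPy. Here `q_func` is an assumed linear stand-in for the CNN (not the project's network), and the batch layout is illustrative:

```python
import numpy as np

def dqn_loss(batch, q_func, theta, theta_minus, gamma=0.99):
    """Squared TD error: (r + gamma * max_a' Q(s',a'; theta^-) - Q(s,a; theta))^2."""
    s, a, r, s2, done = batch
    q_next = q_func(s2, theta_minus)            # frozen parameters give stable targets
    target = r + gamma * (1.0 - done) * q_next.max(axis=1)
    q_sa = q_func(s, theta)[np.arange(len(a)), a]
    return np.mean((target - q_sa) ** 2)

# Illustrative usage with a linear "network" Q(s; W) = s @ W:
q_func = lambda states, W: states @ W
theta = np.eye(2)                               # parameters being trained
theta_minus = np.eye(2)                         # non-updated copy for the target
batch = (np.array([[1.0, 0.0]]),                # s
         np.array([0]),                         # a
         np.array([0.5]),                       # r
         np.array([[0.0, 1.0]]),                # s'
         np.array([0.0]))                       # done flag
loss = dqn_loss(batch, q_func, theta, theta_minus, gamma=0.5)
```

Only `theta` receives gradient updates; `theta_minus` stays fixed between synchronizations, which keeps the regression target from chasing itself.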
2.3 Input Pre-processing
Working directly with raw RGB game frames can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality.

Figure 4: Pre-processing of game frames. First convert the frames to gray images, then down-sample them to a fixed smaller size. Afterwards, convert them to binary images, and finally stack up the last 4 frames as a state.

To improve the accuracy of the convolutional network, the background of the game is removed and substituted with a pure black image to reduce noise.
As Figure 4 shows, the raw game frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to a smaller fixed size. The gray image is then converted to a binary image. In addition, the last 4 game frames are stacked up as one state for the CNN. The current frame is overlapped with the previous frames at slightly reduced intensities, and the intensity decreases as we move farther away from the most recent frame. Thus, the input image gives good information about the trajectory the bird is currently on.
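The pipeline of Figure 4 can be sketched with NumPy alone. The 80×80 target size and the mid-gray threshold below are illustrative assumptions, not the report's exact values:

```python
import numpy as np

def preprocess(frame_rgb, size=80, threshold=127):
    """Gray-scale -> down-sample -> binarize a single RGB frame."""
    gray = frame_rgb.mean(axis=2)                        # average the RGB channels
    h, w = gray.shape
    small = gray[::max(h // size, 1), ::max(w // size, 1)][:size, :size]
    return (small > threshold).astype(np.uint8)          # binary image

def make_state(last4_frames):
    """Stack the last 4 preprocessed frames into one CNN input of shape (4, H, W)."""
    return np.stack(last4_frames, axis=0)
```

Stride-based down-sampling stands in here for proper image resizing; the stacking step is what lets a feed-forward network see motion.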
2.4 Experience Replay and Stability
By now we can estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network.
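Experience replay stores past transitions in a fixed-size memory and draws random minibatches from it for training, which breaks the correlation between consecutive game states (the third challenge noted in the introduction). A minimal sketch of such a buffer, as an assumed typical implementation rather than the project's exact code:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay memory: sampling random minibatches
    decorrelates consecutive transitions before they reach the network."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)   # oldest transitions fall off the end

    def push(self, s, a, r, s2, done):
        self.memory.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```

During play, each transition $(s, a, r, s', \text{done})$ is pushed into the buffer, and gradient updates are computed on sampled minibatches instead of the most recent transition.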