Welcome back to day three of Introduction to Deep Learning. Today is an exciting day, as we will explore the fusion of two important topics in this field: reinforcement learning and deep learning. This marriage of the two moves away from the traditional paradigm of deep learning, where models are trained on fixed datasets, and instead focuses on problems where the model explores and interacts with its environment dynamically. The goal is to build models that can improve in these scenarios and environments without any human supervision.
Reinforcement learning is different from the other topics we have covered in this class. In supervised learning, which we covered on day one and in the first lecture of day two, the input data (X) and output labels (Y) are given, and the goal is to learn a functional mapping from X to Y. In reinforcement learning, the data does not come as input-label pairs; instead, it comes in the form of states, actions, and rewards gathered as the agent interacts with its world. The agent observes the state of the environment and takes actions based on that state. The agent's goal is to maximize the rewards it obtains over many time steps into the future.
Before we dive deeper into reinforcement learning, it's important to understand the terminology associated with this new type of learning problem. An agent is something that can take actions, such as a drone delivering packages or Super Mario in a video game. The environment is the world in which the agent lives and takes actions. The set of all possible actions the agent could take is denoted A. Observations are how the environment passes information back to the agent, and a single state is the immediate situation in which the agent finds itself. The reward is the feedback the environment provides to measure the success or failure of the agent's action at that time step.
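To make this terminology concrete, here is a minimal sketch of the agent-environment loop in Python. The toy LineWorld environment and the random agent below are hypothetical stand-ins invented for illustration, not anything from the lecture.

```python
import random

class LineWorld:
    """Toy environment: the agent walks along a line and is rewarded for reaching position 5."""
    def __init__(self):
        self.position = 0                      # the state: where the agent currently is

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is drawn from the action set A = {-1, +1} (move left or move right)
        self.position += action
        reward = 1.0 if self.position == 5 else 0.0   # feedback from the environment
        done = self.position == 5
        return self.position, reward, done            # observation, reward, episode over?

# At every time step the agent observes the state, takes an action, and receives a reward.
env = LineWorld()
state = env.reset()
total_reward = 0.0
for t in range(50):
    action = random.choice([-1, +1])           # a very naive agent: pick actions at random
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print("total reward collected:", total_reward)
```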
Rewards can be immediate or delayed. For example, in a video game, when Mario touches a gold coin he gets an immediate reward, but an action can also lead to a reward that arrives much later, and it is just as important to account for it. The total reward, or return, at a given time point is the sum of all the rewards the agent obtains from that point onward, and the discounted return is the same sum with each reward multiplied by a discount factor that dampens the effect of rewards the further in the future they arrive.
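In symbols, the return from time t is R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..., where the discount factor gamma is a number between 0 and 1. Here is a small sketch of that computation; the reward values and gamma below are made up purely for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of future rewards, each damped by gamma raised to how far in the future it arrives."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Rewards the agent will receive from the current time step onward (illustrative values).
future_rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
print(discounted_return(future_rewards, gamma=0.9))   # delayed rewards count for less
print(discounted_return(future_rewards, gamma=1.0))   # gamma = 1 recovers the plain total reward
```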
The Q function is a critical function in reinforcement learning. It takes the current state and a possible action as inputs and returns the expected total future reward, the return, that the agent can receive from that point onward if it takes that action. The agent can use the Q function to determine the optimal action to take in a given state.
In reinforcement learning, the agent's goal is to learn a policy function that takes the current state as input and outputs the optimal action to take in that state. Given the Q function, one way to obtain such a policy is to evaluate the Q function for every possible action in the current state and pick the action with the highest value, as shown in the sketch below.
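A minimal sketch of acting greedily with respect to the Q function follows; the q_function defined here is a hypothetical stand-in for a learned Q function, used only for illustration.

```python
import numpy as np

def greedy_policy(state, q_function, actions):
    """pi(s) = argmax over a of Q(s, a): pick the action with the highest expected return."""
    q_values = [q_function(state, a) for a in actions]
    return actions[int(np.argmax(q_values))]

# Hypothetical stand-in for a learned Q function (higher when state + action is closer to zero).
def q_function(state, action):
    return -abs(state + action)

print(greedy_policy(state=3, q_function=q_function, actions=[-1, 0, +1]))   # prints -1
```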
There are two broad categories of algorithms for solving reinforcement learning problems: value learning algorithms and policy learning algorithms. Value learning algorithms focus on learning the Q function, while policy learning algorithms focus on learning the policy function directly.
In value learning, the agent aims to learn the Q function, which represents the expected total future reward for taking a particular action in a particular state, and then acts by choosing the action with the highest Q value. In policy learning, the agent aims to learn the policy function directly: a function that takes the current state as input and outputs the action to take in that state.
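The practical difference shows up in what we ask a neural network to output. Below is a sketch of that contrast, assuming TensorFlow/Keras; the layer sizes and the two-action, four-dimensional-state setup are illustrative assumptions, not the lecture's exact architecture.

```python
import tensorflow as tf

n_actions = 2   # e.g. move left, move right
state_dim = 4   # illustrative size of a state vector

# Value learning: the network takes a state and outputs one Q value per possible action.
q_network = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_actions),                         # Q(s, a) for each action a
])

# Policy learning: the network takes a state and outputs a probability for each action.
policy_network = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_actions, activation="softmax"),   # pi(a | s)
])

state = tf.random.uniform((1, state_dim))   # a made-up state, just to show the output shapes
print("Q values:            ", q_network(state).numpy())
print("action probabilities:", policy_network(state).numpy())
```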
To illustrate these concepts, let's consider the Atari Breakout game. The agent is the paddle at the bottom of the screen, and the environment is the two-dimensional world containing the ball and the colored blocks along the top. The agent can take two actions: move left or move right. The objective of the game is to break all of the colored blocks at the top by bouncing the ball into them.
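If you want to poke at this environment yourself, here is a minimal sketch assuming the Gymnasium package with its Atari environments installed; that choice of library is an assumption on my part, not a requirement of the course, and note that the packaged Breakout environment exposes a couple of extra actions (such as NOOP and FIRE) beyond moving left and right.

```python
import gymnasium as gym

# Assumes Gymnasium with the Atari environments (ale-py) installed.
env = gym.make("ALE/Breakout-v5")
observation, info = env.reset()

total_reward = 0.0
for t in range(1000):
    action = env.action_space.sample()   # a random agent: no learning yet
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward               # points earned from blocks broken at this step
    if terminated or truncated:
        break

print("score of the random agent:", total_reward)
env.close()
```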
The Q function for this game tells us the expected total return for a given state and action. For example, if the ball is falling straight onto the paddle, the optimal action is to reflect it back up, which yields a high expected return. If instead the ball is coming in at an angle and nearly missing the paddle, the optimal action might be to move toward the ball and return it with more velocity, which can break off more blocks and earn more points.
In conclusion, reinforcement learning is a new type of learning problem that involves an agent interacting with an environment to maximize rewards over time. The agent's goal is to learn a policy function that can take the current state as input and output the optimal action to take in that state. There are two broad categories of algorithms for solving reinforcement learning problems: value learning algorithms and policy learning algorithms. Understanding the terminology and concepts associated with reinforcement learning is crucial for success in this field.