Reinforcement Learning (RL)
- is the science of decision making
- is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize some notion of cumulative reward
- almost all RL problems can be formulated as a Markov Decision Process (MDP)
RL - Learning Paradigms
- there is no supervisor, only a scalar reward signal
- feedback may not be instantaneous (i.e. delayed)
- time-series related (sequential, not i.i.d. data)
- agent's action affects the subsequent data it receives
RL - Components
rewards
- a reward R_t is a scalar feedback signal
- indicates how well the agent is doing at time-step t
- the agent's job is to maximize cumulative reward
RL is based on the reward hypothesis - all goals can be described by the maximization of expected cumulative reward
sequential decision making
- the goal is to select actions to maximize total future reward
- actions may have long-term consequences
- the reward may not be instantaneous (i.e. delayed)
- it may be better to sacrifice immediate reward in order to gain more long-term reward (see the sketch below)
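a minimal sketch of computing a discounted cumulative reward (the discount factor gamma is an assumption, not defined in these notes):

```python
def discounted_return(rewards, gamma=0.99):
    """Return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    for a finite sequence of future rewards [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

# Sacrificing immediate reward can pay off: the second reward sequence gives up
# reward now but collects more later, and ends up with the higher return.
print(discounted_return([1.0, 0.0, 0.0, 0.0]))   # ~1.0
print(discounted_return([0.0, 0.0, 0.0, 10.0]))  # ~9.7
```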
at each time-step t the agent:
- receives reward R_t
- receives observation O_t
- takes an action A_t
history
- is the sequence of rewards, observations, and actions from time-step 1 to t
- H_{1:t} = [R_1, O_1, A_1, R_2, O_2, A_2, ..., R_t, O_t, A_t]
what happens next depends on the history
- agent selects actions
- the environment selects observations & rewards
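a rough sketch of the interaction loop and history accumulation above; the `env` object with `reset()`/`step(action)` methods is a hypothetical interface, not a real library API:

```python
def run_episode(env, select_action, num_steps=100):
    """Minimal agent-environment loop. `env` is a hypothetical object with
    reset() -> (reward, observation) and step(action) -> (reward, observation)."""
    history = []                                       # H_{1:t} = [R_1, O_1, A_1, ..., R_t, O_t, A_t]
    reward, observation = env.reset()
    for _ in range(num_steps):
        action = select_action(observation)            # agent picks A_t
        history.extend([reward, observation, action])  # record R_t, O_t, A_t
        reward, observation = env.step(action)         # environment emits R_{t+1}, O_{t+1}
    return history
```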
state
- is the summary of history used to determine what happens next
- is a function of history:
- S_t = f(H_{1:t})
2 state types:
- environment state S_t^e
- is the environment's internal state representation
- is whatever data the environment uses to pick the next observation & reward
- has the Markov property
- agent state S_t^a
- is the agent's internal state representation
- is whatever data the agent uses to pick the next action
- is the information used by RL algorithms
- it can be any function of history
- S_t^a = f(H_{1:t})
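a sketch of one possible agent state function f; the choice of "last k observations" is purely illustrative:

```python
from collections import deque

def make_state_fn(k=4):
    """One illustrative f: the agent state is the last k observations.
    Any other function of the history (e.g. a recurrent network's hidden state)
    would also be a valid agent state."""
    recent = deque(maxlen=k)

    def f(observation):
        recent.append(observation)
        return tuple(recent)          # S_t^a = (O_{t-k+1}, ..., O_t)

    return f

state_fn = make_state_fn(k=4)
for obs in [0.1, 0.2, 0.3, 0.4, 0.5]:
    agent_state = state_fn(obs)
print(agent_state)                    # (0.2, 0.3, 0.4, 0.5)
```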
state with Markov property
a state S_t has the Markov property iff: P(S_{t+1} | S_1, ..., S_t) = P(S_{t+1} | S_t)
- the entire history from time 1 to t (i.e. H_{1:t}) has the Markov property
information/markov state:
- has the Markov property
- contains all useful information from the history H_{1:t}
- once the information state is known, the history is no longer needed
- is a sufficient statistic that can be used in determining the future
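a toy sketch of the Markov property; the two-state weather chain and its transition probabilities are made up for illustration:

```python
import random

# Made-up two-state Markov chain: P(S_{t+1} | S_t) given as rows of a table.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    """Sampling the next state needs only the current state, not the history:
    P(S_{t+1} | S_1, ..., S_t) = P(S_{t+1} | S_t)."""
    states, probs = zip(*TRANSITIONS[current].items())
    return random.choices(states, weights=probs)[0]

state = "sunny"
for _ in range(5):
    state = next_state(state)   # each step forgets everything except the current state
```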
Environment Types | Description
---|---
Fully Observable Environment | the agent directly observes the environment state: O_t = S_t^a = S_t^e; formally a Markov Decision Process (MDP)
Partially Observable Environment | the agent only indirectly observes the environment, so S_t^a ≠ S_t^e; formally a Partially Observable MDP (POMDP); the agent must construct its own state representation S_t^a
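a small sketch of the distinction; the environment with internal state (position, velocity) is a made-up example:

```python
class ToyEnv:
    """Made-up environment whose internal (environment) state is (position, velocity)."""

    def __init__(self):
        self.position, self.velocity = 0.0, 1.0

    def step(self, action):
        self.velocity += action
        self.position += self.velocity
        reward = -abs(self.position)
        full_obs = (self.position, self.velocity)   # fully observable: O_t = S_t^e
        partial_obs = self.position                 # partially observable: O_t reveals only
                                                    # part of S_t^e; the agent must build S_t^a
        return reward, full_obs, partial_obs
```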
RL Agent Components
component | description
---|---
policy | a function that tells what action the agent should take in a given state
value function | a function that tells how good each state and/or action is
model | the agent's representation of the environment
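a sketch of the three components as minimal Python interfaces; the class names and tabular representations are illustrative assumptions:

```python
class Policy:
    """pi: maps the agent's state to an action (here a simple lookup table)."""
    def __init__(self, action_for_state):
        self.action_for_state = action_for_state    # e.g. {"s0": "right", "s1": "left"}

    def act(self, state):
        return self.action_for_state[state]

class ValueFunction:
    """V(s): predicted future reward from state s (tabular here)."""
    def __init__(self, values=None):
        self.values = values or {}

    def value(self, state):
        return self.values.get(state, 0.0)

class Model:
    """The agent's own guess at how the environment behaves: predicted next
    state and predicted reward for each (state, action) pair."""
    def __init__(self):
        self.next_state = {}    # (state, action) -> predicted next state
        self.reward = {}        # (state, action) -> predicted reward

    def predict(self, state, action):
        return self.next_state.get((state, action)), self.reward.get((state, action))
```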
RL Agent - Types
containing value function and/or policy function:
- value-based - an agent that stores the value function (the policy is implicit: just read it out of the value function)
- policy-based - an agent that stores the policy (no value function)
- actor-critic - an agent that stores both the policy and the value function
containing a model of the environment:
- model-free - the agent uses a policy and/or value function, but no model of the environment
- model-based - the agent uses a policy and/or value function, plus a model of the environment
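a sketch of how a value-based agent reads its implicit policy out of a stored action-value table Q(s, a); the tabular Q and greedy readout are illustrative assumptions:

```python
def greedy_action(q_table, state, actions):
    """A value-based agent stores only Q(s, a); its policy is implicit:
    pick the action with the highest stored value in the current state."""
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

# hypothetical learned values for one state
q_table = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
print(greedy_action(q_table, "s0", ["left", "right"]))   # "right"
```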
RL - Dichotomies
Dichotomy | Description
---|---
Reinforcement Learning vs Planning | reinforcement learning: the environment is initially unknown, the agent interacts with it and improves its policy; planning: a model of the environment is known, the agent computes with that model (without external interaction) and improves its policy
Exploration vs Exploitation | exploration finds more information about the environment; exploitation uses known information to maximize reward; an agent usually needs to do both
Prediction vs Control | prediction: evaluate the future for a given policy; control: find the best policy; in RL you solve the prediction problem in order to solve the control problem
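a sketch of one common way to trade off exploration against exploitation, epsilon-greedy action selection; the epsilon value and tabular Q are assumptions:

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action, gather new information);
    otherwise exploit (take the action currently believed to be best)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```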
RL - Other
- AlphaGo Fan/Lee/Master/Zero
- Deep Q Networks (DQN)
- Multi/K-Armed Bandit Problem
- Policy Gradient Methods
- Proximal Policy Optimization (PPO)
- Q-Function
- Q-Learning
- Reinforcement Learning from Human Feedback (RLHF)
- RL - Applications
- RL - Example (Tic-Tac-Toe)
- RL - Human Priors for Playing Video Games
- Selective Bootstrap Adaptation