Reinforcement Learning (RL)
- is the science of decision making
- is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize some notion of cumulative reward
- almost all RL problems can be formulated as a Markov Decision Process (MDP)
RL - Learning Paradigms
- there is no supervisor, only a scalar reward signal
- feedback may not be instantaneous (i.e. delayed)
- time-series related (sequential, not i.i.d. data)
- agent's action affects the subsequent data it receives
RL - Components
rewards
- a reward R_t is a scalar feedback signal
- indicates how well the agent is doing at time-step t
- the agent's job is to maximize cumulative reward
RL is based on the reward hypothesis - all goals can be described by the maximization of expected cumulative reward
sequential decision making
- the goal is to select actions to maximize total future reward
- actions may have long-term consequences
- the reward may not be instantaneous (i.e. delayed)
- it may be better to sacrifice immediate reward in order to gain more long-term reward (see the sketch below)
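a minimal sketch of computing a discounted cumulative reward (the discount factor gamma is an assumption, not defined in these notes):

```python
def discounted_return(rewards, gamma=0.99):
    """Return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    for a finite sequence of future rewards [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

# Sacrificing immediate reward can pay off: the second reward sequence gives up
# reward now but collects more later, and ends up with the higher return.
print(discounted_return([1.0, 0.0, 0.0, 0.0]))   # ~1.0
print(discounted_return([0.0, 0.0, 0.0, 10.0]))  # ~9.7
```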
at each time-step t the agent:
- receives reward R_t
- receives observation O_t
- takes an action A_t
history
- is the sequence of rewards, observations, and actions from time-step 1 to t
- H_{1:t} = [R_1, O_1, A_1, R_2, O_2, A_2, ..., R_t, O_t, A_t]
what happens next depends on the history
- agent selects actions
- the environment selects observations & rewards
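a rough sketch of the interaction loop and history accumulation above; the `env` object with `reset()`/`step(action)` methods is a hypothetical interface, not a real library API:

```python
def run_episode(env, select_action, num_steps=100):
    """Minimal agent-environment loop. `env` is a hypothetical object with
    reset() -> (reward, observation) and step(action) -> (reward, observation)."""
    history = []                                       # H_{1:t} = [R_1, O_1, A_1, ..., R_t, O_t, A_t]
    reward, observation = env.reset()
    for _ in range(num_steps):
        action = select_action(observation)            # agent picks A_t
        history.extend([reward, observation, action])  # record R_t, O_t, A_t
        reward, observation = env.step(action)         # environment emits R_{t+1}, O_{t+1}
    return history
```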
state
- is the summary of history used to determine what happens next
- is a function of history:
- S_t = f(H_{1:t})
2 state types:
- environment state S_t^e
- is the environment's internal state representation
- is whatever data the environment uses to pick the next observation & reward
- has the Markov property
- agent state S_t^a
- is the agent's internal state representation
- is whatever data the agent uses to pick the next action
- is the information used by RL algorithms
- it can be any function of history
- S_t^a = f(H_{1:t})
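a sketch of one possible agent state function f; the choice of "last k observations" is purely illustrative:

```python
from collections import deque

def make_state_fn(k=4):
    """One illustrative f: the agent state is the last k observations.
    Any other function of the history (e.g. a recurrent network's hidden state)
    would also be a valid agent state."""
    recent = deque(maxlen=k)

    def f(observation):
        recent.append(observation)
        return tuple(recent)          # S_t^a = (O_{t-k+1}, ..., O_t)

    return f

state_fn = make_state_fn(k=4)
for obs in [0.1, 0.2, 0.3, 0.4, 0.5]:
    agent_state = state_fn(obs)
print(agent_state)                    # (0.2, 0.3, 0.4, 0.5)
```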
state with Markov property
a state S_t has the Markov property iff: P(S_{t+1} | S_1, ..., S_t) = P(S_{t+1} | S_t)
- the entire history from time 1 to t (i.e. H_{1:t}) has the Markov property
information/markov state:
- has the Markov property
- contains all useful information from the history H_{1:t}
- once the information state is known, the history is no longer needed
- is a sufficient statistic that can be used in determining the future
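a toy sketch of the Markov property; the two-state weather chain and its transition probabilities are made up for illustration:

```python
import random

# Made-up two-state Markov chain: P(S_{t+1} | S_t) given as rows of a table.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    """Sampling the next state needs only the current state, not the history:
    P(S_{t+1} | S_1, ..., S_t) = P(S_{t+1} | S_t)."""
    states, probs = zip(*TRANSITIONS[current].items())
    return random.choices(states, weights=probs)[0]

state = "sunny"
for _ in range(5):
    state = next_state(state)   # each step forgets everything except the current state
```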
Environment Types | Description
---|---
Fully Observable Environment | the agent directly observes the environment state: O_t = S_t^a = S_t^e; formally a Markov Decision Process (MDP)
Partially Observable Environment | the agent only indirectly observes the environment, so S_t^a ≠ S_t^e; formally a Partially Observable MDP (POMDP); the agent must construct its own state representation S_t^a
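a small sketch of the distinction; the environment with internal state (position, velocity) is a made-up example:

```python
class ToyEnv:
    """Made-up environment whose internal (environment) state is (position, velocity)."""

    def __init__(self):
        self.position, self.velocity = 0.0, 1.0

    def step(self, action):
        self.velocity += action
        self.position += self.velocity
        reward = -abs(self.position)
        full_obs = (self.position, self.velocity)   # fully observable: O_t = S_t^e
        partial_obs = self.position                 # partially observable: O_t reveals only
                                                    # part of S_t^e; the agent must build S_t^a
        return reward, full_obs, partial_obs
```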
RL Agent Components
component | description
---|---
policy | a function that tells what action the agent should take in a given state
value function | a function that tells how good each state and/or action is
model | the agent's representation of the environment
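a sketch of the three components as minimal Python interfaces; the class names and tabular representations are illustrative assumptions:

```python
class Policy:
    """pi: maps the agent's state to an action (here a simple lookup table)."""
    def __init__(self, action_for_state):
        self.action_for_state = action_for_state    # e.g. {"s0": "right", "s1": "left"}

    def act(self, state):
        return self.action_for_state[state]

class ValueFunction:
    """V(s): predicted future reward from state s (tabular here)."""
    def __init__(self, values=None):
        self.values = values or {}

    def value(self, state):
        return self.values.get(state, 0.0)

class Model:
    """The agent's own guess at how the environment behaves: predicted next
    state and predicted reward for each (state, action) pair."""
    def __init__(self):
        self.next_state = {}    # (state, action) -> predicted next state
        self.reward = {}        # (state, action) -> predicted reward

    def predict(self, state, action):
        return self.next_state.get((state, action)), self.reward.get((state, action))
```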
RL Agent - Types
containing value function and/or policy function:
- value-based - an agent that stores the value function (the policy is implicit: just read it out of the value function)
- policy-based - an agent that stores the policy (no value function)
- actor-critic - an agent that stores both the policy and the value function
containing a model of the environment:
- model-free - the agent uses a policy and/or value function, but no model of the environment
- model-based - the agent uses a policy and/or value function, plus a model of the environment
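a sketch of how a value-based agent reads its implicit policy out of a stored action-value table Q(s, a); the tabular Q and greedy readout are illustrative assumptions:

```python
def greedy_action(q_table, state, actions):
    """A value-based agent stores only Q(s, a); its policy is implicit:
    pick the action with the highest stored value in the current state."""
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

# hypothetical learned values for one state
q_table = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
print(greedy_action(q_table, "s0", ["left", "right"]))   # "right"
```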
RL - Dichotomies
Dichotomy | Description
---|---
Reinforcement Learning vs Planning | reinforcement learning: the environment is initially unknown, the agent interacts with it and improves its policy; planning: a model of the environment is known, the agent computes with that model (without external interaction) and improves its policy
Exploration vs Exploitation | exploration finds more information about the environment; exploitation uses known information to maximize reward; an agent usually needs to do both
Prediction vs Control | prediction: evaluate the future for a given policy; control: find the best policy; in RL you solve the prediction problem in order to solve the control problem
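a sketch of one common way to trade off exploration against exploitation, epsilon-greedy action selection; the epsilon value and tabular Q are assumptions:

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action, gather new information);
    otherwise exploit (take the action currently believed to be best)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```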
RL - Other
- AlphaGo Fan/Lee/Master/Zero
- Deep Q Networks (DQN)
- Multi/K-Armed Bandit Problem
- Policy Gradient Methods
- Proximal Policy Optimization (PPO)
- Q-Function
- Q-Learning
- Reinforcement Learning from Human Feedback (RLHF)
- RL - Applications
- RL - Example (Tic-Tac-Toe)
- RL - Human Priors for Playing Video Games
- Selective Bootstrap Adaptation