Reinforcement Learning (RL)

  • is the science of decision making
  • is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize some notion of cumulative reward
  • almost all RL problems can be formulated as a Markov Decision Process (MDP)

RL - Learning Paradigms

  • there is no supervisor, only a scalar reward signal
  • feedback may not be instantaneous (i.e. delayed)
  • data is sequential (time-correlated), not i.i.d.
  • the agent's actions affect the subsequent data it receives

RL - Components

rewards

  • a reward 𝑅𝑑 is a scalar feedback signal
  • indicates how well the agent is doing at timestep 𝑑
  • the agent's job is to maximize cumulative reward

RL is based on the reward hypothesis - all goals can be described by the maximization of expected cumulative reward

sequential decision making

  • the goal is to select actions to maximize total future reward
  • actions may have long-term consequences
  • the reward may not be instantaneous (i.e. delayed)
  • it may be better to sacrifice immediate reward in order to gain more long-term reward

at each time-step 𝑑 the agent:

  • receives reward 𝑅𝑑
  • receives observation 𝑂𝑑
  • executes an action 𝐴𝑑
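
The loop below is a minimal sketch of this interaction cycle; the env/agent objects, their method names (reset, step, act), and the episode length are illustrative assumptions, not a fixed API.

```python
# minimal sketch of the agent-environment interaction loop described above;
# the env/agent interface (reset, step, act) is a hypothetical stand-in
def run_episode(env, agent, max_steps=100):
    history = []                                 # H_1:t = [R_1, O_1, A_1, ...]
    reward, obs = env.reset()                    # initial reward R_1 and observation O_1
    for t in range(max_steps):
        action = agent.act(obs)                  # agent selects action A_t
        history.extend([reward, obs, action])    # record R_t, O_t, A_t
        reward, obs, done = env.step(action)     # environment emits R_{t+1}, O_{t+1}
        if done:
            break
    return history
```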

history

  • is the sequence of rewards, observations, and actions from time-step 1 to 𝑑
  • 𝐻1𝑑 = [𝑅1, 𝑂1, 𝐴1, 𝑅2, 𝑂2, 𝐴2, ..., 𝑅𝑑, 𝑂𝑑, 𝐴𝑑]

what happens next depends on the history

  • agent selects actions
  • the environment selects observations & rewards

state

  • is a summary of the history that is used to determine what happens next
  • is a function of history:
    • 𝑆𝑑 = 𝑓(𝐻1𝑑)

2 state types:

  • environment state 𝑆𝑑𝑒
    • is the environment's internal state representation
    • is whatever data the environment uses to pick the next observation & reward
    • has the Markov property
  • agent state 𝑆𝑑𝑎
    • is the agent's internal state representation
    • is whatever data the agent uses to pick the next action
    • is the information used by RL algorithms
    • it can be any function of history
      • 𝑆𝑑𝑎 = 𝑓(𝐻1𝑑)
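
As a small illustration of the point above (the agent state can be any function of the history), here are two common but purely illustrative choices; the (reward, observation, action) tuple layout of the history is an assumption for this sketch.

```python
# two illustrative agent-state functions S_t^a = f(H_1:t); the tuple layout of
# the history is an assumption made for this sketch, not fixed by the notes
def last_observation(history):
    # use only the most recent observation as the agent state
    return history[-1][1]

def last_k_observations(history, k=4):
    # stack the last k observations (a common fix when one observation is not Markov)
    return tuple(step[1] for step in history[-k:])

history = [(0.0, "o1", "a1"), (1.0, "o2", "a2"), (0.5, "o3", "a3")]
print(last_observation(history))         # 'o3'
print(last_k_observations(history, 2))   # ('o2', 'o3')
```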

state with Markov property

  • a state 𝑆𝑑 has the Markov property iff: 𝐏(𝑆𝑑+1|𝑆1, ..., 𝑆𝑑) = 𝐏(𝑆𝑑+1|𝑆𝑑)

  • the entire history from time 1 to 𝑑 (i.e. 𝐻1𝑑) has the Markov property

information/Markov state:

  • has the Markov property
  • contains all useful information from the history 𝐻1𝑑
  • once the information state is known, the history is no longer needed
  • is a sufficient statistic that can be used in determining the future
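
A tiny sketch of the idea, with made-up dynamics (an assumption, not from the notes): the pair (position, velocity) is a sufficient statistic for the future, while the position alone is not.

```python
# made-up deterministic dynamics: the next state depends only on the current
# (position, velocity) pair, so that pair is a Markov (information) state
def step(state):
    pos, vel = state
    return (pos + vel, vel)

h1 = [(0, +1), (1, +1)]   # history ending at position 1, moving right
h2 = [(2, -1), (1, -1)]   # history ending at position 1, moving left

# both histories end with the same observed position (1), yet lead to different
# futures, so "position alone" is not Markov; (position, velocity) is
print(step(h1[-1]))       # (2, 1)
print(step(h2[-1]))       # (0, -1)
```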

environment types

fully observable environment

  • agent directly observes the environment state:
    • 𝑂𝑑 = 𝑆𝑑𝑎 = 𝑆𝑑𝑒
    • agent state = environment state
  • formally this is a Markov Decision Process (MDP)

partially observable environment

  • agent indirectly observes the environment state:
    • 𝑆𝑑𝑎 ≠ 𝑆𝑑𝑒
    • agent state ≠ environment state
  • formally this is a Partially Observable Markov Decision Process (POMDP)
  • the agent must construct its own state representation 𝑆𝑑𝑎, such as:
    • complete history: 𝑆𝑑𝑎 = 𝐻1𝑑
    • beliefs of environment state: 𝑆𝑑𝑎 = (𝐏(𝑆𝑑𝑒=𝑠1), ..., 𝐏(𝑆𝑑𝑒=𝑠𝑛))
    • recurrent neural network: 𝑆𝑑𝑎 = 𝜎(𝑆𝑑-1𝑎𝑊𝑠 + 𝑂𝑑𝑊𝑜), i.e. a linear combination of the previous agent state and the current observation, passed through a non-linearity 𝜎
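
A minimal numpy sketch of the recurrent-network representation above; the dimensions, the random weight matrices 𝑊𝑠 and 𝑊𝑜, and the choice of tanh for the non-linearity 𝜎 are illustrative assumptions.

```python
import numpy as np

state_dim, obs_dim = 8, 4
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))   # weights on the previous agent state
W_o = rng.normal(size=(obs_dim, state_dim))     # weights on the current observation

def update_agent_state(prev_state, obs):
    # S_t^a = sigma(S_{t-1}^a W_s + O_t W_o), with tanh standing in for sigma
    return np.tanh(prev_state @ W_s + obs @ W_o)

s = np.zeros(state_dim)                      # initial agent state
for obs in rng.normal(size=(5, obs_dim)):    # a short stream of observations
    s = update_agent_state(s, obs)
```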

RL Agent Components

policy

a function that tells what action the agent should take in a given state

  • is a map from state to action
  • policy types:
    • deterministic policy: 𝑎 = 𝜋(𝑠)
    • stochastic policy: 𝜋(𝑎|𝑠) = 𝐏(𝐴=𝑎|𝑆=𝑠)
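
A minimal sketch of the two policy types on a made-up state/action set; the states, actions, and probabilities are illustrative assumptions.

```python
import random

deterministic_policy = {"s1": "left", "s2": "right"}   # a = pi(s)

stochastic_policy = {                                  # pi(a|s) = P(A=a | S=s)
    "s1": {"left": 0.9, "right": 0.1},
    "s2": {"left": 0.2, "right": 0.8},
}

def sample_action(policy, state):
    # draw an action from the distribution pi(.|s)
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s1"])              # 'left'
print(sample_action(stochastic_policy, "s1"))  # 'left' with probability 0.9
```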
value function

a function that tells how good each state and/or action is

  • is a prediction of future reward

  • used to evaluate the goodness/badness of states, and therefore used to select between actions

  • 𝑉𝜋(𝑠) = 𝐄𝜋[ 𝛾0𝑅𝑑+0 + 𝛾1𝑅𝑑+1 + 𝛾2𝑅𝑑+2 + ... | 𝑆𝑑=𝑠 ]
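
The quantity inside the expectation is just a discounted sum of rewards; a tiny sketch for one sampled reward sequence (the rewards and 𝛾 are illustrative assumptions):

```python
def discounted_return(rewards, gamma=0.9):
    # gamma^0 * R_t + gamma^1 * R_{t+1} + gamma^2 * R_{t+2} + ...
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))   # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```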

model

the agent's representation of the environment

  • optional: there are model-free agents
  • a model predicts what the environment will do next
  • transitions 𝑇 predict the next state
    • 𝑇𝑠𝑠'𝑎 = 𝐏(𝑆'=𝑠'|𝑆=𝑠,𝐴=𝑎)
  • rewards 𝑅 predict the next reward
    • 𝑅𝑠𝑎 = 𝐄[𝑅|𝑆=𝑠,𝐴=𝑎]
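
A minimal sketch of a tabular model with a single (state, action) entry; the states, probabilities, and reward value are illustrative assumptions.

```python
import random

T = {("s1", "go"): {"s1": 0.3, "s2": 0.7}}   # T_ss'^a = P(S'=s' | S=s, A=a)
R = {("s1", "go"): 1.5}                      # R_s^a  = E[R | S=s, A=a]

def predict_next_state(state, action):
    # sample a next state from the model's transition distribution
    next_states, probs = zip(*T[(state, action)].items())
    return random.choices(next_states, weights=probs, k=1)[0]

def predict_reward(state, action):
    return R[(state, action)]

print(predict_next_state("s1", "go"), predict_reward("s1", "go"))
```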

RL Agent - Types

grouped by whether the agent contains a value function and/or a policy:

  • value-based - an agent that stores a value function (the policy is implicit: just read the best action out of the value function, as in the sketch after this list)
  • policy-based - an agent that stores the policy directly (no value function)
  • actor-critic - an agent that stores both a policy and a value function
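
A minimal sketch of the implicit-policy readout for a value-based agent; the action-value table Q and its numbers are illustrative assumptions.

```python
Q = {
    "s1": {"left": 0.2, "right": 1.1},
    "s2": {"left": 0.7, "right": 0.3},
}

def greedy_action(q_table, state):
    # the policy is not stored anywhere: it is read out of the value function
    return max(q_table[state], key=q_table[state].get)

print(greedy_action(Q, "s1"))   # 'right'
```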

grouped by whether the agent contains a model of the environment:

  • model-free - stores a policy and/or value function, but no model
  • model-based - stores a policy and/or value function, plus a model of the environment

RL - Dichotomies

Reinforcement Learning vs Planning

reinforcement learning

  • the environment is initially unknown
  • the agent interacts with the environment
  • the agent improves its policy

planning

  • the model of the environment is known
  • the agent performs computations with the model (without any external interaction)
  • the agent improves its policy
  • aka: reasoning and search

Exploration vs Exploitation

  • exploration - discovers more information about the environment, possibly giving up some reward in the process
  • exploitation - exploits known information about the environment to maximize reward
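
One common (and here purely illustrative) way to balance the two is epsilon-greedy action selection, sketched below; epsilon and the Q-values are assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: try a random action
    return max(q_values, key=q_values.get)     # exploit: take the best known action

print(epsilon_greedy({"left": 0.2, "right": 1.1}))
```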

Prediction vs Control

  • prediction - given a policy, evaluate the future (how much reward will be obtained)
  • control - find the best policy, i.e. the one that optimizes future rewards

In RL, the prediction problem is typically solved in order to solve the control problem.
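
A minimal sketch of the prediction problem: iterative policy evaluation on a tiny made-up 2-state MDP under a fixed policy; the states, rewards, transition probabilities, and 𝛾 are illustrative assumptions.

```python
states = ["s1", "s2"]
gamma = 0.9
# expected reward and next-state distribution per state under the fixed policy
reward = {"s1": 1.0, "s2": 0.0}
transition = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s2": 1.0}}

# repeatedly apply the Bellman expectation backup until the values settle
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: reward[s] + gamma * sum(p * V[s2] for s2, p in transition[s].items())
         for s in states}

print(V)   # state values under the fixed policy (prediction, not control)
```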
