
Lecture1: Intro to RL

2025-11-04

  • Characteristics of RL
    • There's no supervisor, but only a reward signal
    • Feedback is delayed, not instantaneous
    • Time really matters
    • Agent's actions affect the subsequent data it receives

  • What we need is to figure out the algorithm in the Agent's "mind"

  • One valid definition of the state is to look only at the last observation
    • Environment State

    • Agent State

    • Information (Markov) State

      • The Markov state contains all the information needed to characterize future rewards: given the present state, the future is independent of the past
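The Markov property above can be sketched in code: a minimal, hypothetical two-state chain where sampling the next state needs only the current state, never the full history (the chain and its probabilities are illustrative, not from the lecture).

```python
import random

# Hypothetical two-state Markov chain: the next state depends only on
# the current state, not on the full history (the Markov property).
TRANSITIONS = {
    "A": [("A", 0.9), ("B", 0.1)],
    "B": [("A", 0.5), ("B", 0.5)],
}

def next_state(state: str) -> str:
    """Sample the next state using only the current (Markov) state."""
    states, probs = zip(*TRANSITIONS[state])
    return random.choices(states, weights=probs)[0]

history = ["A"]
for _ in range(5):
    # Only history[-1] is consulted: the present state summarizes the past.
    history.append(next_state(history[-1]))
print(history)
```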

  • Major Components of an RL Agent

    • Policy

    • Value function

      • The behavior that maximizes the value function is the one that correctly trades off the agent's risks to obtain the maximum future reward; this trade-off emerges automatically from maximizing value
    • Model
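The three components can be sketched as minimal data structures for a toy MDP. All names, states, and numbers below are illustrative assumptions, not from the lecture.

```python
# Hypothetical 3-state, 2-action MDP used only to illustrate the
# three components of an RL agent.

# Policy: a mapping from state to action (here, a deterministic policy).
policy = {"s0": "right", "s1": "right", "s2": "left"}

# Value function: the agent's estimate of expected future reward per state.
value = {"s0": 0.0, "s1": 1.0, "s2": 2.0}

# Model: the agent's internal prediction of the environment,
# (state, action) -> (next_state, reward).
model = {
    ("s0", "right"): ("s1", 0.0),
    ("s1", "right"): ("s2", 1.0),
    ("s2", "left"): ("s1", 0.0),
}

s = "s0"
a = policy[s]              # act according to the policy
s_next, r = model[(s, a)]  # predict the consequence with the model
print(a, s_next, r, value[s_next])
```

Not every agent keeps all three: value-based agents may drop the explicit policy, policy-based agents the value function, and model-free agents the model.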

  • RL is like trial-and-error learning
  • The agent should discover a good policy
  • From its experiences of the environment
  • Without losing too much reward along the way
  • Exploration & Exploitation
    • Exploration finds more information about the environment
      • choosing to give up some reward that you know about in order to find more about the environment
    • Exploitation exploits known information to maximise reward
    • It's usually important to explore as well as exploit
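A standard way to balance the two is epsilon-greedy action selection, sketched below for a hypothetical 3-armed bandit (the value estimates are made up for illustration).

```python
import random

def epsilon_greedy(q_values: dict, epsilon: float) -> str:
    """With probability epsilon, explore (pick a random action);
    otherwise exploit (pick the action with the highest estimate)."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

# Hypothetical action-value estimates for a 3-armed bandit.
q = {"a": 0.2, "b": 0.5, "c": 0.1}

# epsilon=0 always exploits; epsilon=1 always explores.
print(epsilon_greedy(q, epsilon=0.0))  # prints "b", the greedy action
```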

  • In RL we need to solve the prediction problem (evaluating how well a given policy performs) in order to solve the control problem (finding the best policy), which means we need to be able to evaluate policies so we can find the best one.
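The prediction problem can be sketched as iterative policy evaluation on a hypothetical 2-state MDP with a fixed policy (rewards, transition probabilities, and the discount factor below are illustrative assumptions): repeatedly apply the Bellman expectation backup v(s) ← r(s) + γ·Σ P(s'|s)·v(s') until the values converge.

```python
# Iterative policy evaluation on a made-up 2-state MDP.
gamma = 0.9
reward = {"s0": 0.0, "s1": 1.0}
# Transition probabilities under the fixed policy being evaluated.
trans = {"s0": {"s0": 0.5, "s1": 0.5}, "s1": {"s0": 1.0}}

v = {"s0": 0.0, "s1": 0.0}
for _ in range(200):
    # Bellman expectation backup for every state.
    v = {s: reward[s] + gamma * sum(p * v[s2] for s2, p in trans[s].items())
         for s in v}
print(v)  # converged state values under this policy
```

Control would then improve the policy using these values and re-evaluate, repeating until no improvement is possible.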