Lecture 1: Intro to RL
2025-11-04

- Characteristics of RL
    - There is no supervisor, only a reward signal
    - Feedback is delayed, not instantaneous
    - Time really matters (sequential data, not i.i.d.)
    - The agent's actions affect the subsequent data it receives


- What we need is to figure out the algorithm in the Agent's "mind"

- One valid definition of the agent's state is to look only at the last observation
Environment State

Agent State

Information (Markov) State

- The Markov state contains enough information to characterize all future rewards
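
This can be stated formally with the standard Markov property (notation as in the usual RL convention, where $S_t$ is the state at time $t$):

```latex
% A state S_t is Markov iff the future is independent of the past given the present:
\mathbb{P}\left[S_{t+1} \mid S_t\right] = \mathbb{P}\left[S_{t+1} \mid S_1, \dots, S_t\right]
```

That is, once the current state is known, the history can be thrown away: the state is a sufficient statistic of the future.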

Major Components of an RL Agent

Policy


Value function

- The behavior that maximizes the value is the one that correctly trades off the agent's risks so as to get the maximum reward going into the future, and that trade-off emerges automatically from maximizing the value function

Model

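The three components above can be sketched as plain Python data (a toy sketch; the 2-state, 2-action problem and all numbers here are made-up assumptions, not from the lecture):

```python
# Toy sketch of an RL agent's three major components.
# States, actions, and all numeric values are illustrative assumptions.

states = ["s0", "s1"]
actions = ["left", "right"]

# Policy: the agent's behavior, a mapping from state to action
# (deterministic here; in general it can be a distribution over actions).
policy = {"s0": "right", "s1": "left"}

# Value function: the agent's prediction of total future reward from each state.
value = {"s0": 0.5, "s1": 1.2}

# Model: the agent's internal prediction of the environment:
# next-state transitions and expected immediate rewards.
transition = {("s0", "right"): "s1", ("s0", "left"): "s0",
              ("s1", "left"): "s0", ("s1", "right"): "s1"}
reward = {("s0", "right"): 0.0, ("s0", "left"): 0.0,
          ("s1", "left"): 1.0, ("s1", "right"): 0.0}

def act(state):
    """Pick an action using the policy."""
    return policy[state]

def simulate(state, action):
    """Use the model to predict the next state and immediate reward."""
    return transition[(state, action)], reward[(state, action)]

s = "s0"
a = act(s)
print(simulate(s, a))  # the model's prediction for (next state, reward)
```

Not every agent has all three: value-based agents may have no explicit policy, policy-based agents no value function, and model-free agents no model.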
- RL is like trial-and-error learning
- The agent should discover a good policy
    - from its experience of the environment
    - without losing too much reward along the way
- Exploration & Exploitation
    - Exploration finds more information about the environment
        - choosing to give up some reward you know about in order to learn more about the environment
    - Exploitation exploits known information to maximise reward
    - It is usually important to explore as well as exploit
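
A standard way to balance the two is ε-greedy action selection. A minimal two-armed bandit sketch (the arm probabilities and ε value are made-up assumptions, not from the lecture):

```python
import random

random.seed(0)

# Two-armed bandit with hidden success probabilities (assumed values).
true_prob = [0.3, 0.7]

def pull(arm):
    """Environment: reward 1 with the arm's hidden probability, else 0."""
    return 1.0 if random.random() < true_prob[arm] else 0.0

estimates = [0.0, 0.0]  # estimated value of each arm
counts = [0, 0]
epsilon = 0.1           # fraction of the time we explore

for _ in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(2)              # explore: pick a random arm
    else:
        arm = estimates.index(max(estimates))  # exploit: pick the best-looking arm
    r = pull(arm)
    counts[arm] += 1
    # Incremental mean update of the value estimate.
    estimates[arm] += (r - estimates[arm]) / counts[arm]

print(estimates)  # estimates should approach the true probabilities
```

With ε = 0 the agent can lock onto a bad arm forever; with ε = 1 it never cashes in what it has learned. The small ε keeps a steady trickle of exploration while mostly exploiting.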

- In RL, we need to solve the prediction problem (evaluating how much reward a given policy will collect) in order to solve the control problem (finding the best policy): optimizing over policies requires being able to evaluate them.
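
A minimal illustration of the prediction half, iterative policy evaluation on a tiny made-up 2-state chain (all numbers are assumptions for illustration):

```python
# Evaluate a fixed policy on a toy MDP: under this policy, state 0 moves
# to state 1 with reward 0, and state 1 stays in state 1 with reward 1.
gamma = 0.9                 # discount factor (assumed)
next_state = {0: 1, 1: 1}   # deterministic transitions under the policy
reward = {0: 0.0, 1: 1.0}

V = {0: 0.0, 1: 0.0}
for _ in range(200):  # sweep the Bellman expectation backup until convergence
    V = {s: reward[s] + gamma * V[next_state[s]] for s in V}

# Closed form for comparison: V(1) = 1/(1-gamma) = 10, V(0) = gamma * V(1) = 9
print(V)
```

Control then sits on top of prediction: with value estimates like these in hand for each candidate policy, the agent can compare policies and pick the best one.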