Lecture 1: Intro to RL
2025-11-04

- Characteristics of RL
    - There is no supervisor, only a reward signal
    - Feedback is delayed, not instantaneous
    - Time really matters (sequential data, not i.i.d.)
    - The agent's actions affect the subsequent data it receives


- What we need is to figure out the algorithm in the Agent's "mind"

- One valid definition of the agent's state is to look only at the last observation
Environment State

Agent State

Information (Markov) State

- The Markov state contains enough information to characterize all future rewards
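
This can be stated formally with the standard Markov property (notation as in the usual RL convention, where $S_t$ is the state at time $t$):

```latex
% A state S_t is Markov iff the future is independent of the past given the present:
\mathbb{P}\left[S_{t+1} \mid S_t\right] = \mathbb{P}\left[S_{t+1} \mid S_1, \dots, S_t\right]
```

That is, once the current state is known, the history can be thrown away: the state is a sufficient statistic of the future.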

Major Components of an RL Agent

Policy


Value function

- The behavior that maximizes the value is the one that correctly trades off the agent's risks so as to get the maximum reward going into the future, and that trade-off emerges automatically from maximizing the value function

Model

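The three components above can be sketched as plain Python data (a toy sketch; the 2-state, 2-action problem and all numbers here are made-up assumptions, not from the lecture):

```python
# Toy sketch of an RL agent's three major components.
# States, actions, and all numeric values are illustrative assumptions.

states = ["s0", "s1"]
actions = ["left", "right"]

# Policy: the agent's behavior, a mapping from state to action
# (deterministic here; in general it can be a distribution over actions).
policy = {"s0": "right", "s1": "left"}

# Value function: the agent's prediction of total future reward from each state.
value = {"s0": 0.5, "s1": 1.2}

# Model: the agent's internal prediction of the environment:
# next-state transitions and expected immediate rewards.
transition = {("s0", "right"): "s1", ("s0", "left"): "s0",
              ("s1", "left"): "s0", ("s1", "right"): "s1"}
reward = {("s0", "right"): 0.0, ("s0", "left"): 0.0,
          ("s1", "left"): 1.0, ("s1", "right"): 0.0}

def act(state):
    """Pick an action using the policy."""
    return policy[state]

def simulate(state, action):
    """Use the model to predict the next state and immediate reward."""
    return transition[(state, action)], reward[(state, action)]

s = "s0"
a = act(s)
print(simulate(s, a))  # the model's prediction for (next state, reward)
```

Not every agent has all three: value-based agents may have no explicit policy, policy-based agents no value function, and model-free agents no model.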
- RL is like trial-and-error learning
- The agent should discover a good policy
    - from its experience of the environment
    - without losing too much reward along the way
- Exploration & Exploitation
    - Exploration finds more information about the environment
        - choosing to give up some reward you know about in order to learn more about the environment
    - Exploitation exploits known information to maximise reward
    - It is usually important to explore as well as exploit
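
A standard way to balance the two is ε-greedy action selection. A minimal two-armed bandit sketch (the arm probabilities and ε value are made-up assumptions, not from the lecture):

```python
import random

random.seed(0)

# Two-armed bandit with hidden success probabilities (assumed values).
true_prob = [0.3, 0.7]

def pull(arm):
    """Environment: reward 1 with the arm's hidden probability, else 0."""
    return 1.0 if random.random() < true_prob[arm] else 0.0

estimates = [0.0, 0.0]  # estimated value of each arm
counts = [0, 0]
epsilon = 0.1           # fraction of the time we explore

for _ in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(2)              # explore: pick a random arm
    else:
        arm = estimates.index(max(estimates))  # exploit: pick the best-looking arm
    r = pull(arm)
    counts[arm] += 1
    # Incremental mean update of the value estimate.
    estimates[arm] += (r - estimates[arm]) / counts[arm]

print(estimates)  # estimates should approach the true probabilities
```

With ε = 0 the agent can lock onto a bad arm forever; with ε = 1 it never cashes in what it has learned. The small ε keeps a steady trickle of exploration while mostly exploiting.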

- In RL, we need to solve the prediction problem (evaluating how much reward a given policy will collect) in order to solve the control problem (finding the best policy): optimizing over policies requires being able to evaluate them.
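
A minimal illustration of the prediction half, iterative policy evaluation on a tiny made-up 2-state chain (all numbers are assumptions for illustration):

```python
# Evaluate a fixed policy on a toy MDP: under this policy, state 0 moves
# to state 1 with reward 0, and state 1 stays in state 1 with reward 1.
gamma = 0.9                 # discount factor (assumed)
next_state = {0: 1, 1: 1}   # deterministic transitions under the policy
reward = {0: 0.0, 1: 1.0}

V = {0: 0.0, 1: 0.0}
for _ in range(200):  # sweep the Bellman expectation backup until convergence
    V = {s: reward[s] + gamma * V[next_state[s]] for s in V}

# Closed form for comparison: V(1) = 1/(1-gamma) = 10, V(0) = gamma * V(1) = 9
print(V)
```

Control then sits on top of prediction: with value estimates like these in hand for each candidate policy, the agent can compare policies and pick the best one.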