Interactive tabular RL in four rooms

Fill in the update rules for model-free Q-learning, Dyna-Q, and hindsight experience replay (HER). Then watch how the agent's values, policy, and learning curve change inside a classic four-rooms gridworld.

The world is an 11 by 11 four-rooms task: two long interior walls split the map into four regions, and each wall segment has a single doorway. The agent starts in the upper-left room and tries to reach a goal in the lower-right room. To find the fastest routes, it has to discover the doorways and then propagate value backward through them.
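One way to picture the map is as a set of wall cells plus a step function. The sketch below is only an illustration: the doorway positions, the corner start and goal cells, the sparse reward of 1 at the goal, and the names SIZE, WALLS, DOORWAYS, and step are all assumptions, not the layout this page necessarily uses.

```python
SIZE = 11
MID = SIZE // 2  # interior walls run along row 5 and column 5

# Assumed doorway cells, one per wall segment (hypothetical positions).
DOORWAYS = {(MID, 2), (MID, 8), (2, MID), (8, MID)}

# Every cell on the middle row or column is a wall unless it is a doorway.
WALLS = {(MID, i) for i in range(SIZE)} | {(i, MID) for i in range(SIZE)}
WALLS -= DOORWAYS

# Assumed start (upper-left room) and goal (lower-right room).
START, GOAL = (0, 0), (SIZE - 1, SIZE - 1)

# Actions as (row delta, column delta): up, down, left, right.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def step(state, action):
    """Apply one action; hitting a wall or the map edge leaves the agent in place."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    nxt = (row + d_row, col + d_col)
    in_bounds = 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE
    if not in_bounds or nxt in WALLS:
        nxt = state
    done = nxt == GOAL
    reward = 1.0 if done else 0.0  # assumed sparse reward: 1 at the goal, 0 elsewhere
    return nxt, reward, done
```

With this layout each room is a 5 by 5 block, and the only way to move between rooms is through one of the four doorway cells.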

Why this example?

Four rooms is just large enough to make exploration and representation visible. A good value function has structure: states near useful doorways become valuable before the whole map looks solved.

What to watch

Use the value heatmap for learned expectations, the policy arrows for the greedy route, visits for exploration coverage, and the return chart for learning speed. A working update should make all four tell the same story.
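The value heatmap and the policy arrows can both be read off the same Q-table. The sketch below assumes a table stored as a dictionary keyed by (state, action) with four actions ordered up, down, left, right; the name greedy_summary and that storage format are hypothetical.

```python
ARROWS = ["↑", "↓", "←", "→"]  # assumed action order: up, down, left, right


def greedy_summary(Q, states, n_actions=4):
    """Return (values, arrows): the max Q-value and the greedy action symbol per state."""
    values, arrows = {}, {}
    for s in states:
        # Missing entries count as 0.0, which is what an untrained table looks like.
        q_row = [Q.get((s, a), 0.0) for a in range(n_actions)]
        # Ties go to the lowest action index.
        best = max(range(n_actions), key=q_row.__getitem__)
        values[s] = q_row[best]
        arrows[s] = ARROWS[best]
    return values, arrows
```

Because ties resolve to the first action, a state with no learned values shows an arbitrary arrow; only states with nonzero values carry real information about the route.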

The comparison

Q-learning learns only from real steps, Dyna-Q also learns from imagined transitions replayed from its learned model, and HER turns failed trajectories into useful goal-conditioned practice.

Part 1

Model-free Q-learning

Start with the one-step temporal-difference update. At first, the value heatmap is mostly flat. As successful episodes occur, positive value should spread backward from the goal through the doorway sequence. The greedy-policy view should eventually point toward a route through the four rooms.

Big idea: model-free learning is simple and honest. It only changes a value after the agent actually experiences a transition, so credit assignment moves one step at a time.
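The one-step temporal-difference update can be sketched as follows. This is a minimal sketch rather than the page's reference solution: the function name q_learning_update, the (state, action) dictionary keys, and the default step size alpha and discount gamma are assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.95, n_actions=4):
    """Move Q[(s, a)] toward the one-step target r + gamma * max_b Q[(s_next, b)]."""
    if done:
        target = r  # nothing to bootstrap from after a terminal transition
    else:
        target = r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Usage: an empty table starts at zero everywhere.
Q = defaultdict(float)
q_learning_update(Q, (10, 9), 3, 1.0, (10, 10), True)
```

Only the entry for the visited state-action pair changes, which is why value spreads backward from the goal one step per update.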

Part 2

Dyna-Q planning

Dyna-Q stores a tiny tabular model of experienced transitions, then replays imagined transitions after each real step. With planning turned up, the learning curve should become noticeably steeper because the agent can update old states without physically revisiting them.

Big idea: if the world dynamics are reusable, memory can become practice. After the agent has seen a doorway once, planning can keep propagating that information through previously visited states.
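A single Dyna-Q step might look like the sketch below. It assumes a deterministic world, so the model keeps only the latest outcome for each (state, action) pair, and it samples planning transitions uniformly from that model. The name dyna_q_step, the n_planning parameter, and the defaults are hypothetical.

```python
import random
from collections import defaultdict


def dyna_q_step(Q, model, s, a, r, s_next, done,
                n_planning=10, alpha=0.1, gamma=0.95, n_actions=4, rng=random):
    """One real Q-learning update, a model update, then n_planning imagined updates."""

    def update(s0, a0, r0, s1, terminal):
        best_next = 0.0 if terminal else max(Q[(s1, b)] for b in range(n_actions))
        Q[(s0, a0)] += alpha * (r0 + gamma * best_next - Q[(s0, a0)])

    update(s, a, r, s_next, done)        # learn from the real transition
    model[(s, a)] = (r, s_next, done)    # remember it (assumes deterministic dynamics)

    seen = list(model)
    for _ in range(n_planning):
        ps, pa = rng.choice(seen)        # replay a previously experienced pair
        pr, ps_next, p_done = model[(ps, pa)]
        update(ps, pa, pr, ps_next, p_done)
```

Setting n_planning to 0 reduces this to plain Q-learning, which makes the two methods easy to compare under otherwise identical settings.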

Part 3

Hindsight experience replay

HER treats the state actually reached at the end of an episode as if it had been the intended goal. In the visit heatmap, failed wandering can still become useful training data; in the value view, the agent learns a goal-conditioned table for multiple possible outcomes.

Big idea: failure can still be data. HER asks, "What goal did this trajectory accidentally solve?" and uses that relabeled goal to train a more flexible representation.
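The relabeling step can be sketched with a goal-conditioned table keyed by (state, goal, action). The sketch assumes the "final state" relabeling described above, a reward of 1 for reaching the relabeled goal, and updates applied in reverse order so value propagates within one pass. The names her_relabel and her_update and the reverse-order choice are assumptions, not the page's own implementation.

```python
from collections import defaultdict


def her_relabel(episode):
    """episode is a list of (state, action, next_state); relabel it with the final state as goal."""
    achieved = episode[-1][2]
    relabeled = []
    for s, a, s_next in episode:
        done = s_next == achieved        # also true if the episode passed through this cell earlier
        reward = 1.0 if done else 0.0    # assumed sparse reward for the relabeled goal
        relabeled.append((s, a, reward, s_next, achieved, done))
    return relabeled


def her_update(Q, episode, alpha=0.1, gamma=0.95, n_actions=4):
    """Apply goal-conditioned Q-learning updates to the hindsight-relabeled episode."""
    for s, a, r, s_next, g, done in reversed(her_relabel(episode)):
        best_next = 0.0 if done else max(Q[(s_next, g, b)] for b in range(n_actions))
        Q[(s, g, a)] += alpha * (r + gamma * best_next - Q[(s, g, a)])


# Usage: a two-step wander that never reached the real goal still yields a reward signal.
Q = defaultdict(float)
her_update(Q, [((0, 0), 3, (0, 1)), ((0, 1), 3, (0, 2))])
```

The table grows with the number of goals that have been reached at least once, which is the cost of learning values for multiple possible outcomes.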

Comparison lab

Compare and customize

Switch between all three algorithms, tune exploration and planning, move the start or goal, and edit the four-rooms walls. The chart keeps a separate curve for each method so you can compare how quickly each one adapts under the same arrangement.
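The exploration setting is commonly an epsilon-greedy rate, although this page does not say which scheme it uses. The sketch below assumes epsilon-greedy action selection over a dictionary Q-table keyed by (state, action); the name epsilon_greedy and the default epsilon are hypothetical.

```python
import random


def epsilon_greedy(Q, state, epsilon=0.1, n_actions=4, rng=random):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    q_row = [Q.get((state, a), 0.0) for a in range(n_actions)]
    best = max(q_row)
    # Break ties at random so an untrained table does not always pick the same action.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])
```

A higher epsilon spreads visits more evenly across the rooms, while a lower one commits sooner to whatever route the current values favor.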

Try training each method for the same number of episodes, then shift the goal. Q-learning should need fresh real experience, Dyna-Q can reuse its transition model, and HER should expose how goal-conditioned values differ from a single-goal value table.