Neural networks learning game environments
LEARNING JOURNAL

Learning PufferLib

Learning RL the fun way, training agents at 1M+ steps per second.

Python | PufferLib | C

WHAT IS PUFFERLIB

The Framework

PufferLib is a high-performance reinforcement learning library built around environments written in C. It lets you train AI agents to play games at insane speeds: 1M+ steps per second on a single GPU.

You define an environment: what the agent can see (observation space), what moves it can make (action space), and how it knows if it's doing well (rewards). Then you let it play millions of games until it figures out a strategy. Speed is what makes PufferLib special: faster training means more experimentation.
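To make that concrete, here's a minimal sketch of an environment in the Gymnasium-style interface that PufferLib can wrap. The toy class, its observation layout, and its reward are made up for illustration, not PufferLib's own API:

import gymnasium as gym
import numpy as np

class TinyGridEnv(gym.Env):
    # Hypothetical toy environment: the agent sees a flattened 10x10 grid
    # and picks one of 4 moves each step.
    def __init__(self):
        self.observation_space = gym.spaces.Box(0.0, 1.0, shape=(100,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(4)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.grid = np.zeros(100, dtype=np.float32)
        return self.grid, {}

    def step(self, action):
        # Toy reward: "doing well" here just means picking action 0.
        reward = 1.0 if action == 0 else 0.0
        return self.grid, reward, False, False, {}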

I'm new to RL, so take what I say with a grain of salt, but Joseph's streams made it look easy. PufferLib and I are becoming best friends... having way too much fun spinning up environments and watching agents learn.

EXPERIMENT 01

WC3 Mazing Contest

If you're familiar with Warcraft 3 custom games, this is the Mazing Contest. You get a build phase where you place towers to create a maze, then a run phase where a "runner" spawns and you try to delay it for as long as possible.

The agent got good enough to beat me consistently, but it's definitely not at the level of the better Mazing Contest players.

WC3 Mazing Contest agent building maze

Best Rewards I Found

1. Average Path Length

Maximize the distance the runner has to travel

2. Maximizing Edges Touched

Reward for the number of wall edges the runner's path touches, so each placed wall gets used fully (hence the little numbers in the clip)

3. Thunderclap Time Slowed

Small rewards for maximizing slow time

I tried many others (wall-touch penalties, efficiency metrics); the edges approach worked best.
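As a sketch of how those pieces could combine into one scalar reward per episode (the weights below are illustrative, not the ones I actually shipped):

def maze_reward(path_length, edges_touched, slow_time):
    # Hypothetical weighting: runner path length dominates, edge usage and
    # thunderclap slow time add smaller bonuses on top.
    return 1.0 * path_length + 0.2 * edges_touched + 0.05 * slow_time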

UNDER THE HOOD

Action & Observation Spaces

Action Space: Discrete(200)

0-99 place walls, 100-199 place thunderclap towers. Grid encoding: position = y×10 + x
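A quick sketch of decoding that flat action back into a tower type and grid cell (the helper name is mine):

def decode_action(action):
    # 0-99 -> wall, 100-199 -> thunderclap; position = y*10 + x on the 10x10 grid
    tower_type = "wall" if action < 100 else "thunderclap"
    y, x = divmod(action % 100, 10)
    return tower_type, x, y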

Observation: 108 floats

100-dim grid + 2 resources + 3 phase info + 3 goal info
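Assembling that vector is just a concatenation; a sketch with stand-in feature names:

import numpy as np

def build_observation(grid, resources, phase_info, goal_info):
    # 100 grid floats + 2 resources + 3 phase + 3 goal = 108 total
    obs = np.concatenate([grid.ravel(), resources, phase_info, goal_info]).astype(np.float32)
    assert obs.shape == (108,)
    return obs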

Dual-Phase Design

BUILD phase → RUN phase (BFS pathfinding evaluates the maze). The agent learns to build mazes that maximize traversal time.
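The run phase can be scored with a breadth-first search over the 10x10 grid: the shortest entrance-to-exit path is what the agent is trying to stretch. A minimal sketch of that check (my own helper, not the env's C code):

from collections import deque

def bfs_path_length(blocked, start, goal, size=10):
    # blocked: set of (x, y) cells occupied by towers.
    # Returns the number of steps from start to goal, or None if the maze is sealed.
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (x, y), dist = queue.popleft()
        if (x, y) == goal:
            return dist
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < size and 0 <= ny < size and (nx, ny) not in seen and (nx, ny) not in blocked:
                seen.add((nx, ny))
                queue.append(((nx, ny), dist + 1))
    return None  # fully blocked builds are invalid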

EXPERIMENT 02

Rift: Roguelike Agent

Rift is a roguelike I built inspired by Diablo 3's rift system. You enter a rift, fight through scaling difficulty, then return to town to shop and equip gear before the next rift.

The agent needs to navigate, kill monsters, collect items, and survive. I'm pretty happy with its combat performance, but I haven't fully verified the shopping behavior yet.

Rift environment training
UNDER THE HOOD

Action & Observation Spaces

Context-Dependent Actions

Rift: 14 actions (8-dir movement + Blizzard + potions + interact). Town: 17 actions. Same IDs, different meanings per phase.
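In code, "same IDs, different meanings" is just a per-phase lookup table. The command names below are illustrative stand-ins, not Rift's exact lists:

RIFT_COMMANDS = ["move_n", "move_ne", "move_e", "move_se", "move_s", "move_sw",
                 "move_w", "move_nw", "cast_blizzard", "drink_hp_potion",
                 "drink_mp_potion", "interact", "wait", "leave_rift"]        # 14
TOWN_COMMANDS = ["move_n", "move_ne", "move_e", "move_se", "move_s", "move_sw",
                 "move_w", "move_nw", "buy_item_1", "buy_item_2", "buy_item_3",
                 "reroll_shop", "open_equipment", "equip_item", "sell_item",
                 "enter_rift", "interact"]                                   # 17

def decode(phase, action_id):
    # The same integer maps to a different command depending on the phase.
    table = RIFT_COMMANDS if phase == "rift" else TOWN_COMMANDS
    return table[action_id]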

Observation: 223 floats

Player state (17) + 10×10 ego-centric grid (100) + town interface (101). Enemies encoded by threat level.

Progressive Difficulty

Per Rift: HP ×1.25, Damage ×1.18, Speed ×1.05, Attack CD ×0.95
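Those multipliers compound every rift, so monsters at rift 10 have roughly 7.5x the HP of rift 1. A sketch of the scaling (stat names are mine):

def scale_monster(base, rift_level):
    # Multipliers compound once per rift beyond the first (rift 1 = base stats).
    n = rift_level - 1
    return {
        "hp":        base["hp"]        * 1.25 ** n,
        "damage":    base["damage"]    * 1.18 ** n,
        "speed":     base["speed"]     * 1.05 ** n,
        "attack_cd": base["attack_cd"] * 0.95 ** n,  # lower cooldown = faster attacks
    }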

KEY LEARNING

The Observation Problem

I tried two ways to represent what the agent sees. Only one worked:

Local Grid

Worked

10x10 ego-centric grid that moves with the player. Cells are marked by entity type. Trained almost instantly.

Grid observation
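A sketch of how a window like that gets cut out of the full map (my own helper; the real env encodes more entity types):

import numpy as np

def local_grid(world, px, py, size=10):
    # world: 2D array of entity codes (0 empty, 1 wall, 2 enemy, ...); (px, py): player cell.
    # Returns a size x size window centered on the player; off-map cells read as walls.
    half = size // 2
    window = np.ones((size, size), dtype=np.float32)
    for dy in range(size):
        for dx in range(size):
            x, y = px - half + dx, py - half + dy
            if 0 <= x < world.shape[1] and 0 <= y < world.shape[0]:
                window[dy, dx] = world[y, x]
    return window.ravel()  # 100 floats, the grid slice of the observation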

Entity List

Future

Store the 10-20 closest entities with relative positions. The agent kept running into corners because this representation carried no wall info.

Entity list approach
TOWN MODE

Shop

Rerolling the shop costs gold. Items are randomly generated across the different equipment slots.

Town mode shop screen

Equipment

Manage gear between rifts. Better equipment needed as difficulty scales.

Town mode equipment screen
EXPERIMENT 03

Tower Defense

Classic tower defense: enemies spawn and walk a path. The agent learns tower placement and type selection to maximize kills.

Action Space: Discrete(577)

Each action encodes a tower type + grid position. Invalid placements get a small penalty (-0.1), so the agent learns the constraints naturally; see the decoding sketch after the list below.

Action 0: No-op

1-192: Normal Tower

Range 2, fast firing

193-384: Splash Tower

Range 3, AOE damage

385-576: Sniper Tower

Range 4, 2x damage
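Putting that together: 577 = 1 no-op + 3 tower types × 192 grid cells (the 16×12 map). A decoding sketch (the helper name and the cell ordering inside each block are my assumptions):

GRID_W, GRID_H = 16, 12
TOWER_TYPES = ["normal", "splash", "sniper"]

def decode_td_action(action):
    # 0 = no-op; 1-192 normal, 193-384 splash, 385-576 sniper.
    if action == 0:
        return None
    idx = action - 1
    tower = TOWER_TYPES[idx // (GRID_W * GRID_H)]
    y, x = divmod(idx % (GRID_W * GRID_H), GRID_W)
    return tower, x, y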

UNDER THE HOOD

Observation: 7 Channels × 16×12 Grid

Ch 1-5: Tower, Enemy, Path, Gold, Valid Placements

Ch 6-7: Coverage Map + Enemy Density

Coverage = 1.0 - (dist/range). Density = a Gaussian blur over enemy positions. These channels teach spatial awareness.
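A sketch of how those two derived channels could be computed, using scipy's Gaussian filter for the blur (the sigma and the exact normalization are my assumptions):

import numpy as np
from scipy.ndimage import gaussian_filter

def coverage_channel(towers, grid_w=16, grid_h=12):
    # towers: list of (x, y, range); each cell keeps its best 1 - dist/range over all towers.
    cov = np.zeros((grid_h, grid_w), dtype=np.float32)
    for tx, ty, rng in towers:
        for y in range(grid_h):
            for x in range(grid_w):
                dist = np.hypot(x - tx, y - ty)
                if dist <= rng:
                    cov[y, x] = max(cov[y, x], 1.0 - dist / rng)
    return cov

def density_channel(enemies, grid_w=16, grid_h=12, sigma=1.0):
    # Count enemies per cell, then blur so nearby cells "feel" the crowd too.
    counts = np.zeros((grid_h, grid_w), dtype=np.float32)
    for x, y in enemies:
        counts[int(y), int(x)] += 1.0
    return gaussian_filter(counts, sigma=sigma)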

KEY LEARNINGS

Balancing & Rewards

The agent initially chose the Normal Tower 100% of the time: if one option dominates, it will exploit it. It took several rounds of tuning cost, range, and damage until all three towers became viable.

Reward function: kills + a coverage bonus on placement. Removing the coverage bonus made results worse; the agent needs guidance on positioning, not just "kill things."

# Coverage bonus on placement
reward += path_cells_in_range * 0.1

Speed variation (0.8x-1.2x per enemy) prevents synchronization exploits.

Tower Defense training 1
Tower Defense training 2
WORK IN PROGRESS

On Hold

Poker Self-Play

Texas Hold'em with agents playing copies of themselves. Currently too aggressive, not getting punished because opponents are equally aggressive.

This one's been humbling. Learning a lot about self-play training traps.

Poker self-play

Hotel Simulator

Hotel management: check-ins, room assignments, resource allocation. Maximizing occupancy and guest satisfaction.

Environment built, training not started. Interesting multi-objective optimization problem.

Hotel Simulator
TOOLING

Parameter Tuning

wandb.ai is essential for tracking experiments, comparing runs, and running hyperparameter sweeps.
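The basic loop is only a few lines: start a run, log metrics every iteration, compare in the dashboard. Project and metric names below are placeholders:

import wandb

run = wandb.init(project="pufferlib-experiments", config={"hidden_size": 32})
for step in range(1000):
    # ... one training iteration here ...
    sps, ep_return = 0.0, 0.0  # placeholders: swap in your real metrics
    wandb.log({"steps_per_second": sps, "episode_return": ep_return}, step=step)
run.finish()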

As a beginner I kept thinking my setup was already optimized, then realized that tuning a few parameters could get 3x to 10x more steps per second.

# All I changed...
[policy]
hidden_size = 32

[rnn]
input_size = 32
hidden_size = 32

minibatch_size = 65536

3-10x faster
66% overnight improvement
30k→300k steps/sec

Sweeping is nuts: a single overnight run got a 66% improvement. Thankfully puffer sweep helps a lot here.

Before vs After optimization

Overnight sweep results

I wish I had multiple GPUs to run more than one training job at a time, but PufferLib lets contributors use their TinyBoxes.

Explore PufferLib

Thanks to Joseph Suarez for building this. If you're interested in RL, start here.