I noticed that if you go from training to watch and then back, the training score temporarily drops significantly.
trained and made a viz for the model and then made it displace text.
should probably do a proper write-up: https://x.com/i/status/2038367016969724259
avg500: -4.6 (last 500 episodes)
peak: 3959.3 (best window)
roll/s: 20.68 (20-step avg)
progress: 4388 (562749 episodes)
But at around 4K avg score you should see it solve the env almost every time.
Just a demo :) optimized for speed over stability.
Reward structure: Step: -1, Dot: +100, Win: +1000, so ~4k is the max theoretical score on 6x6.
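For anyone who wants to sanity-check the ~4k figure, here is a rough sketch of the arithmetic. The reward values are the ones quoted above; the function name, the dot count (30) and the step count (30) are my own assumptions for illustration, since the actual number of dots on the 6x6 board isn't stated.

```python
# Reward values as quoted in the comment above.
STEP_REWARD = -1
DOT_REWARD = 100
WIN_REWARD = 1000


def episode_score(dots_eaten: int, steps: int, won: bool) -> int:
    """Total reward for one episode under the quoted reward structure."""
    score = steps * STEP_REWARD + dots_eaten * DOT_REWARD
    if won:
        score += WIN_REWARD
    return score


# Hypothetical layout (assumption, not from the demo): 30 dots,
# all collected in 30 steps, ending in a win.
print(episode_score(dots_eaten=30, steps=30, won=True))  # 3970
```

Under those assumed numbers a clean run lands just under 4k. Every extra step costs a point and every missed dot costs 100.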
Looks like this is for Linux and Windows; on NetBSD I get this issue :(
> WebGPU is not yet available in Release or late Beta builds.