I'm most interested in how the equation can be implemented step by step in an ML library - worked examples would be very helpful.
Thank you!
Read here: http://incompleteideas.net/book/the-book-2nd.html
It includes both mathematical formulas and PyTorch code.
I found it a bit more practical than the Sutton & Barto book, which is a classic but doesn't cover some of the more modern methods used in deep reinforcement learning.
https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutto...
This will have a gentler learning curve. After this you can move on to more advanced material.
The other resource I will recommend is everything by Bertsekas. In this context, his books on dynamic programming and neuro-dynamic programming.
Happy reading.
This is because they work assuming you know a model of the environment. Most real world RL is model-free RL. Or, like in LLMs, "model is known but too big to practically use" RL.
Apart from the resources you use (good ones in other comments already), try to get the initial mental model of the whole field right; that is important, since everything you read can then fit in the right place of that mental model. I will try to give one below.
- the absolute core raison d'etre of RL as a separate field: the quality of data you train on only improves as your algorithm improves. As opposed to other ML where you have all your data beforehand.
- first, basic Bellman equation solving (code wise this is just solving a system of linear equations; see the sketch after this list)
- an algo you will come across called policy iteration (code wise, a bunch of for loops; also shown in the sketch after this list)
- here you will be able to see how different parts of the algo become impossible in different setups, and what approximations can be done for each of them (and this is where the first neural network - called "function approximator" in RL literature - comes into play). Here you can recognise approximate versions of the Bellman equation.
- here you learn DDPG, SAC algos. Crucial. Called "actor critic" in parlance.
- you also notice problems of this approach that arise because a) you don't have much high quality data and b) learning recursively with neural networks is very unstable; this motivates stuff like PPO.
- then you can take a step back, look at deep RL, and re-cast everything in normal ML terms. For example, techniques like TD learning (the term you would have used so far) can be re-cast as simply "data augmentation", which you do in ML all the time.
- at this point you should get in the weeds of actually engineering real RL algos at scale. Stuff like Atari benchmarks. You will find that in reality the algos as learnt are more or less a template and you need lots of problem-specific detailing to actually make them work. And you will also learn engineering tricks that are crucial. This is mostly computer science stuff (increasing throughput on the GPU etc - but correctly! without changing the model assumptions)
- learn goal conditioned RL, imitation learning, some model based RL like alphazero/dreamer after all of the above. You will be able to easily understand it in the overall context at this point. First two are used in robotics quite a bit. You can run a few small robotics benchmarks at this point.
- learn stuff like HRL, offline RL as extras since they are not that practically relevant yet.
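Since the OP asked for worked examples: here is a minimal sketch of the two bullets above about Bellman-equation solving and policy iteration, on a tiny made-up tabular MDP. The transition matrix, rewards and discount factor are toy values chosen only for illustration, not from any particular text.

    import numpy as np

    # Toy tabular MDP: 3 states, 2 actions (made-up numbers, illustration only).
    # P[a, s, s'] = probability of landing in s' after taking action a in state s.
    P = np.array([
        [[0.9, 0.1, 0.0],   # action 0
         [0.0, 0.8, 0.2],
         [0.1, 0.0, 0.9]],
        [[0.2, 0.8, 0.0],   # action 1
         [0.0, 0.1, 0.9],
         [0.5, 0.0, 0.5]],
    ])
    R = np.array([          # R[a, s] = expected immediate reward
        [1.0, 0.0, 0.0],
        [0.0, 2.0, 0.0],
    ])
    gamma = 0.95
    n_states, n_actions = 3, 2

    def policy_evaluation(policy):
        # Solve the Bellman equation v = R_pi + gamma * P_pi v exactly:
        # (I - gamma * P_pi) v = R_pi  -- literally a linear system.
        P_pi = np.array([P[policy[s], s] for s in range(n_states)])
        R_pi = np.array([R[policy[s], s] for s in range(n_states)])
        return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    def policy_iteration():
        # Alternate exact evaluation and greedy improvement until the policy is stable.
        policy = np.zeros(n_states, dtype=int)
        while True:
            v = policy_evaluation(policy)
            # q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * v[s']
            q = R + gamma * P @ v
            new_policy = q.argmax(axis=0)
            if np.array_equal(new_policy, policy):
                return policy, v
            policy = new_policy

    policy, v = policy_iteration()
    print("greedy policy per state:", policy, "state values:", v)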
I am unsure of the next course of action, or whether software will survive another 5 years, and what my career will look like in the future. Seems like I am engaged in the ice trade and they are about to invent the refrigerator.
The way I like to look at it is that I'm engaged in the ice trade and they are about to invent everything else that will end mine and every other current trade. Which leaves me with two practical options: a) deep despair. b) to become a Jack of all trades, master of none, but oftentimes better than a master of one. The Jacks can, for now, capitalize on the thing that the Machines currently lack, which is agency.
Instead of people "hacking" university education to turn it into purely flavor-of-the-month job training centers, the real hack would be something that really drills down on the fundamentals. CS, Math, Physics, and Philosophy, to get an all-around education in approaching problems from fundamentals, would I think be the optimal school experience.
Once the fundamental concepts are understood, what problem is being solved and where the key difficulties are, only then the equations will start to make sense. If you start out with the math, you're making your life unnecessarily hard.
Also, not universally true but directionally true as a rule of thumb: the more equations a text contains, the less likely it is that the author themselves has truly grasped the subject. People who really grasp a subject can usually explain it well in plain language.
That's very much a matter of style. An equation is often the plainest way of expressing something.
Give a physicist an equation from a completely unrelated field of mathematics and it will make zero sense to them, because they lack the context. And vice versa. The only people who can readily read and understand your equations are those who already understand the subject and have learned all the context around the math.
Therefore it's pointless to try to start with the math when you're foreign to a field. It simply won't make any sense without the context.
(One of) The value(s) that a math grad brings is debugging and fixing these ML models when training fails. Many would not have an idea about how to even begin debugging why the trained model is not working so well, let alone how to explore fixes.
- Why didn't the training converge?
- Validation/test errors are great, but why is performance in the wild so poor?
- Why is the model converging so soon?
- Why is this all zero?
- Why is this NaN?
- Model performance is not great; do I need to move to something more complicated, or am I doing something wrong?
- Did the nature of the upstream data change?
- Sometimes this feature is missing; how should I deal with this?
- The training set and the data on which the model will be deployed are different. How do I address this?
- The labelers labelled only the instances that are easy to label, not chosen uniformly from the data. How do I train with such skewed label selection?
- I need to update the model with a few thousand new data points, but not train from scratch. How do I do it?
- The model is too large; which doubles can I replace with float32?
So on and so forth. Many times models are given up on prematurely because the expertise to investigate lackluster performance does not exist in the team.
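For a couple of the questions above (the NaNs, the loss that will not converge), the usual first step is just instrumenting the training loop. A minimal, hypothetical PyTorch sketch of that kind of instrumentation; the model and data are random placeholders, not anyone's real setup:

    import torch
    import torch.nn as nn

    # Hypothetical setup: a tiny regression model on random data, purely to show the checks.
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = torch.randn(256, 10), torch.randn(256, 1)

    for step in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)

        # Check 1: is the loss itself NaN/Inf? (catches overflow, bad targets, etc.)
        if not torch.isfinite(loss):
            raise RuntimeError(f"non-finite loss at step {step}: {loss.item()}")

        loss.backward()

        # Check 2: gradient norm. Exploding or vanishing norms explain a lot of the
        # "why didn't it converge" / "why is everything zero" questions.
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e3)
        if not torch.isfinite(grad_norm):
            raise RuntimeError(f"non-finite gradient norm at step {step}")

        opt.step()
        if step % 50 == 0:
            print(f"step {step:4d}  loss {loss.item():.4f}  grad_norm {grad_norm.item():.4f}")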
But the examples you quoted were not my examples, at least not in their primary causes (the NaNs could be caused by overflow, but that overflow can itself have a deeper cause). The examples I gave have/had very different root causes at play, and the fixes required some facility with maths: not to the extent that you have to be capable of discovering new math, or of something as complicated as the geometry and topology of strings, but nonetheless math at the level of grad school or an advanced and gifted undergrad.
Coming back to the numeric overflow that you mention: I can imagine a software engineer eventually figuring out that overflow was a root cause (sometimes they will not). However, there's quite a gap between recognizing the overflow and, say, the knowledge of numerical analysis that will help guide a fix.
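To pick a standard textbook instance of that gap (just as an illustration, not one of the cases I had in mind): recognizing that exp() overflows is step one; the log-sum-exp rewrite is the kind of fix that numerical analysis hands you.

    import numpy as np

    def logsumexp_naive(x):
        # Overflows: exp(1000) is inf in float64, so the result is inf.
        return np.log(np.sum(np.exp(x)))

    def logsumexp_stable(x):
        # Shift by the max first: log sum exp(x_i) = m + log sum exp(x_i - m).
        # Mathematically identical, but the largest exponent is now exp(0) = 1.
        m = np.max(x)
        return m + np.log(np.sum(np.exp(x - m)))

    x = np.array([1000.0, 1001.0, 1002.0])
    print(logsumexp_naive(x))   # inf (overflow)
    print(logsumexp_stable(x))  # ~1002.41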
You say > "literally every single example"... can be dealt without much math. I would be very keen to learn from you about how to deal with this one, say. Without much math.
> The labelers labelled only the instances that are easy to label, not chosen uniformly from the data. How to train with such skewed label selection (without relabeling properly)?
This is not a gotcha, a genuine curiosity here, because it is always useful to understand a solution different from your own (mine). More often than not this is duplicated work (mathematically speaking) and there is a lot to be gained by sharing advances in either field by running it through a "translation". This has happened many times historically - a lot of the "we met at a cafe and worked it out on a napkin" inventions are exactly that.
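(For context, the kind of approach I have in mind on my own side is inverse-propensity / importance weighting: model the probability that a point got labelled, then reweight the training loss by its inverse. A toy sketch below, with an entirely hypothetical propensity model and synthetic data; validating the propensity estimates is itself where the math lives. I am curious what the less math-heavy route would be.)

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic data: 2000 points, true label depends on both features.
    X = rng.normal(size=(2000, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

    # Skewed labelling: "easy" points (far from the boundary) are much more
    # likely to have been labelled, mimicking labelers picking easy instances.
    margin = np.abs(X[:, 0] + 0.5 * X[:, 1])
    p_labelled = np.clip(margin / margin.max(), 0.05, 1.0)
    labelled = rng.random(len(X)) < p_labelled

    # Step 1: estimate the labelling propensity e(x) = P(labelled | x)
    # from all points (we always know which points got labelled).
    propensity_model = LogisticRegression().fit(X, labelled)
    e_hat = propensity_model.predict_proba(X[labelled])[:, 1]

    # Step 2: fit the classifier on labelled points only, weighting each example
    # by 1 / e_hat so the labelled subsample behaves (in expectation) like a
    # uniform sample from the full data.
    clf_weighted = LogisticRegression().fit(
        X[labelled], y[labelled], sample_weight=1.0 / e_hat
    )
    clf_naive = LogisticRegression().fit(X[labelled], y[labelled])

    print("naive accuracy on all data:   ", clf_naive.score(X, y))
    print("weighted accuracy on all data:", clf_weighted.score(X, y))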
Math proficiency helps a lot at that. The level of abstraction you deal with is naturally high.
Recently, the problem of actually knowing every field well enough, even just cursorily, to make connections has become easier with AI. Modern LLMs do approximate retrieval and still need a planner + verifier; the mathematician can be that.
This is somewhat adjacent to what Terry Tao spoke about, and the setup is sort of what AlphaEvolve does.
You get that impression because such advances are high impact and rare (because they are difficult). Most advances come as a sequence of field-specific assumption, field-specific empirical observation, field-specific theorem, and so on. We only see the advances that are actually made, leading to an observation bias.
So in my specific case I stopped thinking about mathematics as: how to interpret a sequence of symbols
But instead I decided to start thinking about it as “the symbols tell me about the multidimensional topological coordinate space that I need to inhabit”
So now when I look at an equation (or whatever) my first step is “OK how do I turn this into a topology so that I can explore the topological space the way that a number would”
Kind of like if you were to extend Nagel’s “What Is It Like to Be a Bat?” but instead of being a bat you’re a number
There are no Dedekind cuts or Cauchy sequences on digital computers, so the fact that the analytical equations map to algorithms at all is very non-obvious.
For instance we know that algorithms like the leapfrog integrator not only approximate a physical system quite well but even conserve the energy, or rather a quantity that approximates the true energy.
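A quick self-contained way to see that: a leapfrog (velocity Verlet) step on a harmonic oscillator keeps the computed energy oscillating around the true value without drifting, while naive forward Euler blows it up. A small sketch with toy parameters:

    import numpy as np

    # Harmonic oscillator: H(q, p) = p^2/2 + q^2/2 (unit mass and spring constant).
    def force(q):
        return -q

    def leapfrog(q, p, dt, n_steps):
        energies = []
        for _ in range(n_steps):
            p = p + 0.5 * dt * force(q)   # half kick
            q = q + dt * p                # drift
            p = p + 0.5 * dt * force(q)   # half kick
            energies.append(0.5 * p**2 + 0.5 * q**2)
        return np.array(energies)

    def euler(q, p, dt, n_steps):
        energies = []
        for _ in range(n_steps):
            q, p = q + dt * p, p + dt * force(q)  # naive forward Euler
            energies.append(0.5 * p**2 + 0.5 * q**2)
        return np.array(energies)

    E_leap = leapfrog(q=1.0, p=0.0, dt=0.1, n_steps=10_000)
    E_euler = euler(q=1.0, p=0.0, dt=0.1, n_steps=10_000)
    print("true energy: 0.5")
    print("leapfrog energy range:", E_leap.min(), E_leap.max())  # oscillates near 0.5, no drift
    print("euler energy at the end:", E_euler[-1])               # grows without bound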
There are plenty of theorems about the accuracy and other properties of numerical algorithms.
Discretizing e.g. time or space is perhaps a bigger issue, but the issues are usually well understood and mitigated by e.g. advanced numerical integration schemes, discrete-continuous formulations or just cranking up the discretization resolution.
Analytical tools for discrete formulations are usually a lot less developed and don't as easily admit closed-form solutions.
So I guess one might want to do a similar exercise to deriving numerical dispersion [1], for example, in order to see just how discretizing the diffusion process affects it and how that relates to the optimal control theory; a sketch of the stability side of this follows the links below.
[1]: https://en.wikipedia.org/wiki/Numerical_dispersion
[2]: https://en.wikipedia.org/wiki/Courant%E2%80%93Friedrichs%E2%...
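To make the stability point concrete (in the same spirit as [2], though the bound below is the von Neumann condition for diffusion rather than the CFL condition proper): the explicit scheme for u_t = D u_xx is only stable when D*dt/dx^2 <= 1/2, and the discrete solution blows up past that even though the continuous equation is perfectly tame. A toy sketch with made-up parameters:

    import numpy as np

    def diffuse_explicit(u0, D, dx, dt, n_steps):
        # Forward-Euler-in-time, central-difference-in-space scheme for u_t = D u_xx.
        u = u0.copy()
        r = D * dt / dx**2  # the scheme is stable only for r <= 0.5
        for _ in range(n_steps):
            u[1:-1] = u[1:-1] + r * (u[2:] - 2 * u[1:-1] + u[:-2])
        return u

    x = np.linspace(0, 1, 101)
    dx = x[1] - x[0]
    u0 = np.exp(-((x - 0.5) ** 2) / 0.01)  # a Gaussian bump
    D = 1.0

    stable = diffuse_explicit(u0, D, dx, dt=0.4 * dx**2 / D, n_steps=500)
    unstable = diffuse_explicit(u0, D, dx, dt=0.6 * dx**2 / D, n_steps=500)
    print("stable run max:  ", stable.max())            # smooth, decaying profile
    print("unstable run max:", np.abs(unstable).max())  # huge: the grid-scale mode explodes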
I'm not saying that this is the case here, but there generally needs to be some justification to say that a certain result that is proven for a continuous function also holds for some discrete version of it.
For a somewhat famous real-world example, it's not currently known how to produce a version of QM/QFT that works with discrete spacetime coordinates, the attempted discretizations fail to maintain the properties of the continuous equations.
https://en.wikipedia.org/wiki/Finite_difference
I'm not sure about applications of real numbers outside of calculus, and how to replace them there.
If your definition of "algorithm" is "list of instructions", then there is nothing surprising. It's very obvious. The "algorithm" isn't perfect, but a mapping with an error exists.
If your definition of "algorithm" is "error free equivalent of the equations", then the analytical equations do not map to "algorithms". "Algorithms" do not exist.
I mean, your objection is kind of like questioning how a construction material could hold up a building when it is inevitably bound to decay and therefore result in structural collapse. Is it actually holding the entire time or is it slowly collapsing the entire time?