Learning RL
How can I Learn RL?
“Besides your book, what would you suggest for someone interested in learning it? Where to start? What to focus on first?”
It’s the same as anything technical IMO. There is stuff to be learnt, and you can do that by reading. Read all the books and papers you can. But the real learning experience is… experience. Do it for real. Do it in your company. Do it at work. Then and only then do you learn what you need to learn to do your job.
Yeah, that’s the point. Doing your job is nothing like anybody else’s job. I could tell you to do certain projects but it wouldn’t make sense for your unique situation. Your first challenge is finding a problem that is valuable and makes sense for RL. Then work on that. Start simple. Start with software, then ML, then RL. Work your way up.
What you need is an RL-driven learning curriculum that delivers training suited to your unique needs.
You could try the University of Alberta’s RL Specialization on Coursera, and definitely read Sutton and Barto’s book. I highly recommend David Silver’s lectures, and there are other lectures from DeepMind as well. There’s Spinning Up from OpenAI. You could also read Thomas Simonini’s introductory RL blogs. There’s a plethora of beginner and intermediate material out there for RL.
Do I Need To Know How to Program?
“I think one obstacle for learning is the lack of plug-and-play libraries like there are for more classical ML. Even with OpenAI’s Gym, some of the recent books have code that doesn’t compile, because the libraries are still being developed and change too often. It is certainly an obstacle for people who are less experienced in programming.”
Although I have seen people and companies try to do data science without software experience or capabilities, I would say that gaining software engineering experience is as important as gaining ML/RL experience.
Software is the language of applications, so if you want to build usable ML/RL, you need software. Of course this doesn’t apply to everything. E.g. you can just about get by with hosted tools for an analytics project, and larger companies can hire multiple people with different skills (this is the general solution, by the way). But sooner or later you’ll need to code.
I have sympathy for your despair, however. I think the main issue is the complexity. Any complex system has a million ways to fail and it sounds like you’ve found most of them.
I’m Finding it Hard
“I think it is pretty complicated and maybe I am not smart enough. Is the general opinion that it is complicated, or am I alone?”
Yes this is complicated. It all is, for everyone. The interesting thing about ML is that industry is so much closer to research than any other discipline. For example, imagine trying to teach research-level physics at school. It wouldn’t happen. It takes decades to filter down and filter out.
Yes, I think that’s the main problem. The sheer volume. All you can do is keep chipping away and over time you will suddenly discover that reading that equation/understanding that acronym/appreciating related topics becomes much easier. Think of learning like muscle memory. The more you do it the easier it becomes. The less stressful it becomes. Don’t worry if it doesn’t all sink in on the first pass. That’s an indication that it wasn’t written well enough or it went too deep too quick. It’s not your lack of ability.
If you do struggle with something, and it’s important that you understand it (e.g. your job) then look for another source of the same information. It’s likely that different explanations of the same thing will fill in the gaps in your mental model. In life, and in ML, you want to generalize, not overfit!
Force yourself to read it through anyway, and don’t worry if you don’t understand everything. Then read something else. Get a different perspective. Work on it.
RL Usage
Is RL Better than ML?
“Why is reinforcement learning better than other forms of machine learning?”
“Better” depends on your application. Testing, experimentation and evidence will prove whether it’s better. But in general, any application that involves multi-step decision making could be improved by RL. Supervised ML makes one-shot predictions, which are unlikely to be optimal in the long run.
See page 5 in the book.
Examples of Using RL in Production Use Cases
“Do you know of anyone that are using RL in production?”
So, there’s a variety of things that I’ve heard. Some public, some not. Let me try and recall some:
- Covariant AI demoed a super cool RL-driven pick-and-place robot.
- I’ve spoken to engineers that have used RL to improve their recommendations.
- I’ve spoken to leaders that have deployed RL as part of a continuous-learning strategy for their ML models.
- I spoke to another leader that managed to reduce the size of the ML team running their core recommendations algorithm by using RL.
And there are loads of use cases reported in papers. But of course, whether you call that production or not depends on what they are doing. Many are pure research, but lots are research on current production systems. For example, this one from the YouTube team.
More on https://rl-book.com/applications/ and in the book.
Of course if you know anyone that wants to develop production RL algorithms, let me know.
Replacing Teams of Data Scientists with RL
“Can you talk more about the story where RL replaced a team of Data Scientists?”
No, I can’t, I’m afraid; it’s not public knowledge. But this is not another clickbait “we all won’t have jobs next year” story.
To summarise, consider:
a) a team of 10+ highly educated, very expensive, smart people tweaking neural network architectures and running massive, expensive experiments (for example). This is what large tech companies do to improve heavily used, data-intensive systems.
vs.
b) an RL algorithm, with a decent reward function, that trains itself over the long term to optimise the actual business metric the business is keen on improving. RL can match, and with effort surpass, the performance of that team quite quickly.
To be clear, the engineering challenge doesn’t go away, it shifts. Now these people are curators. Guardians of the RL algorithm that is actually doing the number crunching. There’s still a lot of engineering work that goes into building a system like that, but it’s not pure data science any more.
I’m being intentionally vague and speculative here, but you can see it happening.
How is Deploying RL Different to Deploying ML?
“How are they different from “classical” ML and DL models? What are the typical tools for training and deploying?”
First, bear in mind that there isn’t much industrial experience of running RL in production yet. It’s not like ML, where there’s now years’ worth of experience to leverage. But I can speculate.
One of the key issues with RL is state. By definition the MDP loop is constantly evolving: new observations, new models, new actions. In particular, if you’re running an algorithm that is actively learning (most, but not all, implementations), the underlying state of the model (the trained parameters) is changing ALL the time.
One of the definitions of “modern” software is immutability and software that is free of side effects. By definition, an actively learning RL algorithm is mutable and most definitely has side effects!
So over the next few years I predict that there is going to be industrial research (i.e. new frameworks/blog posts/presentations/etc.) into how to run mutable RL algorithms in a robust way. I imagine that under the hood there will be a strategy to do some kind of checkpointing to make it pseudo-immutable.
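To make that concrete, here’s a minimal sketch of what pseudo-immutability might look like; the class and its structure are hypothetical, not taken from any real framework:

```python
import copy
import threading

class CheckpointedPolicy:
    """Training mutates a live policy; serving only ever reads frozen snapshots."""

    def __init__(self, params):
        self._live = params                    # mutated continuously by the learner
        self._snapshot = copy.deepcopy(params)
        self._version = 0
        self._lock = threading.Lock()

    def update(self, new_params):
        # Called from the training loop; never seen directly by serving.
        self._live = new_params

    def checkpoint(self):
        # Periodically freeze the live parameters into a versioned snapshot.
        with self._lock:
            self._snapshot = copy.deepcopy(self._live)
            self._version += 1

    def serve(self):
        # Requests read the frozen snapshot, so each one sees a consistent,
        # versioned set of parameters with no side effects from training.
        with self._lock:
            return self._version, self._snapshot
```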
On the training side, there’s loads. I can’t keep up. I did a review a long time ago and I’ve been meaning to update it. Take your pick.
On the deployment side, less so. Many of the frameworks above have some kind of serving mode, but I get the impression that most people have to roll their own serving infrastructure and tooling.
Simplifying RL Problems
“You noted that many industrial applications could be solved with something as simple as tabular Q-learning. I was wondering if you could elaborate on that with some examples?”
If you’re talking about “many problems can be solved with simple algorithms”, then yes, there is a lot of low-hanging fruit that can be solved with simple algorithms. This comes down to a trade-off between business value and technical difficulty. If it’s valuable, and easy, then that’s the problem you should solve first. If it’s less valuable, but still very easy, then it might still be prioritised higher because it’s easy to solve.
Relating “easy” to RL, what I mean is that the state and action space is simple and the reward obvious. Most of the time you can simplify the problem too: bite off a smaller chunk of it. For example, imagine you were Amazon and you were trying to create RL algorithms to restock your warehouse. Yes, you could try to view the warehouse as the environment and the products as part of the state, but that would be a massively complex problem. Instead, why not say that a box is the environment, and the individual items in that box are the state? That’s much easier to solve.
See what I mean? We’d have to get into domains to be more specific than that.
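To make the intuition concrete, here’s some back-of-envelope arithmetic; all the numbers are made up:

```python
# Why scoping the problem to one box shrinks the state space.
stock_levels = 5          # discretised stock level per item
warehouse_items = 10_000  # items if the whole warehouse is the environment
box_items = 10            # items if a single box is the environment

warehouse_states = stock_levels ** warehouse_items  # astronomically large
box_states = stock_levels ** box_items              # 9,765,625: tractable
print(f"{box_states:,}")
```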
RL for Auto-ML
“I am curious as to why RL techniques are not widely used as a means to improve on supervised learning problems”
Why? I guess it’s some complex combination of attention, media exposure, ease of use, advice, reading, and something with the word OpenAI or Google in the name.
I mean, it’s there, it’s possible. Maybe it’s just waiting for someone to wrap it or market it better than the last person? Hint hint, nudge nudge. If you have a spare 6 months on your hands.
To be fair there are things out there already. For example I’ve used Optuna for hyperparameter optimisation, which has an RL solution in there. But they’re not selling the fact that it’s using RL. They’re selling the fact that it automatically does hyperparameter tuning for you.
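For illustration, here’s a minimal Optuna sketch. The objective and search space are stand-ins; in practice the objective would train a model with the suggested hyperparameters and return a validation score:

```python
import optuna

def objective(trial):
    # Hypothetical search space.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    layers = trial.suggest_int("layers", 1, 4)
    # Pretend validation loss; really you'd train and evaluate here.
    return (lr - 0.01) ** 2 + layers * 0.001

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```

Notice there’s no mention of how the search is driven; that’s the point.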
Same with Kubeflow’s Katib. That has an RL mode too.
That’s the thing about engineering in general. People don’t care how the sausage is made. It’s the product that counts. And it’s why UI engineers take all the glory!
Simulations of Business Use Cases
“What would be an example of an environment with which one can experiment at home? I have neither a robotic hand at home nor trading partners willing to start bidding wars. Card/text/video games are covered in much detail in the books. It would be more interesting to play with something resembling a commercial use case.”
The world really is your oyster here. You can create your own in a domain that you want more experience in (that’s a great way to gain experience). Or you can search through the thousands of gyms other people have created.
For example:
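Here’s a minimal sketch of a commercial-flavoured environment using the classic Gym interface (newer Gymnasium releases differ slightly). The dynamic-pricing domain and its demand model are hypothetical toys:

```python
import numpy as np
import gym
from gym import spaces

class PricingEnv(gym.Env):
    """Toy dynamic pricing: adjust a price each week, observe noisy demand."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(3)  # 0: drop, 1: hold, 2: raise price
        # Observation: [current price, last period's demand]
        self.observation_space = spaces.Box(
            low=0.0, high=np.inf, shape=(2,), dtype=np.float32)

    def reset(self):
        self.price, self.demand, self.t = 10.0, 100.0, 0
        return np.array([self.price, self.demand], dtype=np.float32)

    def step(self, action):
        self.price = max(1.0, self.price + (action - 1) * 1.0)
        # Hypothetical linear demand curve with noise.
        self.demand = max(0.0, 150.0 - 10.0 * self.price + np.random.randn() * 5.0)
        reward = self.price * self.demand   # revenue this period
        self.t += 1
        done = self.t >= 52                 # one year of weekly pricing
        obs = np.array([self.price, self.demand], dtype=np.float32)
        return obs, reward, done, {}
```

Swap in whatever domain you want experience in; the interface stays the same.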
RL In Industry
Which use cases currently solved by ML are better solved by RL?
“What are the most common use cases in industry where problems are framed as supervised learning (or ranking) problems, but you would reframe them as RL problems?”
Really great question and one that deserves a much more comprehensive and evidence-based answer.
But, if I had to try and fit it in a chat window…
I’d summarise the dilemma by reminding you of the Markov Decision Process (MDP - page 35 of the book).
If you have an environment with state that can be mutated, if that state can be observed, if you can alter the state through your agent’s actions, and if you have a business problem where it pays to move the environment into a certain state, then by definition you have an RL problem.
To the first part of your question, common use cases masquerading as supervised ML… any recommendations task. I think that’s broad enough for you! I would suggest that the vast majority of cases where people use recommendations are optimising for the wrong thing. The goal is to help the user find things as easily as possible so that they value the functionality and keep coming back/buying unnecessary plastic stuff. A standard solution (I’m grossly simplifying here) would build a model, in a supervised manner, that maps user intent to products, quantified by click-through rate or something.
That’s entirely the wrong metric. You could use RL and train over full customer lifecycles. You could train on raw profit, or the amount of time individual users spend on the site, or whatever is most applicable for your problem. So the action is the recommendation (lots of research available on this). The environment is the user and possibly the business/products. The observation is the product catalogue, user demographics, past history, contextual information, the weather, etc. The reward is customer lifetime value or whatever.
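As a toy illustration of the difference in framing (the numbers and helper functions are hypothetical):

```python
GAMMA = 0.99

def ctr_label(clicked):
    # Supervised framing: one immediate target per impression.
    return 1.0 if clicked else 0.0

def discounted_return(rewards, gamma=GAMMA):
    # RL framing: credit a whole trajectory, e.g. a customer lifecycle
    # of visits and purchases, rather than a single click.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A customer who buys twice over a long relationship:
print(discounted_return([0.0, 5.0, 0.0, 12.0]))
```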
Look up any of the RL recommendations papers for an academic argument as to why RL is better suited.
Is RL Mandatory in certain fields?
“Is reinforcement learning considered a crucial approach in robotics (or do you have an opinion on its use for this)?”
Crucial. Hmmm. Depends on how you define the word. I wouldn’t say it’s CRUCIAL, in capital letters, no. You can create perfectly adequate solutions using simple stuff like PID controllers and inverse kinematics.
The threshold is complexity. Once you need to do something remotely complex, more complex than just “move to coordinates x,y”, or as soon as it involves a non-trivial number of interacting components, then yes, RL is probably necessary. But I think that’s missing the point slightly. The great thing about RL is the interface: the MDP. It’s a way of defining problems, not solutions. And it can be applied to any project, simple or complex. If the interface is the same then you can use the same processes, the same techniques, to solve a wide variety of problems. It scales from simple to mind-bendingly complex; very few ML techniques can say the same.
For example, if you worked for a robotics company and you sold a bomb-disposal robot and a floor-cleaning robot, you’d have to develop completely different architectures, systems, code, solutions, etc. But if you’re using RL, it’s the same. Define the environment, define what you’re trying to do, try lots of actions and learn which ones maximise the reward.
Why Is Adoption Low
“What factors have prevented the wide adoption of Reinforcement Learning in the industry?”
Good question. Probably just a combination of time, media exposure, market size, processing power, low-hanging fruit.
But statistics, and therefore ML, has existed for centuries. Only recently (the last decade, maybe) has ML “taken off”. So you could argue that it took 200 years for ML to be adopted.
RL originated around the 1990s, so wait until 2190, then ask your question again.
So the real answer is market size and perception. It goes like this:
IT -> Software -> ML -> DL -> RL.
Because they are applied by/for:
Everyone -> Companies -> One-shot decisions -> Complex decisions -> Strategic/long-term decisions.
Each time you’re reducing the market size. And when you do that, you reduce media exposure. So it might seem like ML has been adopted and RL hasn’t, but in fact the market is just smaller. When normalised, the perceived adoption is about the same.
With that said, I do think you’re right, it’s not been adopted yet. Mainly because there’s a lack of books like mine and well defined use cases. We’ll get there…
Who Should be Pushing RL Adoption?
“Given that this is such a technical domain, who should be pushing for RL adoption?”
Like most things in life, I suspect there’s no easy or right answer. I’m no expert in management, but I think product owners or product managers should be steering product development, with decisions agreed and discussed as a team. Ideas, solutions, metrics, everything, have to be defined by “the team”, because no one person can know everything and get everything right.
I have the same argument with people that have the word “architect” in their title.
RL in 3-5 Years
“How do you see the field of RL fold out in the next 3-5 years?”
The correct answer to this is probably more boring than you were hoping for.
It will expand, it will get used more. It will become easier to use, and it will become more obvious where to use it (because you’ll be able to use off-the-shelf open-source solutions).
Then RLOps will become a thing.
Then people will perceive it as being “adopted”.
Then something else will take the limelight.
If I put my marketing hat on it would sound similar except with more hyperbole!
The Human-RL Algorithm Interface
“Do you foresee any interplay between reinforcement learning for biological/machine entities in the future?”
Great question. I think the answer depends on how deep you want to go.
At a superficial level, yes, definitely; health apps in particular. RL-driven, personalised nudges to help you lose weight, get fit, learn new subjects, etc. are an obvious use case.
At a slightly deeper level, the introduction of RL in core front-line healthcare, like personalised medicine, shows strong signs.
But at the full-on I’ve-had-too-many-beers-deep level, you could imagine RL providing “life” strategies. Like a personalised, optimal route to getting a job that you want. Or “automated relationships”.
Haha. I need that. Imagine not having to remember anniversaries, the perfect present automatically ordered.
And in psychological wellness. Yes, definitely. “Hi Dave, you look sad Dave.” - Red Dwarf
Does RL Need a Rebrand
“Do you think that the field needs a new name or branding?”
Not necessarily a new name, no. But I would like to see RL become more mainstream; part of the ML toolbox. I’d like to be at the point where people say (at the most general level) “we’re working on a data project and we might need to dip into our toolbox, rummage around, and we might need to pick RL for the job”.
And the term RL covers quite a small spectrum of techniques. You could use the names of the sub-techniques too if you want to be more specific (e.g. imitation RL, inverse RL, curriculum RL, etc.).
“Product claims” are powered by marketing/advertising. So that’s entirely driven by marketing departments. I’d suspect at some point someone in marketing will hear the term, go “oh that’s cool, is that like AI?” and then they’ll run with it. “The first app to use RL…”
Then there will be a domino effect, then users will get confused/annoyed, and then marketers will stop using it again and move on to the next thing.
This is why I tend to avoid predicting marketing hype cycles: they are so fickle. The core technologies and concepts are useful in certain applications, and that is why RL will stick around for a long time.
Industries Affected by RL
“Which industries do you see being most affected by advancements in reinforcement learning?”
For reference, see pages 5-7 of the book.
“Industry” is a tricky word because it is broad and outdated. It’s similar to asking which industries could make use of software. Of course, all of them could.
There are opportunities everywhere.
With that said, it’s a valid question. So far, robotics seems to be the number one use case, simply because it’s hard to derive control programs for complex tasks; it’s easier to learn them.
Pricing/bidding/recommendations/advertising/etc. are largely similar tasks and have also had a lot of press.
The finance industry is going to be a big user. I’ve already spoken to people who are using it.
Healthcare and specifically personalised medicine is a perfect match, although the regulatory requirements are likely to prevent this from taking off.
The tech industry can leverage it to a much greater extent for automation. E.g. ML, auto-ML, neural architecture search, etc. Lots of mundane automation like Alexa, email management, etc.
And lots more…
Industry Specific Adoption
“How do you see the adoption of RL by industry?”
RL is nascent at this point, but it is moving. I don’t think it will be anywhere near as big as the generic ML/analytics industry, which in turn isn’t as big as the software industry.
But as you probably know already, these are just tools in your tool belt. The trick is to pick the right tool for the job.
In terms of adoption, I think it’s being adopted already. It’s just a matter of size. I think it will cascade as more “normal” use cases come into popular industrial culture. And as frameworks/libraries start to offer easy to use and robust RL serving, natively.
In short, we’re fighting against low-hanging fruit here. Quite often something very simple is good enough and/or better than nothing. It takes quite a lot to jump through the hoops from full ML to full RL.
This probably means that it’s going to be larger companies that adopt first. Smaller ones (at least in the non-tech industry) will probably have to wait.
Yeah, to be clear, I see RL taking a slice of the ML industry. So RL depends on the underlying size of the ML and software industries.
RL in Healthcare
“What you said about healthcare is interesting, why would regulatory requirements prevent reinforcement learning from improving things there?”
Healthcare == people’s lives. So there are lots of rules and regulations to prevent accidents, which means there’s a very high barrier to entry. (I’m talking from a UK/EU perspective, by the way; there may be fewer regulations in, say, the US.)
RL Tips and Tricks
Debugging
“What are your top tips for debugging RL algos?”
Check out chapter 11 for more detail on this.
Here are some random thoughts off the top of my head:
- Visualise what is going on (like any data-related task).
- If you are given the environment, start with the simplest algorithm and work up (e.g. random/CEM).
- If you have control over the environment/simulation, make that as simple as possible and solve that first. Then make the environment/simulation more complex.
- Split the tech. If you’re working with deep models, attempt to decouple the training of the deep NN from the RL. Not always optimal, but it makes development much easier. For example, use autoencoders: train the autoencoder first and verify it works, then pass the much lower-dimensional state into the RL algo. It will train much faster (possibly less optimally) and it will be easier to figure out issues (there’s a sketch of this after the list).
- Split the problem. Try and halve the problem. Halve it again. Solve each quarter independently.
- Consider hierarchical policies (similar to the previous point). If you can manually design the hierarchy, even better for understanding/explainability. But you can automate that process too.
- Good old debugging techniques: print statements are your friend.
- Assert expected array sizes.
- Don’t overcomplicate the reward function.
And more and more…
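On the “split the tech” point above, here’s a hedged sketch of the autoencoder decoupling idea. It uses PyTorch, and the observation/latent sizes are hypothetical:

```python
import torch
import torch.nn as nn

# Stage 1: train an autoencoder on raw observations, entirely outside the RL loop.
class Autoencoder(nn.Module):
    def __init__(self, obs_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, obs_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

ae = Autoencoder()
optimiser = torch.optim.Adam(ae.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
# ... train on logged observations and verify the reconstructions look sane ...

# Stage 2: freeze the encoder and feed the low-dimensional code to the RL algorithm.
with torch.no_grad():
    obs = torch.randn(1, 784)   # stand-in for a real observation
    state = ae.encoder(obs)     # a 16-dimensional state for the RL agent
```

Because the representation is verified independently, any remaining problems are much more likely to be in the RL part.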
RL Algorithms
What is Tabular Q-Learning
If your question is actually “what is tabular Q-learning?”: Q-learning is a simple RL algorithm, and tabular means “use look-up tables to store the Q-values”.
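Here’s a minimal sketch of what that looks like in code; the number of actions, the hyperparameters, and the gym-style environment interface are all assumptions:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1  # hypothetical hyperparameters
N_ACTIONS = 4

# The "table": a mapping from state to a list of Q-values, one per action.
Q = defaultdict(lambda: [0.0] * N_ACTIONS)

def act(state):
    # Epsilon-greedy: mostly exploit the table, sometimes explore.
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

def update(s, a, r, s_next, done):
    # One-step Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = r if done else r + GAMMA * max(Q[s_next])
    Q[s][a] += ALPHA * (target - Q[s][a])
```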
Which Algorithms Are The Most Popular?
“Apart from multi-armed bandits, what are the other RL techniques that are getting wide adoption in the industry?”
Good question, but it’s hard to obtain any real numbers on this. From my research/reading, most people tend to follow the media. If a particular algorithm gets media attention then it becomes quite popular in the frameworks, which then leads to adoption.
In general though, the tried-and-tested, simple models tend to remain the most popular: from basic Q-learning based algorithms to popular actor-critic algorithms like SAC.
There’s no one-size-fits-all “best algo” though; like in the rest of ML, the “no free lunch” theorem applies. So you have to evaluate and experiment for your particular application.
What Does The “Best” Action Mean?
“Once the agent is trained and the Q-table is filled out, does the agent make a decision at each step by picking the highest Q-value? The sentence that keeps appearing is that the agent makes its own policy. I have programmed my agent to always pick the greatest value in the Q-table at each turn of a chess game. Is that the correct approach?”
This is dependent on the algorithm and the output action.
For example, in a policy gradient algorithm with a continuous action, most implementations suggest that the agent learns the optimal mean and standard deviation, and the “best” action is then sampled from that normal distribution.
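A minimal sketch of that sampling step (the mean and log-std values here are made up; in a real agent they would come from the policy network):

```python
import numpy as np

def sample_action(mean, log_std):
    # The policy head outputs a mean and log-standard-deviation per action
    # dimension; acting means sampling from the resulting Gaussian.
    std = np.exp(log_std)
    return mean + std * np.random.randn(*np.shape(mean))

action = sample_action(np.array([0.2, -0.1]), np.array([-1.0, -1.0]))
```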
In Q-learning, in discrete action problems, picking the action with the highest Q-value (the value of the state-action pair) is selecting the “best” action.
I keep writing “best” in quotes because of course you don’t really know that it is best, just that when the agent visited that state-action pair in the past, the outcome was on average good.
So to answer your question, yes, you are correct. Take the max. But know that this is dependent on the algorithm.
And yes, you have programmed your agent to pick the best action, but you haven’t told it what the best action is. It learns that for itself from experience.
I can’t think of an example where you would not want to pick the best action. A robot from r/shittyrobots, perhaps? :-D
Interesting thought experiment, though. I have seen some algorithms pick the “opposite” action some proportion of the time, as a counterfactual test. But that’s more about the exploration problem, not what you’re asking about.
Application of RL
In Relation to MLOps
“I know you also have a lot of interest in MLOps. Is there any connection between it and RL in your work?”
Great question.
You’re right. I am very interested and we’ve gained a lot of experience delivering MLOps projects.
The connection to RL is the operational part: RLOps, if you will. Just like in ML, data scientists probably aren’t that interested in spending massive amounts of time messing about with infra/tooling. Their job and responsibility is extracting value from data, not building supporting infrastructure.
The same is true in RL too. The value is delivering the algorithm that optimises the business metric. The Ops part is irrelevant. The business doesn’t care how it happens, just that it does.
But the business certainly does care how long it takes and whether it is operationally viable. They’d be the first to complain if it breaks. So there is value in the supporting tech/infra, but it’s not directly tied to the business objective. The value is “making it easier for other people to do their job”.
Since RL is hard to do well, and very difficult to operationalise/productionise, RLOps certainly has a very important role to play.
Motion Capture and RL
“Could it be possible to improve the performance of an RL agent doing humanoid motions by virtual demonstrations of a person wearing a mocap suit?”
100% yes. This is a perfect example of where behaviour cloning/imitation RL (see chapter 8 in the book) will be useful. In fact, this reminds me of a paper I read a while ago. Here: https://bair.berkeley.edu/blog/2020/04/03/laikago/
(The gif in that post shows: 1) the motion-capture reference, 2) the result without imitation RL, 3) the result with imitation RL.)
About the Book
As a Textbook
“Can this book be treated as a primary textbook of Reinforcement Learning or a reference book for studying Reinforcement Learning?”
Sutton and Barto’s seminal book on Reinforcement Learning is a fantastic academic resource. My book is focussed towards industry. I cover more modern algorithms and talk a LOT more about how to do RL in industry. Sutton/Barto’s book is more formal, has a lot more maths, talks less about industrial concerns. In short, Sutton/Barto’s is a textbook. Mine is an O’Reilly book.
Many people have read Sutton and Barto’s book but are looking for an alternative. I’d suggest that most engineers in industry will probably struggle a bit with Sutton and Barto’s because it’s too academic. I’d definitely recommend reading mine first, then graduating to Sutton and Barto’s for all the gory mathematics.