## Transcript

Mitchell: Welcome to this talk on scaling and optimizing the training of predictive models. My name is Nicholas Mitchell. I'm a machine learning engineer at Argo AI, working on the perception stack for autonomous vehicles. I'm also a researcher at the University of Augsburg in Germany. This presentation will cover some work that I completed earlier this year. There are a lot of moving parts, so we'll stay quite high level.

## Challenges in Modern ML

Let's begin by having a brief look at the challenges of modern machine learning projects, especially those in industry. It's common knowledge that data sets keep growing as we get better sensors and more sensors. This means compute and storage need to grow with any solution that we might come up with. The complexities around the models and the size of the complexity keeps growing, especially when we think that modern machine learning normally means deep learning. These are being improved year on year at drastic rates, so time to optimize these models, generally also increases as we have bigger models and more intricate models. There's a cost tradeoff to be had when we keep in mind that we're in industry. I think about the design decisions for scale, there's a need for efficiency and for flexibility. We need things that are going to work for quite a long time. We need this flexibility to allow us to be faster if we need to be faster, or to maybe save money if we need to be a bit more frugal with our training resources. There are often production constraints. Maybe we need a model for a specific task and we have a specific budget in terms of time or in terms of compute, or in terms of money, so we need to keep that in mind.

## Who Might Benefit?

Given these challenges, what do I talk about in this presentation and who might benefit? I assume you have a working model, so my model is training. I assume you have the problems, however, that you have large data sets that may continue growing and so you might want to think about how to distribute this training. What are some tools for that? You might want to also keep fine-tuning your model either on new data coming in, or you want to keep tweaking the model so you want to know how to tinker with hyperparameters of a model to squeeze the last drop of performance out of it. We also want to do it efficiently, so we need to think of a way that allows us to change our approach over time without reinventing the whole training pipeline.

## The Star of the Show

We're talking about machine learning. The star of the show here in this story is, of course, the predictive model itself. However, for the sake of this talk, let's keep it generic when we think about the model. Let's think about something that takes an input and returns an output, so really just any predictive model. A specific goal that you have when you're also approaching one of these projects might mean changing some parts of the pipeline, and ways to optimize that model. Again, the model itself is not the focus of this talk, and we'll maybe generalize to some common models later on.

## The Supporting Characters

Let's look at the supporting characters of the story. There are three of them. When we do modern machine learning the compute is a big aspect of it. This is the physical hardware, or rather, how to get access to it. The second part is then the orchestrator, so once we start thinking about a lot of data in a distributed setting, we need someone to hook everything together for us. That's the orchestrator. Then we have the brains. This is going to be the optimizer which is going to be approaching the question of how to deal with all of these hyperparameters. For each of these, here are some of the tools that I'm going to refer to, the ones that I've worked with. I'll explain compute and orchestrator at a high level. I assume a lot of software engineers are fairly familiar with these tools, or a variant of these tools. I'll spend a little bit more time explaining the brains, this AutoML, and how the optimization could work for a machine learning model, distributed setting.

## The Compute

The compute, this is perhaps the easiest decision. It might already be decided for you, but based on your company or based on your location, you might already be using AWS, Azure, or Google Cloud. One thing if you are at the very beginning, and you still have the chance to think about it, is the on-premise versus cloud discussion. If you are, for example, a small company, it might be worthwhile actually investing in a couple of good computers to have in your office, as opposed to using the cloud. There's a reference at the end that will maybe help you decide if you're still in the early stages. Beyond picking a provider, probably one of those in the top right, it's also important to then look at the type of instances that they offer for their cloud computing, which hardware is available. What you need for your specific model. If you need a lot of CPU power to quickly churn through some lightweight models, or whether you need big, heavy GPUs for training one big monolithic neural network, for example.

## The Orchestrators

The orchestrators, these two, Docker and Kubernetes, they together offer quite a lot. First looking at Docker, we can think of Docker as a way of managing our dependencies. Everything is configured in a file, and that can be checked into a Git repository. It can be shared so we can get some reproducibility and also easy collaboration, because we just send a Docker file to somebody and they can exactly repeat our experiment, given the source code. It also allows a fine-grained interface to the hardware. Through the Docker API, it's possible to also constrain, for example, the number of CPU cores or the number of GPUs that are visible to a task. It's quite key when you come to a distributed setting, and we might want to, for example, use several of the GPUs on one physical machine. We can make sure that each instance of a model that's training only sees what it needs.

Kubernetes, one which Docker really sits, it's a way to efficiently scale and manage different services. In our case, one service might be training our model one time. There's a way to bring those all together, and allow automatic scaling, so more or less compute can be configured automatically through Kubernetes. It offers some fault tolerance, so if something goes wrong, it can restart it. It also allows us some way to introduce networking and allow our jobs to communicate with each other, and maybe only certain jobs need to speak to other certain services in your training pipeline. Kubernetes can manage all of that for you.

## The Brains

The third character here, the brains. We can treat this as a black box. If we do so then it will be this sentence here. You would say, "I tried this configuration for my model and this was what the model did. That was the result. Please tell me, what should I do next?" The AutoML team in Freiburg, Hannover, the two universities in Germany, they have a range of optimization tools. The one I'm going to be talking about is based on these components, hyperband and Bayesian optimization. Hyperband is a way to efficiently try a lot of configurations. It's like doing a random parameter search, but very systematically and less randomly breaking it down. We'll go into that. The Bayesian optimization here, the definition straight from Wikipedia, so usually employed to optimize expensive to evaluate functions. What is more expensive than a modern ResNet-101 model? These come together, and the implementation is the BOHB - HyperBandster. This is essentially the combination of hyperband and Bayesian optimization. They're important concepts.

## Hyperband

Let's look at the first concept, which is hyperband, and how that works, how we might use it. Let's imagine we begin here with a graph. This is looking at our model when we're training. On the y-axis here, we have an error metric, so this can be the loss of the model, for example. We can see on this axis that lower is then better, of course. On the x-axis, we have time, and the time here is how long the model is training for. This can be the wall time for example of a model. That's generally equivalent to money because we're paying for our compute by time, usually.

If we were to train naively, what we would generally do is a parameter grid search, for example, or a random search. What this might end up with is, we start training a model. We see, over time, we have a loss curve, which comes down and we end up here, so not quite enough target area. We try a lot of these random search fashion, and we have some of this. Maybe one converges quickly and ends up doing terribly. Maybe some just are terrible in the beginning. Perhaps there's a lot, which just have this gray zone where models form bad convergence, or would need a lot of time, a lot more data. In the end, we found a fairly good solution here, not quite in our target area, but not bad. However, if we think about the total cost of doing this is equivalent to the length of all of these lines. Every line there is time spent training, time on compute. This is a lot of black ink here on the paper.

Let's start the graph again, and this time let's explain what hyperband does. Hyperband first looks at your time, and you would define a budget and how much time you're going to allow. The hyperband would then proceed to break down the time into three rounds. We're going to have round one, round two, and round three. This is all configurable. We would begin by training some models as was done before, maybe by random selection or parameter grid. We have some loss curves as before, terrible one. One that converged very quickly. One that was going quite well here. One quite slowly, and maybe we get, again, all of this noise here. What hyperband would then do is look at these and take the most promising only to continue into round two. There's a culling of the weaker models. We would say this one is good, this one is good. Maybe let's take the best three, the best lower proportion. Then, let's just take a couple of random ones here and here, selected from this mass just because they might turn out well.

Then with these five, we continue. This one doesn't do so well. This one continues. This one continues, maybe. This one goes down, comes the best. This one just pans out. We do the same again, here. We take maybe this one and this one, and then maybe randomly select one of these two. In the final step, we continue training down. We end up with here, one, which is in our target area. The target area is generic, but we've reached quite a good performance here. If we look back and think about the line lengths, if we were to add up the length of all these lines, it's actually way less than in the previous case. We've reached here a better solution in this fictive example, but we've spent a lot less compute. That's the main idea behind hyperband, so quicker convergence.

## Bayesian Optimization

Let's go back and just put that together with Bayesian optimization. This is the graph straight from the paper about this, BOHB. We can see here that from the orange lines, hyperband on its own has a very quick convergence of 20x speedup at the beginning compared to a random search. At the end, it doesn't perform so well, however. The red line on the other hand, the Bayesian optimization is very slow to converge at the beginning, but reaches a really nice performance towards the end after training for a long time. The green line here is simply the combination of those two, so we've got the BOHB, getting the quick convergence of hyperband, and then Bayesian optimization coming in strong at the end to squeeze out the last performance points. We have a well-trained model here, and we've given a lead to a random search, we've got a 55 times speedup. That's the optimizer. We'll see now how that fits into the broader scheme here.

## The Basic Loop

The basic training loop we see here for training a machine learning model, for example. In the middle, we start with the model. We've defined a model which needs an input and gives an output. There's a configuration, which normally says how the model should be run, should be executed. It can hold things like learning rates for neural networks, or it can hold a number of trees in random forests. The other thing that the model needs access to is, of course, the data. Those two flow into the model, the model trains. There is some performance metric, how well is the model doing, and this loop continues. The model is then maybe tweaked, or it gets feedback, and it continues training until we're happy. At the end, we have this gray line. At that point, that's where we would probably take our model and go into production, and deploy it and get some real world feedback. Then this basic loop would continue making any tweaks you see fit.

## The Better Loop

Bring in the tools that I spoke about earlier, we can augment this basic loop to the better loop. Here, the only additions are those towards the top. We have an optimizer in the picture now, and we can see, all of this is encompassed with Docker and Kubernetes. Then running, in this example, on AWS. In this case, the model, again, needs to receive a configuration and have access to the data. It's the optimizer that's controlling the configuration. The optimizer here, the BOHB sees a configuration defined, a configuration space. It will in the first round, select some configurations, a couple of those, and it wants the model to train on those configurations. Then the performance metrics will be fed back to the optimizer which can then say, "Based on these performance metrics I'm seeing, which configurations should I choose next to optimize? Which ones should I definitely avoid?" It's a smarter way to search the configuration space.

Taking now Docker, Kubernetes, and scalability of compute into account, what we notice is the optimizer says, "I want to try these n configurations, maybe 5, maybe 20. Then I wait for the results of all of those to come back." That's a very parallelizable problem, really. What we can do here very easily with Kubernetes, is just say, I'm going to have five models. This could be five instances on AWS, each of which gets a configuration. It gets the Docker image to run a model with that configuration, and five results are then sent out and come back to the optimizer. This can continue. Now we've just basically cut the time required by five. The optimizer can optimize more quickly because it's getting feedback more quickly. That's the general overview of how these things fit together. It's also a good benefit of this abstraction here. We have a Docker which can run a model. The model is something that is generic, so something that takes some configuration and returns some performance metric to be optimized. This model can really be anything. This setup here with the scalability, and this optimizer can basically optimize any model that can return an error metric. This doesn't have to be just a neural network. It doesn't even have to be just a machine learning problem. It could be anything. It could be the analytical solvers used for motion planning, for example, in robotics.

## An Example Run

One example with some real numbers just to get a feel for it. I've trained one large model, for example, with 80 GPUs, which was 10, 8 GPU instances. I said to the optimizer that it can have three full training cycles. For example, if the model needed 100 epochs to train 100 times through the data, I said to the optimizer, you're going to have 300 epochs budget and use them as you see fit. I had a 10 dimensional hyperparameter grid. That means 10 different parameters were being searched at the same time. This ran for a total of three days in this case with that compute. Another thing that you could think about here of reusing the schematic in the slide before, is if we were to just provide one configuration, so the optimizer really had nothing to do, we would still really be able to take advantage of the scalability. We could automatically scale up our training to just use 8 GPUs or 100 GPUs, depending on if we have priority to train our model quickly, or not. It's another flexibility that is offered to the machine learning engineer in this case.

## Open-source Software FTW

All of this was done with open-source software, which is amazing in my opinion. The main ones here are, I used Python for all of the scripting, and then Docker, Kubernetes, AutoML, all based on Ubuntu. I used PyCharm. All of that was free. I used TensorFlow here for some of the neural network components. Another tool to help multi-GPU training of models is Horovod. I recommend people look at that if they need multi-GPU training. For logging of all of these experiments, there's the ML-Flow library, which could then log the progress for example of each of the configurations that were being trained, so I didn't need to wait to the end to see results. I could see that in real-time while they were all training.

**See more presentations with transcripts**

## Community comments