This article is part of a series on Production ML: the Production ML Series of Articles.
Finally, we are at the last stage of our Production Machine Learning journey. Well, almost the last one. We will also cover model monitoring and evolution automation in a couple of further articles.
First of all, I want to thank you for following along all the way here if you have been reading the previous articles. And if not, it makes sense to look through the earlier articles in this series to get better context.
Today we will talk about ways to make your model produce what it was created for: results. You will need a trained model and a good understanding of the functional and non-functional requirements for your model's performance.
Requirements
If you are (or were) a Business Analyst or a Project Manager, I think you know what those are. But if not, no worries, it is pretty simple.
Functional requirements describe what your application will do, like the features and functionality.
This is what most of us mean when saying requirements. One important thing to remember when discussing ML application requirements is that it is an Application, which can be a lot more than just a model. At a bare minimum, your application will have to be able to:
- receive data in real time and/or in bulk,
- pre-process and prepare the data,
- feed data to the model and receive results,
- interpret results and send them back to the requestor.
The four points above describe a very basic REST API application. Usually, there is much more to it. For example, in high-load systems, your application will also have to manage request queues, priorities, complex requests, error handling, and much more. Ideally, you would have thought about functional requirements before you started the journey of building the model, but now is an excellent time to check whether anything has changed. In my experience, it is very common that new functionality is added or existing functionality has changed by the time we get to the deployment phase.
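To make those four steps concrete, here is a minimal sketch of such an endpoint. It assumes a scikit-learn model serialized with joblib; the file name, route, and payload shape are made up for the example and will differ in your application.

```python
# minimal_inference_api.py - a bare-bones prediction endpoint (illustrative only)
import joblib
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

# Assumption: a trained scikit-learn model was saved as model.joblib during training.
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                          # 1. receive data
    features = np.array(payload["features"], ndmin=2)     # 2. pre-process and prepare
    prediction = model.predict(features)                  # 3. feed data to the model
    return jsonify({"prediction": prediction.tolist()})   # 4. interpret and return

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A real production service would add input validation, batching, authentication, and error handling on top of this skeleton.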
Non-functional requirements usually describe how your application should perform its duties.
Functional requirements usually define data and model architecture, whereas non-functional requirements, most of the time, define deployment and infrastructure setup. They determine how fast an application should work, how many requests per second it should be able to serve, and set other requirements for our app's security, reliability, availability, and scalability.
Your non-functional requirements will help us define the best way of deploying our model.
Deployment options
Depending on the requirements (primarily non-functional), we can deploy our model to:
- Virtual Machine or VM scale-set;
- Containerized environments like ECS, EKS, etc.;
- Serverless environments provided by cloud (SageMaker, Vertex, Azure ML, etc.);
- Edge devices (mobile phones, tablets, drones, specialized equipment, etc.).
A Virtual Machine is a viable and straightforward solution that will allow you to start fast. Later you can create a Machine Image from it and deploy a VM scale set behind a Load Balancer. Many production systems work this way today. The disadvantage of such a deployment architecture is a fairly complex CI/CD and update process. When you change your model or application sources, you must create a new Image and roll it out gradually to the Scale Set. Another complexity factor is that you will have to perform system updates and patch management yourself. Modern cloud environments have simplified this to the extent that you only manage update domains in Availability Sets and schedule patches and updates accordingly. You will also have to make sure your application is compatible with new versions of libraries and tools, which can be pretty cumbersome. You can also simplify CI/CD with boot scripts and some magic, but these are all topics for other articles.
Containerization will help you avoid the patching and software management overhead, as it encapsulates all necessary libraries and packages into the container with your application. It is also a great help when your application is decoupled. Orchestrators like Kubernetes make deployment and management of decoupled applications much easier. This allows you to focus on what is important and leave the rest to the orchestrator and its configuration, making deployment and management of a highly complex application much easier. It also makes the application more efficient in terms of performance, scalability, and reliability. I deliberately did not add security to the list because, in the Kubernetes case, you will have to pay specific attention to security risks. Again, more on this in the following articles. As a rule of thumb, I would advise you to go for containers whenever possible, provided you have the required knowledge (or a specialist who can do it for you).
Serverless is the next level of abstraction, provided mainly by cloud providers. It manages all infrastructure so you can focus on models and applications. It is the easiest and (in most cases) the safest way to deploy your application. Reliability and scalability are guaranteed by the cloud provider and are managed through a couple of simple checkboxes and dropdowns. You can bring your code, your container, or even build on top of existing solutions from the cloud marketplace. Serverless ML app deployments are often done via AWS SageMaker, Google Vertex, and Azure ML Studio. These environments are specifically tailored for developing, testing, and deploying ML applications to the cloud. This option is usually cheaper and more time-efficient than spinning up your own VMs or containers. The platforms mentioned above provide tons of useful tools for Data Science and Machine Learning, and they can be used by people with very different levels of knowledge, from basic to advanced.
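As an illustration, here is a rough sketch of deploying a pre-trained model artifact to a SageMaker endpoint with the SageMaker Python SDK. The IAM role, S3 path, image URI, instance type, and endpoint name are placeholders; the exact inference image depends on your framework and region.

```python
# Sketch: deploying a pre-trained model artifact as a SageMaker endpoint.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role

model = Model(
    image_uri="<framework-inference-image-uri>",          # placeholder serving image
    model_data="s3://my-bucket/models/model.tar.gz",       # placeholder artifact location
    role=role,
    sagemaker_session=session,
)

# SageMaker provisions the instances and wires up the HTTPS endpoint for us.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="my-demo-endpoint",
)
```

Vertex AI and Azure ML Studio offer equivalent "bring your artifact, get an endpoint" flows through their own SDKs and consoles.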
As devices are getting more advanced, it is becoming more common to deploy models to edge devices such as mobile phones, drones, and other smart devices. This enables us to make fast and reliable inferences tailored to a specific customer. We can also gather data for model training and fine-tuning from a vast network of devices. Libraries and frameworks like TFX and TFLite provide excellent capabilities for distributed learning, deployment, serving, and updating on edge devices and fleets of edge devices. Apart from using some extra libraries, you will also have to account for a particular device's specific limitations and capabilities, like, for example, the Google ML Kit on Android devices. Also, most of these devices have very limited resources, which will impact all phases of the pipeline, from data preparation to the deployment and monitoring of the model/application.
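As a small taste of the edge workflow, here is a sketch of converting a TensorFlow SavedModel into a TFLite model that can be shipped to a mobile device; the paths are placeholders, and quantization is optional.

```python
# Sketch: converting a SavedModel to TFLite for edge deployment.
import tensorflow as tf

# Assumption: a trained model was exported with tf.saved_model.save() to this directory.
converter = tf.lite.TFLiteConverter.from_saved_model("export/my_saved_model")

# Optional: apply default optimizations (e.g. quantization) to shrink the model
# for resource-constrained devices.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```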
Deployment process
As mentioned above, the deployment process and setup vary dramatically based on your chosen deployment option. Or rather, on what your requirements define.
The good news is that in any deployment model, you should have certain structural elements in place:
- Model in the Model registry;
- API endpoint;
- App hosting;
- Authorization and identification;
- CI/CD;
- Model/app monitoring.
When training is done, you save your model artifacts, which usually include weights, architecture, and config files. It is usually a good idea to have a repository for your models to enable and manage versioning; this repository is often called a Model Registry.
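For example, a sketch of registering and retrieving a model with MLflow's Model Registry might look like the following; the tracking URI and registered model name are placeholders, and an MLflow tracking server is assumed to be available.

```python
# Sketch: registering and loading a model via MLflow's Model Registry.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a toy model so the example is self-contained.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
trained_model = LogisticRegression().fit(X, y)

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder tracking server

# Log the model and register it under a versioned name in the registry.
with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=trained_model,
        artifact_path="model",
        registered_model_name="customer-churn",          # placeholder registry name
    )

# Later, the serving side loads a specific registered version from the registry.
loaded = mlflow.pyfunc.load_model("models:/customer-churn/1")
```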
Your application will usually receive requests via a specific REST API endpoint. The endpoint can also optionally have a Load Balancer behind it, which removes the need to communicate with each application instance separately.
You will also implement the process by which your model moves from the model registry to the infrastructure, so that requests sent to the API endpoint reach it. We will call this process model Continuous Integration / Continuous Deployment (CI/CD). Ideally, it is continuous. Sometimes it is not, though :). When the model is deployed, you will monitor its work. This is an often overlooked but essential part of the MLOps pipeline: it ensures you catch data and concept drift. Let me explain the data and concept drift terms in case they are new or confusing for you.
Let’s imagine a Machine Learning task formulated in a very simplified way:
- We have X, which contains n data points, each being an m-dimensional vector (X_i1, X_i2, …, X_im);
- We have Y, a set of n labels, one for each of the n data points (Y_1, …, Y_n).
Data drift happens when the distribution of X has changed over time.
For example, we had three segments of customers: young, middle-aged, and senior. Young customers were 20%, middle-aged represented 50%, and the remaining 30% were senior people. As time passes, we can see the age split changing; now it is 10% young, 20% middle-aged, and 70% senior. Such a change in the data will impact the model's performance, so the model has to be retrained on new data.
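To make the age-segment example concrete, here is a toy sketch that checks whether the current segment distribution still matches the one seen at training time, using a chi-squared goodness-of-fit test; the counts and significance threshold are illustrative.

```python
# Toy sketch: detecting data drift in a categorical feature (customer age segment).
import numpy as np
from scipy.stats import chisquare

# Segment proportions observed at training time: young, middle-aged, senior.
train_proportions = np.array([0.20, 0.50, 0.30])

# Segment counts observed in recent production traffic (illustrative numbers).
current_counts = np.array([100, 200, 700])

# Expected counts if the distribution had stayed the same as at training time.
expected_counts = train_proportions * current_counts.sum()

stat, p_value = chisquare(f_obs=current_counts, f_exp=expected_counts)

if p_value < 0.01:  # the threshold is a judgment call, not a universal constant
    print(f"Data drift suspected (p={p_value:.4f}) - consider retraining.")
```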
Concept drift is when the relationship between X (data points) and Y (labels) has changed over time.
A simple example: initially, most senior people from Dusseldorf with above-average income (our X) preferred Mercedes (label Y). But as time passes, their preferences shift toward, say, Tesla. Our model's prediction accuracy will drop because the relationship between X and Y it learned, the conditional probability P(Y|X), no longer holds.
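A very simple way to catch concept drift in production is to compare the model's recent accuracy, once true labels arrive, against the accuracy it showed at validation time. The sketch below assumes delayed ground-truth labels are available and uses made-up numbers.

```python
# Toy sketch: flagging possible concept drift via a drop in rolling accuracy.
import numpy as np

baseline_accuracy = 0.92   # accuracy measured on the validation set at training time
tolerance = 0.05           # acceptable degradation before raising a flag (illustrative)

# Recent predictions and the ground-truth labels that arrived later (made-up data).
recent_predictions = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
recent_labels      = np.array([1, 0, 0, 1, 0, 0, 0, 1, 1, 0])

rolling_accuracy = (recent_predictions == recent_labels).mean()

if rolling_accuracy < baseline_accuracy - tolerance:
    print(f"Possible concept drift: accuracy dropped to {rolling_accuracy:.2f}")
```

In practice you would compute this over a sliding window of recent traffic and feed the result into your monitoring and alerting stack.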
Last but not least, you will have to think about the identity and access management component of your deployment. We will not describe it in this series, but there is a separate section of the blog where we talk a lot about security.
Next steps
In the following articles, we will cover all deployment options and components described above. It will be a long series of articles, and I hope you will enjoy reading it. I certainly will have fun writing it.
Because the data transformation, training, and tuning steps look different for different deployment options, we will be using different datasets. Our goal is simply to achieve the best ratio of investment (the time we spend) to outcome (the knowledge we gain) for you.
We will start with a simple deployment to a VM, which can then be scaled to a VM scale set. Then we will cover AWS SageMaker, Google Vertex, and Azure ML Studio. And finally, we will talk about containerized deployments. I left containers for last because they will be pretty straightforward, and I am also planning to add some articles about containers, which I will reference for your convenience.
I hope everything makes sense to you now. Let's carry on to the following article: