(Part 1) DIY Machine Learning Platform - Intro & Pipelines


In my day job I have the privilege of building and deploying machine learning (ML) solutions for a large company using state-of-the-art cloud technologies. While these technologies are extremely powerful and versatile, the effort to serve ML models in a scalable, reliable and reproducible way is quite high, and incorporating new technologies and architectural solutions into existing software stacks requires significant time and effort.

To have a more lightweight technology stack for playing around with solutions to different problems in the ML lifecycle, I had the idea to build an ML platform by simpler and more accessible means. Thus I would like to invite you along on my journey of building a machine learning platform based on open source software and running on a humble Raspberry Pi.

What are the challenges for productive machine learning use cases?

Before starting to build stuff, let's look at why using only Anaconda and a Jupyter notebook is not enough for machine learning in production (i.e. business processes and customers relying on your model's output). In the introduction I already mentioned three major requirements that apply when somebody else is making (or losing) money with your model:

  • Reproducibility: When something goes wrong or we want to build up institutional memory, we need to be able to answer questions like "Which code version was used to create this prediction?" or "What data was used to train the model?", and we need to fulfill requests like "Please switch back to the model we used on DD.MM.YY" or "Show me the data model version x.y.z was trained on".
  • Reliability: If a business process depends on input from our model, we need to make sure the model is available at all times. Even more: we need to make sure the model is performing at its best at all times, which means finding problems (for example in the source data) early and reacting to them.
  • Scalability: Increasing traffic to models should pose no problem in a production environment. But scalability is more than that: a growing number of models used by customers as well as a growing data science team should also be kept in mind when talking about scalability.
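
To make the reproducibility requirement a bit more concrete, here is a minimal sketch of the kind of record a platform could keep for every training run. It uses only the Python standard library. The names TrainingRecord, fingerprint and find_record as well as all example values (version numbers, commit hashes, dates) are made up for illustration and are not part of any tool used later in this series.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRecord:
    """What we need to know to reproduce one trained model."""
    model_version: str
    code_version: str  # e.g. a git commit hash (hypothetical values below)
    data_sha256: str   # fingerprint of the exact training data
    trained_on: str    # date of the training run


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that changes whenever the data changes."""
    return hashlib.sha256(data).hexdigest()


def find_record(records, model_version):
    """Answer 'which code and data belong to model version x.y.z?'."""
    for record in records:
        if record.model_version == model_version:
            return record
    return None


training_data = b"age,income,label\n42,50000,1\n23,32000,0\n"
records = [
    TrainingRecord("1.0.0", "a1b2c3d", fingerprint(b"older data"), "2023-01-15"),
    TrainingRecord("1.1.0", "e4f5a6b", fingerprint(training_data), "2023-02-01"),
]

record = find_record(records, "1.1.0")
print(record.code_version, record.data_sha256[:12])
```

With such a record per model version, a request like "Show me the data model version x.y.z was trained on" boils down to a lookup plus fetching the dataset with the matching fingerprint.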

This list is by no means comprehensive, but it should make clear that there is more to machine learning than finding a well-performing algorithm. We need help in answering all the questions and solving all the problems outlined above. We need a special piece of software. Meet the machine learning platform.

What is a machine learning platform?

Machine learning platforms help data engineers, data scientists and ML engineers to build end-to-end machine learning solutions in such a way that the challenges mentioned above do not turn into real problems in your production environment. Such a platform helps in building and maintaining data pipelines and model pipelines which are easily monitored, explained and scaled.

An ML platform does this by providing the data team with the features and components shown in the following image.

Figure: Features and components of an ML platform

Let's go through the components, starting with the core features of the platform:

  • Data Ingestion: We need a way to ingest external data and store it so we can later use it for our model.
  • Data Transformation & Validation: Only storing data does not help of course. We need a way to transform raw data into features which can be consumed by our models. Additionally we need to be able to check our data for constraints, completeness and other problems.
  • Feature Store: One of the central architectural patterns which emerged in recent years is the feature store. The feature store allows us to store up-to-date features which can be directly consumed by our models. It stores not only the latest value of a feature but its complete history, which means the feature store can be used to create training data (offline feature store). The online feature store is populated with the most up-to-date version of our features and is thus used when a model needs to create a prediction. The beauty of this architecture is the decoupling of feature creation from the creation of the prediction: we can change the underlying features a model uses without any consumer noticing. Also, the transformation steps for creating features are the same for prediction and for training, which leaves less room for mistakes.
  • Model Training: We need (lots of) compute power for training our models and we need to be able to start multiple trainings in parallel. For training of computer vision models you would want to have access to GPUs as well.
  • Experiment Tracking: Every model training with a different algorithm, a different model architecture or different hyperparameters can be seen as an individual experiment. We need to keep track of all these experiments and their outcomes.
  • Model Registry: After finding a suitable model (which is the combination of code and the artifact resulting from the training run) we need to store the model in a central repository. This way everybody has an overview of the available models. We also need to make sure that model versions are persistent so we can fulfill reproducibility requirements.
  • Model Serving: When we are happy with our model performance and want to make predictions available to customers, we need to serve the model. This can be a batch serving service or an endpoint which allows for realtime requests to the model. Scalability as well as reliability are important for this feature.
  • Model Monitoring: One of the most important but, in my experience, often overlooked parts of an ML platform. We need to observe the behavior of the model and monitor its output. This allows us to compare the predictions against the ground truth (what really happened) and thereby assess the model performance in the real world. It is also convenient to monitor the input into the model so we can quickly distinguish between data drift and concept drift.
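
To illustrate the difference between the offline and the online view of a feature store described above, here is a toy sketch in plain Python. It is not how a real feature store is implemented; the class TinyFeatureStore, its methods and the example feature name are hypothetical and only meant to show the idea of "full history for training, latest value for serving".

```python
from bisect import bisect_right


class TinyFeatureStore:
    """Toy feature store: full history (offline) plus latest values (online)."""

    def __init__(self):
        # (entity_id, feature) -> list of (timestamp, value), sorted by timestamp
        self._history = {}

    def write(self, entity_id, feature, timestamp, value):
        rows = self._history.setdefault((entity_id, feature), [])
        rows.append((timestamp, value))
        rows.sort(key=lambda row: row[0])

    def get_online(self, entity_id, feature):
        """Latest known value, used when a model creates a prediction."""
        rows = self._history.get((entity_id, feature), [])
        return rows[-1][1] if rows else None

    def get_historical(self, entity_id, feature, as_of):
        """Value that was valid at time `as_of`, used to build training data."""
        rows = self._history.get((entity_id, feature), [])
        timestamps = [ts for ts, _ in rows]
        idx = bisect_right(timestamps, as_of)
        return rows[idx - 1][1] if idx > 0 else None


store = TinyFeatureStore()
store.write("customer_1", "orders_last_30d", timestamp=1, value=3)
store.write("customer_1", "orders_last_30d", timestamp=5, value=7)

latest = store.get_online("customer_1", "orders_last_30d")
past = store.get_historical("customer_1", "orders_last_30d", as_of=3)
print(latest, past)
```

Both lookups read from the same stored feature values, which is the point made above: training and prediction share one transformation result instead of two separately maintained code paths.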

Quite a lot… but that is not all. To enable many of these features we need some very fundamental building blocks, shown in the image above below the ML platform (since… you know… the platform rests on those pillars).

  • Code Version Control: Hopefully self-explanatory. We need a way to version our code and store these versions.
  • Data Version Control: Maybe not so self-explanatory. Since an ML model is a combination of code and a training artifact, and the training artifact can only be created from data, we need to version datasets and store those versions so we can retrieve a specific version of the data.
  • Pipelines: Maybe THE fundamental building block of an ML platform. We need to execute logic in a specific order. This can be code for transforming data, creating features and storing them in the feature store, or executing a training and deploying a model.
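
The core idea of a pipeline, executing logic in a specific order, can be sketched with the standard library alone (graphlib requires Python 3.9 or newer). The step names, the shared ctx dictionary and the toy "model" (just an average) are assumptions for illustration; a real orchestrator adds scheduling, retries, logging and much more on top of this.

```python
from graphlib import TopologicalSorter


def ingest(ctx):
    ctx["raw"] = [1, 2, 3, 4]


def build_features(ctx):
    ctx["features"] = [x * 2 for x in ctx["raw"]]


def train(ctx):
    # Stand-in for a real training step: the "model" is just the mean feature value.
    ctx["model"] = sum(ctx["features"]) / len(ctx["features"])


STEPS = {"ingest": ingest, "build_features": build_features, "train": train}

# Each step maps to the set of steps that must run before it.
DEPENDS_ON = {
    "ingest": set(),
    "build_features": {"ingest"},
    "train": {"build_features"},
}


def run_pipeline(steps, depends_on):
    """Run all steps in an order that respects their dependencies."""
    ctx = {}
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        steps[name](ctx)
    return order, ctx


order, ctx = run_pipeline(STEPS, DEPENDS_ON)
print(order, ctx["model"])
```

Declaring dependencies instead of hard-coding the call order is what lets a pipeline tool work out by itself what has to run first.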

What will we actually be able to implement on our DIY platform?

So… all of the functionalities described need to be available on a modern ML platform, and for all of these components one would need to ensure reproducibility, scalability and reliability, including of course following appropriate DevOps principles like CI/CD, dev and prod environments, “infrastructure as code” and so on.

To make the challenge clearer, let's focus on the technical scalability of served models. Requiring redundancy and scalability for just one model endpoint would already require us to serve the model from more than one server. This makes clear that a single Raspberry Pi will not be able to fulfill everything we laid out above.

What to do? Since I am a data scientist, I want to focus on the components of our platform which impact the daily life of a data scientist building, deploying and monitoring models. Thus we will focus here on the core components and their interfaces to each other, while we will not be able to meet all the production requirements discussed.

Nevertheless we will be able to…

  • …perform simple data ingestion based on scheduled pipelines.
  • …perform data validation as well as feature engineering, storing the resulting features in a feature store.
  • …perform a model training based on training data created from our feature store.
  • …keep track of training results, register models in a central repository and deploy a realtime endpoint.
  • …monitor the predictions of our models.
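
As a rough idea of what the last item, monitoring predictions, can look like, here is a small sketch that checks both the model input and the prediction quality. It is a deliberately crude heuristic and not the approach of any monitoring library; the function names and the thresholds shift_limit and min_accuracy are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def input_shift(reference, current):
    """Shift of the current input mean, in units of the reference standard deviation."""
    spread = pstdev(reference) or 1.0  # avoid division by zero for constant inputs
    return abs(mean(current) - mean(reference)) / spread


def accuracy(predictions, ground_truth):
    """Share of predictions that match what really happened."""
    hits = sum(p == t for p, t in zip(predictions, ground_truth))
    return hits / len(ground_truth)


def diagnose(reference, current, predictions, ground_truth,
             shift_limit=1.0, min_accuracy=0.8):
    """Very rough triage: did the inputs change, or only the model quality?"""
    if input_shift(reference, current) > shift_limit:
        return "possible data drift"
    if accuracy(predictions, ground_truth) < min_accuracy:
        return "possible concept drift"
    return "ok"


training_inputs = [1, 2, 3, 4, 5]
status = diagnose(
    reference=training_inputs,
    current=[1, 2, 3, 4, 5],       # inputs look like the training data...
    predictions=[1, 0, 1, 0],
    ground_truth=[0, 1, 1, 0],     # ...but only half of the predictions are right
)
print(status)
```

Watching the input alongside the predictions is what makes the distinction possible: if the inputs moved, the data changed; if they did not but the accuracy dropped, the relationship between input and target probably did.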

All of these functionalities will be available in three different environments (dev, staging, prod) which allows us to develop and test stuff while still being able to serve other versions of the same software.

As I said, we will not bother with setting up a cluster, we will not implement dev and prod environments for our foundational tools (e.g. our pipeline scheduler), we will only have limited CI/CD functionality and so on. Still, at the end we will have something with which we can get a feel for the inner workings of an ML platform and how all these components interact with each other.

Which technologies will we use?

At the time of writing the plan is to use the following tools and libraries:

  • Source Version Control: git and gitlab
  • Data Version Control: dvc
  • Pipelining: dagster
  • Feature Store: feast
  • Model Registry, Experiment Tracking, Model Serving: mlflow
  • Model Monitoring: whylogs
  • Data Validation: deepchecks
  • Data Ingestion & Transformation: Standard libraries like pandas
  • Model Training: Standard libraries like scikit-learn

What's next?

Let's start with a short disclaimer: By no means is anything written here complete and comprehensive. Almost every topic discussed today is significantly more complex and nuanced, and needs other or additional tools when we talk about a production-ready machine learning platform. For example, we completely left out the solution for storing your data and tools to efficiently perform your data transformations (looking at you, Spark…).

This post got longer than anticipated, so we will not start building anything related to our platform today. Still, you can use the time until the next post to set up your own Raspberry Pi so that it allows SSH access and has a working Python installation.