Two architectures for machine learning model training used in enterprise programs
Machine learning for seasoned software professionals – Part 2
Note: This is part 2 of the machine learning series for seasoned software professionals. I feel strongly that if you are new to the AI domain and start directly with Gen AI, you will only develop a superficial understanding of this huge and ever-evolving area.
The alternative is to start with machine learning, because it gives all of us a common ground to skill up on, and that common ground is data, which is at the heart of machine learning.
I know it sounds like a lot, but the current job market warrants it. Today there are more machine learning projects than Gen AI projects, so it may also be a wise step from an application standpoint.
This is a long newsletter, so consider saving it or starring it in your email.
Design patterns are proven solutions to recurring problems. When deploying large-scale machine learning architectures in production, several design patterns can be applied to ensure scalability, reliability, and maintainability.
Some common machine learning patterns include data pipeline patterns, model training patterns, model deployment patterns, data storage patterns, machine learning resiliency patterns, scalability patterns, logging and monitoring patterns, and security patterns.
In this article, I will talk about “model training approaches” used in enterprise projects, specifically:
1. Parameter server pattern
2. Checkpointing pattern
What the hell is a ‘model’ in machine learning?
In machine learning, a "model" refers to a system or algorithm that has been trained on data to make predictions or decisions without being explicitly programmed for that specific task.
A model learns patterns from historical data and then applies these patterns to new, unseen data.
Example from a Retail Use Case: Product Recommendation System
1. Problem Definition: A retail company wants to suggest products to its online users based on their past purchase history and browsing behaviors to increase sales and improve user experience.
2. Data Collection: The company collects data on:
Products each user has bought in the past.
Products each user has browsed or added to their cart.
Products often bought together.
User demographics and preferences, if available.
3. Model Training: Using this data, a machine learning model (like collaborative filtering or matrix factorization) is trained. For simplicity, let's consider collaborative filtering:
The model learns patterns like, "Users who bought product A also tend to buy product B."
If a new user buys product A, the system can recommend product B based on this learned pattern.
4. Using the Model: Once trained, the recommendation model is integrated into the online shopping platform. When a user browses products or checks their cart, the system uses the model to generate product recommendations, displaying items that the user is likely to be interested in purchasing based on their behavior and the behaviors of similar users.
5. Outcome: The outcome of using this machine learning model is often multi-fold:
Increased sales: Users often find the recommendations relevant and end up buying suggested products.
Enhanced user experience: Users find it easier to discover products they might like, leading to higher satisfaction.
More effective marketing: Personalized recommendations can be more effective than generic ads or promotions.
In this example, the "model" is the recommendation system trained using collaborative filtering. It's important to note that models can be continually updated and refined as more data becomes available or as user behaviors change.
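To make the collaborative filtering idea concrete, here is a toy, hypothetical sketch of item co-occurrence counting, one of the simplest forms of the "users who bought A also bought B" logic. The baskets, product names, and scoring below are purely illustrative; production recommenders use far richer models (matrix factorization, neural recommenders) and far more data.

```python
from collections import Counter
from itertools import combinations

# Toy purchase histories (hypothetical data).
baskets = [
    {"laptop", "mouse", "laptop_bag"},
    {"laptop", "mouse"},
    {"phone", "phone_case"},
    {"laptop", "laptop_bag"},
]

# Count how often each pair of products is bought together.
co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1

def recommend(product, top_n=3):
    """Recommend products most frequently co-purchased with `product`."""
    scores = Counter()
    for (a, b), count in co_counts.items():
        if a == product:
            scores[b] += count
        elif b == product:
            scores[a] += count
    return [item for item, _ in scores.most_common(top_n)]

print(recommend("laptop"))  # e.g. ['laptop_bag', 'mouse']
```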
OK, now let us see what model training is.
What is model training in machine learning, and why does it matter?
Model training is one of the core processes in machine learning. It involves feeding a dataset into a machine learning algorithm to learn patterns, relationships, or structures from it. The primary objective is to build a model that can make accurate predictions or decisions based on new, unseen data.
How Model Training Works:
Selection of Algorithm: The process starts by selecting a suitable machine learning algorithm, such as linear regression, decision trees, neural networks, etc., depending on the problem type (classification, regression, clustering, etc.).
Feeding Data: The chosen algorithm is then fed with a dataset, which is typically split into two subsets:
Training Set: Used directly for learning. The algorithm tries to find patterns in this data.
Validation Set: Used to fine-tune model parameters and prevent overfitting.
Learning Process: During the training process, the model makes predictions on the training data and then adjusts itself based on the error of its predictions. This adjustment continues iteratively until the model's predictions are satisfactory or further training no longer significantly reduces error.
Evaluation: After training, the model's performance is evaluated on a separate set of data called the test set to gauge its accuracy, precision, recall, etc., on unseen data.
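Putting these steps together, here is a minimal, illustrative TensorFlow/Keras sketch of the workflow. The random data, 70/15/15 split, and tiny network are placeholders, not recommendations.

```python
import numpy as np
import tensorflow as tf

# Placeholder data; in a real project this comes from your data pipeline.
x = np.random.rand(1000, 10).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

# Split into training, validation, and test sets (70/15/15 here, illustrative).
x_train, x_val, x_test = x[:700], x[700:850], x[850:]
y_train, y_val, y_test = y[:700], y[700:850], y[850:]

# 1. Selection of algorithm: a small feed-forward network for regression.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# 2-3. Feeding data and learning: iterate over the training set while
#      monitoring the validation set to watch for overfitting.
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, verbose=0)

# 4. Evaluation: measure error on the held-out test set (unseen data).
print("Test loss:", model.evaluate(x_test, y_test, verbose=0))
```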
Model training architecture 1 – Parameter Server pattern
The Parameter Server Pattern is a design pattern used for distributed machine learning, particularly for training large-scale models. This pattern can be especially useful for models that are too large to fit in the memory of a single machine or when training data is massive.
Fig: Architecture of a parameter server communicating with several groups of workers. Source: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf | Paper name: “Scaling Distributed Machine Learning with the Parameter Server”
Parameter Server Pattern
Basic Concept: In the context of the Parameter Server design pattern, "parameters" refer to the set of variables or weights that define and are optimized during the training of a machine learning model. These parameters are adjusted iteratively based on the training data to minimize the model's prediction error.
In distributed machine learning, the Parameter Server centralizes these parameters, allowing multiple worker nodes to access and update them during parallel training sessions.
In the Parameter Server Pattern, a distributed machine learning system is partitioned into:
Parameter Servers: These are nodes that maintain a global version of the model parameters.
Worker Nodes: These nodes are responsible for computing gradients on their local data.
Training is iterative. In each iteration:
Worker nodes pull the latest model parameters from the parameter servers.
Each worker computes gradients based on its local data and the pulled parameters.
Workers then push these gradients back to the parameter servers.
The parameter servers update the global model parameters based on the received gradients.
This pattern allows for the asynchronous updating of parameters, which can lead to faster convergence in some scenarios compared to synchronous updates.
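As a mental model, here is a deliberately simplified, single-process simulation of that pull/compute/push cycle: plain NumPy, synchronous, one parameter shard. Real deployments run many servers and workers in parallel over the network; treat this as a sketch of the interaction, not an implementation.

```python
import numpy as np

class ParameterServer:
    """Holds the global parameters and applies incoming gradient updates."""
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr

    def pull(self):
        # Workers pull the latest global parameters.
        return self.params.copy()

    def push(self, grads):
        # Server applies a simple SGD update with the received gradients.
        self.params -= self.lr * grads

def worker_gradients(params, x_shard, y_shard):
    """A worker computes mean-squared-error gradients on its local data shard."""
    preds = x_shard @ params
    return 2 * x_shard.T @ (preds - y_shard) / len(y_shard)

# Fake linear-regression data, sharded across 4 "workers".
rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
shards = []
for _ in range(4):
    x = rng.normal(size=(100, 5))
    shards.append((x, x @ true_w))

server = ParameterServer(dim=5)
for step in range(50):                      # iterative training
    for x_shard, y_shard in shards:         # in reality, workers run in parallel
        params = server.pull()              # 1. pull latest parameters
        grads = worker_gradients(params, x_shard, y_shard)  # 2. local gradients
        server.push(grads)                  # 3. push gradients; server updates

print("Recovered true weights:", np.allclose(server.params, true_w, atol=1e-3))
```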
Example: Distributed Training of a Deep Learning Model
Fig: For 100 workers, each worker only needs 7.8% of the total parameters; with 10,000 workers this reduces to 0.15%.
Source: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf | Paper name: “Scaling Distributed Machine Learning with the Parameter Server”
Initialization:
A deep learning model with millions of parameters is initialized.
The parameters are distributed across multiple parameter servers.
Data Partition:
The training dataset is divided into batches and distributed among multiple worker nodes.
Training Cycle:
Each worker node pulls the current model parameters from the parameter servers.
The worker processes its batch of data and computes the gradients for the parameters based on its local data.
The worker then pushes these gradients back to the parameter servers.
The parameter servers aggregate gradients from all workers and update the global model parameters.
Completion:
Once the model converges or a certain number of iterations is reached, the training process ends. The global model parameters in the parameter servers represent the trained model.
Advantages:
Scalability: The Parameter Server Pattern scales well with both the size of the model and the size of the data.
Flexibility: It supports both data parallelism (where different data subsets are processed in parallel) and model parallelism (where different parts of the model are processed in parallel).
Asynchronous Updates: Asynchronous updates can lead to faster convergence and better utilization of computational resources.
Challenges:
Consistency: Asynchronous updates can sometimes lead to inconsistencies in the model parameters. Techniques like "staleness control" are often employed to manage this.
Network Bottlenecks: There can be network communication overheads, especially when there are a large number of workers or the model is vast.
The Parameter Server Pattern offers a robust framework for distributed machine learning, especially beneficial for training large models on extensive datasets. However, it's essential to manage the potential challenges effectively to ensure smooth and efficient training.
Code for the TensorFlow implementation is shared at –
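For a flavor of what such an implementation looks like, below is a minimal sketch built on TensorFlow's tf.distribute.experimental.ParameterServerStrategy. It assumes a TF_CONFIG environment variable already describes the cluster (coordinator, workers, and parameter servers); dataset creation and the training loop are intentionally left out, so treat it as an outline rather than a complete recipe.

```python
import tensorflow as tf

# Resolve the cluster (coordinator, workers, ps) from the TF_CONFIG env var.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

# The strategy places model variables on the parameter servers and
# dispatches computation to the worker tasks.
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
    # Variables created inside the scope live on the parameter servers.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# From here, training is driven either by Keras model.fit with a distributed
# dataset, or by a custom loop using
# tf.distribute.experimental.coordinator.ClusterCoordinator; see the official
# TensorFlow parameter server training guide for the full workflow.
```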
Model training architecture 2 – Checkpointing pattern
Checkpointing is a design pattern used in machine learning to save the state of a model or system at regular intervals or checkpoints. This ensures that one can recover or roll back to a saved state in case of failures, interruptions, or even to prevent loss of progress during long training sessions.
Checkpointing Pattern
Basic Concept: During the training of machine learning models, especially deep learning models, the state of the model, which includes weights, biases, and optimizer state, is saved at regular intervals or after a specific number of epochs.
These saved states, or checkpoints, can be loaded later to resume training, for evaluation, or to revert to a previously well-performing state.
A code example from TensorFlow is at: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/checkpoint.ipynb
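If you just want the gist without opening the notebook, here is a condensed sketch in the spirit of that guide, using tf.train.Checkpoint and tf.train.CheckpointManager. The model, the ./tf_ckpts directory, and the max_to_keep value are illustrative.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.Input(shape=(10,)), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(0.001)

# Track both the model and the optimizer so their state is saved together.
ckpt = tf.train.Checkpoint(step=tf.Variable(0), model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="./tf_ckpts", max_to_keep=3)

# Resume from the latest checkpoint if one exists, otherwise start fresh.
ckpt.restore(manager.latest_checkpoint)
if manager.latest_checkpoint:
    print("Restored from", manager.latest_checkpoint)
else:
    print("Initializing from scratch.")

for epoch in range(10):
    # ... one epoch of training would go here ...
    ckpt.step.assign_add(1)
    save_path = manager.save()          # write a checkpoint after each epoch
    print(f"Saved checkpoint for step {int(ckpt.step)}: {save_path}")
```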
Example: Training a Deep Learning Model Over Multiple Days
Initialization:
A deep learning model is initialized for training on a large dataset, expected to run for several days.
Training Setup:
A checkpointing system is set up to save the model's state every hour or after every epoch, whichever is more suitable for the specific scenario.
Training Process:
As the model trains, at the end of each epoch or at the specified interval, the model's weights and biases are saved to disk or cloud storage.
In addition to periodic checkpointing, the system can be configured to save the model whenever it achieves a new best performance on the validation set (see the callback sketch after this walkthrough).
Interruption:
Suppose there's a power outage or system crash 3 days into the training. Without checkpointing, all progress would be lost.
Recovery:
Thanks to the checkpointing pattern, once the system is back, training can be resumed from the last saved checkpoint. This prevents the need to start over from scratch.
Completion:
Once the training is completed, the saved checkpoints also provide the flexibility to select not just the final model but any previous model state. This can be useful if, for instance, the model started to overfit after a certain point and a previous state achieved better generalization.
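In Keras-based projects, the "save on every new validation best" behavior described above is commonly expressed as a ModelCheckpoint callback. Here is a hedged sketch; the file path, monitored metric, and toy data are placeholders.

```python
import numpy as np
import tensorflow as tf

# Placeholder data and model for illustration.
x = np.random.rand(500, 10).astype("float32")
y = np.random.rand(500, 1).astype("float32")
model = tf.keras.Sequential([tf.keras.Input(shape=(10,)), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# Save weights only when validation loss improves on the previous best.
best_ckpt = tf.keras.callbacks.ModelCheckpoint(
    filepath="best.weights.h5",   # illustrative path/format
    monitor="val_loss",
    save_best_only=True,
    save_weights_only=True,
)

model.fit(x, y, validation_split=0.2, epochs=10, callbacks=[best_ckpt], verbose=0)

# Later, the best-performing weights can be restored for evaluation or serving.
model.load_weights("best.weights.h5")
```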
Source: https://www.tensorflow.org/guide/checkpoint
The provided diagram visually illustrates the intricate relationships and components in a machine learning model's checkpoint, with a specific focus on the optimizer and its associated variables.
Nodes and Colors:
Optimizer (Red Node): The optimizer, which is essential for adjusting model parameters based on the gradients during training, is represented by the red node.
Optimizers such as Adam, SGD, etc., use specific parameters and stateful information for their functioning.
Regular Variables (Blue Nodes): These blue nodes represent standard model variables. They can be model parameters like weights (often denoted by terms like 'kernel') or biases ('bias').
Optimizer Slot Variables (Orange Nodes): These orange nodes symbolize 'slot' variables that the optimizer maintains for each standard model variable. These are stateful and used by the optimizer to remember certain attributes of the variables to improve optimization over time.
Other Nodes (Black): Representing components like tf.train.Checkpoint, these black nodes are crucial for creating and managing checkpoints, ensuring that the state of both model variables and optimizer variables is saved and can be restored.
Edges and Relationships:
The relationships between these nodes are depicted using the edges (lines). The solid edges indicate direct associations, while the dashed edges signify conditional relationships.
The 'm' and 'v' connections to the slot variables, for instance, denote the momentum and velocity terms respectively when using an optimizer like Adam. Momentum and velocity help the optimizer to make more informed updates, considering the past gradients.
Explanation with Respect to Checkpointing:
When checkpointing in a training scenario, it's not just the model's direct parameters that need to be stored. The optimizer's state, which includes these slot variables, also needs to be saved. This ensures that when the training resumes from a checkpoint, the optimizer doesn't start from scratch but continues with its learned state, optimizing the learning process.
The dotted relationships illustrate that slot variables (like momentum for a particular weight) are stored in a checkpoint only if both the corresponding variable (e.g., the weight) and the optimizer are part of the checkpoint. This is logical because slot variables only have meaning in the context of both the variable they relate to and the optimizer that uses them.
Checkpointing in machine learning is not just about storing the model's parameters but also about capturing the complete state of the training process. This includes the optimizer's state, which is critical for effective learning.
By understanding these relationships, practitioners can design more resilient and efficient training pipelines.
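A quick, hedged way to see this for yourself is to write a checkpoint and list what it contains: both the kernel/bias variables and the optimizer's slot variables show up. The tiny model and the ./ckpt_demo prefix below are illustrative.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(2)])
optimizer = tf.keras.optimizers.Adam(0.01)

# One training step so the optimizer actually creates its slot variables.
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(model(tf.ones((8, 4))) ** 2)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))

# Checkpoint both the model and the optimizer, then inspect the file.
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
path = ckpt.save("./ckpt_demo")

# The listing includes kernel/bias variables *and* the optimizer's state,
# confirming that checkpointing captures more than the model parameters.
for name, shape in tf.train.list_variables(path):
    print(name, shape)
```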
Advantages of checkpointing pattern:
Fault Tolerance: In case of interruptions or failures, you can resume from the last checkpoint without losing significant progress.
Avoid Overfitting: Allows for model selection from a series of checkpoints, which can be beneficial if later stages of training led to overfitting.
Flexibility: Facilitates experimentation. Researchers can revert to a saved state to try different training strategies or hyperparameters.
Challenges of checkpointing pattern:
Storage Overheads: Especially with large models, saving frequent checkpoints can consume significant storage.
I/O Bottlenecks: Saving a model state, especially to cloud storage, can introduce I/O delays in the training process.
Conclusion:
In the enterprise ecosystem of machine learning, efficiently training models at scale remains a paramount challenge. The "Parameter Server" and "Checkpointing" design patterns emerge as pivotal techniques in this realm, each addressing unique facets of the training process.
The Parameter Server pattern distributes computation across workers while centralizing parameter storage on dedicated servers, optimizing the training process for large-scale datasets and distributed environments. By separating parameter storage from computation nodes, it not only facilitates efficient model updates but also provides a well-defined communication mechanism, making it an invaluable asset for organizations aiming to harness the power of distributed computing.
On the other hand, the Checkpointing pattern focuses on the resilience and longevity of the training process. In the unpredictable landscape of machine learning, where long-running processes are vulnerable to failures, Checkpointing acts as a safety net. By periodically saving the model's state, it ensures that valuable computational resources and time are preserved, allowing for recovery and continuation from the last known state.
Together, these patterns underscore a broader narrative in machine learning: the importance of robust, scalable, and resilient training processes.
As machine learning practitioners, architects, and enthusiasts, our journey towards better models is not just about algorithms, but also about the patterns and practices that enable their potential.
This is a newsletter, so feel free to share it with others.
References:
“Scaling Distributed Machine Learning with the Parameter Server”, Mu Li et al., OSDI 2014 – https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf
TensorFlow checkpointing guide – https://www.tensorflow.org/guide/checkpoint