Using DVC to Manage Machine Learning Projects
DVC (Data Version Control) is an open-source version control system for machine learning projects. It enables data scientists and machine learning engineers to manage their data, models, and experiments efficiently. This post explores how to use DVC to manage machine learning pipelines and lifecycles.
The example code can be found in this GitHub repository. You can clone the repository and follow the instructions in this post to set up the environment.
# Prerequisites
## Downloading the dataset

- The dataset used can be downloaded from Kaggle.
- Unzip the dataset and place it in the `data/source` directory.
## Setting up the environment

- Create a virtual environment with Python 3.11 and activate it.

```shell
python -m venv .venv
. .venv/bin/activate
```

- Install the package manager `poetry` and install the dependencies. You can check the dependencies in the `pyproject.toml` file.

```shell
pip install poetry
poetry install
```
## Adding remote storage

You can use remote storage to manage your data. We use AWS S3 in the example code. You have to create a new bucket in your AWS account; in this example, the bucket name is `dvc-pipeline-2024`. The following command adds this bucket as the default remote storage.

```shell
dvc remote add -d remote_storage s3://dvc-pipeline-2024
```

In the `.dvc/config` file, you can see the remote storage configuration.

```
[core]
    remote = remote_storage
['remote "remote_storage"']
    url = s3://dvc-pipeline-2024
```
# Managing ML pipelines with DVC

## Defining a pipeline
In this section, we will explore how to manage a machine learning pipeline using the `dvc.yaml` file. The pipeline consists of the following steps:

- Processing the data
- Splitting the data into training and testing sets
- Training a model
- Evaluating the model

The `dvc.yaml` file outlines the pipeline, detailing each corresponding stage. Each stage in our workflow comprises a command to execute, as well as its dependencies, parameters, and outputs. For instance, consider the `process` stage:
```yaml
process:
  cmd: python src/process.py        # The command to be executed
  deps:                             # Required files
    - data/source/Housing.csv
  params:                           # Parameters defined in params.yaml
    - process.data_source_path
    - process.data_processed_path
  outs:                             # Output files
    - data/processed/processed.csv
```
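The parameters referenced in the stage live in a separate params file. A possible `params.yaml` matching this stage could look as follows; the values here are assumptions inferred from the stage definition, not copied from the repository.

```yaml
# Hypothetical params.yaml entries for the process stage; the actual
# values in the example repository may differ.
process:
  data_source_path: data/source/Housing.csv
  data_processed_path: data/processed/processed.csv
```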
DVC automatically tracks the data and the files produced at each stage, so you don't have to track them manually.

You can then visualize the pipeline with the `dvc dag` command.
```
+-----------------------------+
| data/source/Housing.csv.dvc |
+-----------------------------+
               *
               *
               *
          +---------+
          | process |
          +---------+
               *
               *
               *
        +------------+
        | split_data |
        +------------+
         **        **
       **            *
      *               **
+-------+               *
| train |             **
+-------+            *
        **         **
          **      **
            *    *
        +----------+
        | evaluate |
        +----------+
```
## Running the pipeline

The pipeline can be executed with the `dvc repro` command. Upon completion, a `dvc.lock` state file is generated as a snapshot of the results. To view the metrics of the pipeline, use the `dvc metrics show` command. Additionally, this pipeline produces a feature importance plot in the `metrics/plots` directory.

```
Path                  mse.test    mse.train
metrics/metrics.json  0.06426     0.01621
```
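The metrics shown above are read from a JSON file written by the evaluation stage. As a minimal sketch of how such a file could be produced (the `mse` helper and the example numbers are assumptions; the real script likely uses scikit-learn):

```python
# Sketch of how an evaluation step could write the metrics file that
# `dvc metrics show` displays. The MSE helper avoids external dependencies.
import json

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

metrics = {
    "mse": {
        "train": mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]),
        "test": mse([4.0, 5.0], [3.8, 5.3]),
    }
}
with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```

Declaring this file under a `metrics` key in `dvc.yaml` is what lets DVC render it as the table above.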
## Running experiments with different parameters

- To run experiments with different numbers of estimators, you can create a queue of experiments with the command below. The `--name` flag helps you identify the experiment, and the `-S` flag sets the parameters. If you need to set multiple values for a parameter, you can append `-S parameter=value1,value2` to the command.

```shell
dvc exp run \
  --name "n-estimator-size-second" \
  --queue \
  -S "train.n_estimators=100,150,200"
```

- Execute the queued experiments.

```shell
dvc exp run --run-all
```

- Check the results of the experiments.

```shell
dvc exp show
```
# Using DVC Studio to track experiments

## Setting up DVC Studio and adding credentials
DVC Studio is a web-based interface that allows you to share your machine learning projects with your team. To set up DVC Studio, follow the steps outlined in this guide. After the setup, when checking the project summary, you may see missing metrics. This is because the metrics are tracked by DVC and stored in remote storage rather than on GitHub.
To allow DVC Studio to access the metrics, you need to set up credentials. First, navigate to the `Settings` tab of the project and click on the `Data remotes / cloud storage credentials` section. Then, add new credentials for the remote storage. For example, if you are using AWS S3, the credentials should have read access to the S3 buckets.
## Automatically pushing the experiments to DVC Studio

To enable automatic pushing of experiments to DVC Studio, run the `dvc studio login` command to log in to DVC Studio. Once you rerun your experiments as outlined in the previous section, you can view the results in DVC Studio.
# Managing ML model lifecycles with DVC Studio

After several rounds of experiments, a matured model can be registered and assigned to stages such as `dev` for review or `prod` for deployment. However, the journey does not end after deployment. The model must be monitored and retrained to prevent degradation. In this section, we will explore how to manage the ML model lifecycle with DVC Studio.
## Registering the model with a new version

First, go to the `Models` tab, where you will see the `house-price-predictor-model` as defined in the `dvc.yaml` file. Click `Register`, enter a version number, and then click `Register version` in the pop-up window.

Registration will automatically create an annotated Git tag in the linked GitHub repository. You can view the tag in the `Code` tab of the repository. This makes it easy to track versions using the tags in the future.
## Moving your model to a stage

After registering the model, you can assign it to a stage. Click the `Assign stage` button and name the stage `dev`. Once the stage is assigned, a new tag is automatically created in the GitHub repository to reflect the stage.
## Setting up GitHub Actions for deployment

In the following scenario, we will move the model to the `prod` stage, simulating its deployment. We will use GitHub Actions to emulate the deployment process. To enable this, you need to add an access token to the GitHub repository, allowing GitHub Actions to access the DVC remote storage.

Go to the `Settings` tab of the project and add the DVC Studio access token to the GitHub repository secrets with the name `DVC_STUDIO_TOKEN`. The process is illustrated in the screenshot below.

You can find a template GitHub Actions workflow in the `.github` folder of the example code. Essentially, this workflow triggers on every tag creation event and checks if the stage is `prod`. If it is, the workflow downloads the model from remote storage and deploys it (in this example, no actual deployment occurs; only a message is printed). For a detailed explanation of this workflow, see here.
## Deploying the model

Now, move the model to the `prod` stage as before. After doing so, you should see that a GitHub Actions workflow is triggered and the printed message appears in the log.

You can then repeat the above steps to create a new version of the model and assign it to different stages to start a new ML model lifecycle.
# Summary

- Using `dvc.yaml` allows us to manage machine learning pipelines and specify artifacts and dependencies.
- DVC Studio offers a web-based interface to track experiments and share projects with your team.
- ML model lifecycles can be managed with DVC Studio by registering models and assigning them to different stages for deployment.
# Learn more

- DIY Data Version Control (DVC). You can find more materials about DVC through the provided link.