Using DVC to Manage Machine Learning Projects
DVC (Data Version Control) is an open-source version control system for machine learning projects. It enables data scientists and machine learning engineers to manage their data, models, and experiments efficiently. This post explores how to use DVC to manage machine learning pipelines and lifecycles.
The example code can be found in this GitHub repository. You can clone the repository and follow the instructions in this post to set up the environment.
# Prerequisites
## Downloading the dataset

- The dataset used can be downloaded from Kaggle.
- Unzip the dataset and place it in the `data/source` directory.
## Setting up the environment

- Create a virtual environment with Python 3.11 and activate it.

```shell
python -m venv .venv
. .venv/bin/activate
```

- Install the package manager `poetry` and install the dependencies. You can check the dependencies in the `pyproject.toml` file.

```shell
pip install poetry
poetry install
```
## Adding remote storage

You can use remote storage to manage your data. We use AWS S3 in the example code. You have to create a new bucket in your AWS account; in this example, the bucket name is `dvc-pipeline-2024`. The following command adds this bucket as the default remote storage.

```shell
dvc remote add -d remote_storage s3://dvc-pipeline-2024
```

In the `.dvc/config` file, you can see the remote storage configuration.

```
[core]
    remote = remote_storage
['remote "remote_storage"']
    url = s3://dvc-pipeline-2024
```
# Managing ML pipelines with DVC

## Defining a pipeline
In this section, we will explore how to manage a machine learning pipeline using the `dvc.yaml` file. The pipeline consists of the following steps:

- Processing the data
- Splitting the data into training and testing sets
- Training a model
- Evaluating the model

The `dvc.yaml` file outlines the pipeline, detailing each corresponding stage. Each stage in our workflow comprises a command to execute, as well as its dependencies, parameters, and outputs. For instance, consider the `process` stage:
```yaml
process:
  cmd: python src/process.py        # The command to be executed
  deps:                             # Required files
    - data/source/Housing.csv
  params:                           # Parameters defined in params.yaml
    - process.data_source_path
    - process.data_processed_path
  outs:                             # Output files
    - data/processed/processed.csv
```
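The parameters referenced in the stage live in a separate params file. A possible `params.yaml` matching this stage could look as follows; the values here are assumptions inferred from the stage definition, not copied from the repository.

```yaml
# Hypothetical params.yaml entries for the process stage; the actual
# values in the example repository may differ.
process:
  data_source_path: data/source/Housing.csv
  data_processed_path: data/processed/processed.csv
```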
DVC automatically tracks the data and the files produced at each stage, so you don't have to track them manually.

You can then visualize the pipeline with the `dvc dag` command.
```
+-----------------------------+
| data/source/Housing.csv.dvc |
+-----------------------------+
               *
               *
               *
          +---------+
          | process |
          +---------+
               *
               *
               *
        +------------+
        | split_data |
        +------------+
         **        **
       **            *
      *               **
+-------+               *
| train |             **
+-------+            *
        **         **
          **      **
            *    *
        +----------+
        | evaluate |
        +----------+
```
## Running the pipeline

The pipeline can be executed with the `dvc repro` command. Upon completion, a `dvc.lock` state file is generated as a snapshot of the results. To view the metrics of the pipeline, use the `dvc metrics show` command. Additionally, this pipeline produces a feature importance plot in the `metrics/plots` directory.

```
Path                  mse.test    mse.train
metrics/metrics.json  0.06426     0.01621
```
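The metrics shown above are read from a JSON file written by the evaluation stage. As a minimal sketch of how such a file could be produced (the `mse` helper and the example numbers are assumptions; the real script likely uses scikit-learn):

```python
# Sketch of how an evaluation step could write the metrics file that
# `dvc metrics show` displays. The MSE helper avoids external dependencies.
import json

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

metrics = {
    "mse": {
        "train": mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]),
        "test": mse([4.0, 5.0], [3.8, 5.3]),
    }
}
with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```

Declaring this file under a `metrics` key in `dvc.yaml` is what lets DVC render it as the table above.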
## Running experiments with different parameters

- To run experiments with different numbers of estimators, you can create a queue of experiments with the command below. The `--name` flag helps you identify the experiment, and the `-S` flag sets the parameters. If you need to set multiple values for a parameter, you can append `-S parameter=value1,value2` to the command.

```shell
dvc exp run \
  --name "n-estimator-size-second" \
  --queue \
  -S "train.n_estimators=100,150,200"
```

- Execute the queued experiments.

```shell
dvc exp run --run-all
```

- Check the results of the experiments.

```shell
dvc exp show
```
# Using DVC Studio to track experiments

## Setting up DVC Studio and adding credentials
DVC Studio is a web-based interface that allows you to share your machine learning projects with your team. To set up DVC Studio, follow the steps outlined in this guide. After the setup, when checking the project summary, you may see missing metrics. This is because the metrics are tracked by DVC and stored in remote storage rather than on GitHub.
To allow DVC Studio to access the metrics, you need to set up credentials. First, navigate to the `Settings` tab of the project and click on the `Data remotes / cloud storage credentials` section. Then, add new credentials for the remote storage. For example, if you are using AWS S3, the credentials should have read access to the S3 buckets.
## Automatically pushing the experiments to DVC Studio

To enable automatic pushing of experiments to DVC Studio, run the `dvc studio login` command to log in to DVC Studio. Once you rerun your experiments as outlined in the previous section, you can view the results in DVC Studio.
# Managing ML model lifecycles with DVC Studio

After several rounds of experiments, a matured model can be registered and assigned to stages such as `dev` for review or `prod` for deployment. However, the journey does not end after deployment. The model must be monitored and retrained to prevent degradation. In this section, we will explore how to manage the ML model lifecycle with DVC Studio.
## Registering the model with a new version

First, go to the `Models` tab, where you will see the `house-price-predictor-model` as defined in the `dvc.yaml` file. Click `Register`, enter a version number, and then click `Register version` in the pop-up window.

Registration will automatically create an annotated Git tag in the linked GitHub repository. You can view the tag in the `Code` tab of the repository. This makes it easy to track versions using the tags in the future.
## Moving your model to a stage

After registering the model, you can assign it to a stage. Click the `Assign stage` button and name the stage `dev`. Once the stage is assigned, a new tag is automatically created in the GitHub repository to reflect the stage.
## Setting up GitHub Actions for deployment

In the following scenario, we will move the model to the `prod` stage, simulating its deployment. We will use GitHub Actions to emulate the deployment process. To enable this, you need to add an access token to the GitHub repository, allowing GitHub Actions to access the DVC remote storage.

Go to the `Settings` tab of the project and add the DVC Studio access token to the GitHub repository secrets with the name `DVC_STUDIO_TOKEN`. The process is illustrated in the screenshot below.

You can find a template GitHub Actions workflow in the `.github` folder of the example code. Essentially, this workflow triggers on every tag creation event and checks if the stage is `prod`. If it is, the workflow downloads the model from remote storage and deploys it (in this example, no actual deployment occurs; only a message is printed). For a detailed explanation of this workflow, see here.
## Deploying the model

Now, move the model to the `prod` stage as before. After doing so, you should see that a GitHub Actions workflow is triggered and the printed message appears in the log.

You can then repeat the above steps to create a new version of the model and assign it to different stages to start a new ML model lifecycle.
# Summary

- Using `dvc.yaml` allows us to manage machine learning pipelines and specify artifacts and dependencies.
- DVC Studio offers a web-based interface to track experiments and share projects with your team.
- ML model lifecycles can be managed with DVC Studio by registering models and assigning them to different stages for deployment.
# Learn more

- DIY Data Version Control (DVC). You can find more materials about DVC through the provided link.