{"id":3886,"date":"2022-05-25T13:04:35","date_gmt":"2022-05-25T11:04:35","guid":{"rendered":"https:\/\/dev.littlebigcode.fr\/how-dvc-manages-data-sets-training-ml-models-git\/"},"modified":"2022-07-04T23:27:39","modified_gmt":"2022-07-04T21:27:39","slug":"how-dvc-manages-data-sets-training-ml-models-git","status":"publish","type":"post","link":"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/","title":{"rendered":"MLOps : How DVC smartly manages your data sets for training your machine learning models on top of Git"},"content":{"rendered":"<p>[et_pb_section fb_built=&#8221;1&#8243; admin_label=&#8221;section&#8221; _builder_version=&#8221;4.16&#8243; da_disable_devices=&#8221;off|off|off&#8221; global_colors_info=&#8221;{}&#8221; da_is_popup=&#8221;off&#8221; da_exit_intent=&#8221;off&#8221; da_has_close=&#8221;on&#8221; da_alt_close=&#8221;off&#8221; da_dark_close=&#8221;off&#8221; da_not_modal=&#8221;on&#8221; da_is_singular=&#8221;off&#8221; da_with_loader=&#8221;off&#8221; da_has_shadow=&#8221;on&#8221;][et_pb_row admin_label=&#8221;row&#8221; _builder_version=&#8221;4.16&#8243; background_size=&#8221;initial&#8221; background_position=&#8221;top_left&#8221; background_repeat=&#8221;repeat&#8221; custom_padding=&#8221;3px||3px||true|false&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;4.16&#8243; custom_padding=&#8221;|||&#8221; global_colors_info=&#8221;{}&#8221; custom_padding__hover=&#8221;|||&#8221;][et_pb_text admin_label=&#8221;Text&#8221; _builder_version=&#8221;4.17.4&#8243; text_font=&#8221;Average Sans||||||||&#8221; text_text_color=&#8221;#242B57&#8243; link_font=&#8221;Average Sans||||||||&#8221; link_text_color=&#8221;#1CACE4&#8243; ul_font=&#8221;Average Sans||||||||&#8221; ul_text_color=&#8221;#242B57&#8243; ol_text_color=&#8221;#242B57&#8243; quote_font=&#8221;Average Sans||||||||&#8221; quote_text_color=&#8221;#242B57&#8243; 
header_text_color=&#8221;#1CACE4&#8243; header_2_text_color=&#8221;#1CACE4&#8243; header_3_text_color=&#8221;#1CACE4&#8243; header_4_font=&#8221;Average Sans||||||||&#8221; header_4_text_color=&#8221;#1CACE4&#8243; header_5_font=&#8221;Century Gothic Bold||||||||&#8221; header_5_text_color=&#8221;#1CACE4&#8243; header_6_font=&#8221;Century Gothic Bold||||||||&#8221; header_6_text_color=&#8221;#1CACE4&#8243; background_size=&#8221;initial&#8221; background_position=&#8221;top_left&#8221; background_repeat=&#8221;repeat&#8221; custom_padding=&#8221;8px|||||&#8221; inline_fonts=&#8221;Century Gothic Bold,Century Gothic,Average Sans&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<p style=\"text-align: justify;\" data-renderer-start-pos=\"3\">This article belongs to a series of articles about MLOps tools and practices for data and model experiment tracking. In the first part, we explained why data and model experiment tracking was important, and how tools like DVC and Mlflow could solve this challenge. Today, we\u2019ll see how Data Version Control (DVC) smartly manages your data sets for training your machine learning models on top of Git.<\/p>\n<p style=\"text-align: justify;\" data-renderer-start-pos=\"3\">By <a href=\"https:\/\/www.linkedin.com\/in\/samson-zhang-887135115\/\">Samson ZHANG<\/a>, Data Scientist at LittleBigCode<\/p>\n<p style=\"text-align: justify;\" data-renderer-start-pos=\"3\"><span id=\"43e87c34-ef6b-4af2-9050-12bf6aa42cf2\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"43e87c34-ef6b-4af2-9050-12bf6aa42cf2\">What are we talking about? 
DVC is an MLOps tool that works on top of Git repositories and has a <\/span><span id=\"2c3f06a0-1615-4906-9e5f-c7af0b0716df\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"2c3f06a0-1615-4906-9e5f-c7af0b0716df\"><span id=\"43e87c34-ef6b-4af2-9050-12bf6aa42cf2\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"43e87c34-ef6b-4af2-9050-12bf6aa42cf2\">similar command line interface and workflow to <\/span><\/span><span id=\"43e87c34-ef6b-4af2-9050-12bf6aa42cf2\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"43e87c34-ef6b-4af2-9050-12bf6aa42cf2\">Git. It is designed to tackle the challenge of data set traceability and reproducibility<\/span> when training data-driven models.<\/p>\n<p>&nbsp;<\/p>\n<h1 data-renderer-start-pos=\"1458\">Why do we need DVC?<\/h1>\n<p>&nbsp;<\/p>\n<p data-renderer-start-pos=\"2071\">All data-driven models require data to be trained. Creating and managing the data sets used for training takes a lot of time and storage space. Depending on the project, there can be up to thousands of versions of the data set used to train the models. Things quickly become muddled when multiple users alter and update the data, which can greatly jeopardize the traceability and reproducibility of experiments.<\/p>\n<p data-renderer-start-pos=\"2501\">In your career as a data scientist, you have probably run into data version tracking issues when exploring and cleaning your data set, just as I have.<\/p>\n<p data-renderer-start-pos=\"2647\"><span id=\"c4bb5b65-3c47-4200-9c56-237f82efe6a6\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"c4bb5b65-3c47-4200-9c56-237f82efe6a6\">For<\/span> instance, I have often worked on computer vision problems with thousands of image and annotation files. 
Counting the raw noisy data, the cleaned data and the preprocessed data, there are already three different versions to keep. And that is without even keeping track of the <span id=\"38d1c691-a5cd-4be8-91c3-e1eebc7c56f7\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"38d1c691-a5cd-4be8-91c3-e1eebc7c56f7\">results of some intermediate processing steps<\/span>!<\/p>\n<p data-renderer-start-pos=\"2944\">Without DVC, a possible approach would be to zip the files and store their hashes (file content checksums) and locations in Git commits. The data set would be fully duplicated for each version, and it would be complicated to update and to keep track of. Just imagine the work you would have to do each and every time you have new data to add or wrong labels to correct!<\/p>\n<p data-renderer-start-pos=\"3304\">This iterative process on the data set <span id=\"e6770e91-05b4-465b-bb46-436a65366fff\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"e6770e91-05b4-465b-bb46-436a65366fff\">applies <\/span>to many data science projects, and it does not scale without proper tools.<\/p>\n<p data-renderer-start-pos=\"3304\">DVC was created precisely to handle this iterative process efficiently.<\/p>\n<p data-renderer-start-pos=\"3304\"><div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'>As its name suggests, DVC is a data-versioning tool. It is primarily designed to track sequential updates of a data set. It is not designed to handle complex tasks such as concurrent labeling and data set cleansing (file removal and modification) in a multi-user setting. 
A third-party specialized tool would be more appropriate for those tasks.<\/div><\/div><\/p>\n<h2 id=\"Why-use-DVC-for-data-version-management-instead-of-other-tools-such-as-Git-or-Mlflow-?\" data-renderer-start-pos=\"3870\">Why use DVC for data version management instead of other tools such as Git or MLflow?<\/h2>\n<p data-renderer-start-pos=\"3958\">MLflow is not designed to track a large number of large files (for instance, thousands of images) as it does not optimize storage for file duplication. Tracking data set versions with MLflow would be inefficient. MLflow itself does not guarantee the reproducibility of a data set used during an experiment run, unless you save the whole data set during each run, which is not scalable.<\/p>\n<p data-renderer-start-pos=\"4335\">G<span id=\"2549793a-c6f4-4e96-b941-0c71f9d0f133\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"2549793a-c6f4-4e96-b941-0c71f9d0f133\">it is unsuited for versioning large files in general<\/span> (especially data sets). Furthermore, saving your data set with your source code can be a serious security risk, as anybody who works on the code can access potentially <span id=\"5b8523f0-3155-45a8-870a-97c33866878e\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"5b8523f0-3155-45a8-870a-97c33866878e\">sensitive<\/span> data (even worse for a public Git repository).<\/p>\n<p data-renderer-start-pos=\"4335\"><div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'>Git-LFS exists as a Git extension for large file storage, but it was not designed with data science in mind. See <a class=\"css-bspq7p\" title=\"https:\/\/dvc.org\/doc\/user-guide\/related-technologies#git-lfs-large-file-storage\" href=\"https:\/\/dvc.org\/doc\/user-guide\/related-technologies#git-lfs-large-file-storage\" data-renderer-mark=\"true\">DVC: About Git-LFS<\/a>. 
DVC can work with any cloud storage (or even a simple SSH server) and, unlike Git-LFS, does not require a dedicated LFS server.<\/div><\/div><\/p>\n<p data-renderer-start-pos=\"4871\">Those are the main reasons for using an additional, dedicated data versioning tool such as DVC to improve your ML<span id=\"bee4b124-d6b9-4459-9874-134463cc66ec\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"bee4b124-d6b9-4459-9874-134463cc66ec\"> <\/span>experiment tracking experience. DVC complements MLflow and <span id=\"0af2a8b8-0e27-405f-9582-f218fd33c77c\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"0af2a8b8-0e27-405f-9582-f218fd33c77c\">Git in order to provide a complete<\/span> ML tracking experience.<\/p>\n<p data-renderer-start-pos=\"5124\">Technically, DVC is a file-versioning tool that can work with <span id=\"3c0e2009-e48b-4afd-926a-296f379e3b1d\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"3c0e2009-e48b-4afd-926a-296f379e3b1d\">any type of data<\/span> (image, text, video) as it saves files. That does not mean, however, that it is well suited to versioning complex data types for ML purposes, such as large video files, because DVC simply tracks file versions with hashes (content checksums). 
For instance, modifying a few seconds of a one-hour video file (several GB) results in two full one-hour video files being stored, which implies a lot of duplication.<\/p>\n<h2 data-hook=\"rcv-block15\"><span style=\"color: #242b57; font-size: x-large;\">How does it work?<\/span><\/h2>\n<p style=\"text-align: justify;\" data-renderer-start-pos=\"3\"><span id=\"43e87c34-ef6b-4af2-9050-12bf6aa42cf2\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"43e87c34-ef6b-4af2-9050-12bf6aa42cf2\">You can version data sets in your Git repository by storing only small *.dvc metafiles (text) tracked by Git commits (cf. figure 1). Like Git, DVC optimizes versioning by storing only the minimal amount of information needed to describe <\/span>the data across all the data set versions in a repository. <strong data-renderer-mark=\"true\">The same file appearing in multiple data set versions is stored only once<\/strong>.<\/p>\n<div id=\"attachment_3715\" style=\"width: 528px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-3715\" class=\"wp-image-3715 aligncenter\" src=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220507-222806.png\" alt=\"\" width=\"518\" height=\"373\" srcset=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220507-222806.png 518w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220507-222806-480x346.png 480w\" sizes=\"(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 518px, 100vw\" \/><p id=\"caption-attachment-3715\" class=\"wp-caption-text\">Figure 1. How DVC works. 
Source: <a class=\"css-bspq7p\" title=\"http:\/\/DVC.org\" href=\"http:\/\/dvc.org\/\" data-renderer-mark=\"true\">DVC.org<\/a><\/p><\/div>\n<p>&nbsp;<\/p>\n<h2 data-hook=\"rcv-block15\"><span style=\"color: #242b57; font-size: x-large;\">Project structure<\/span><\/h2>\n<p data-renderer-start-pos=\"6065\"><span id=\"1789d691-f462-42e3-a4bc-973dbd17e776\" class=\"inline-highlight\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"1789d691-f462-42e3-a4bc-973dbd17e776\">A DVC repository is a Git repository that tracks DVC files. <\/span><span id=\"dc41ad82-5a84-42ce-80ff-42004ac6de81\" class=\"inline-highlight\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"dc41ad82-5a84-42ce-80ff-42004ac6de81\">Setting up a DVC repository and doing data versioning <\/span>is easy.<\/p>\n<p data-renderer-start-pos=\"6186\">Let\u2019s take a look at the composition of a DVC repository:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/zhangsamson\/a589e3c1c79064810f75d8189607c150.js\"><\/script><\/p>\n<p><span id=\"60419152-9433-458d-a2d2-ea338f23f482\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"60419152-9433-458d-a2d2-ea338f23f482\"><\/span> For a Git repository to also be a DVC repository, only two elements are needed:<\/p>\n<p>\u25ba A .dvc\/ subdirectory at the project\u2019s root. This directory mainly contains customizable config files. By default, it also contains the DVC repository cache;<\/p>\n<p>\u25ba *.dvc files<\/p>\n<p>DVC files (<span id=\"fd09e3cb-8d94-433a-a1fa-2552afa3ba5d\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"fd09e3cb-8d94-433a-a1fa-2552afa3ba5d\">*.dvc<\/span>) are the entry points for versioning data. They are metafiles used by DVC to point to the data in a storage space. 
DVC files and a URI of the data storage space (local file system, AWS, Azure, GCP\u2026) are the only information needed for versioning data sets. You can think of *.dvc files as <span id=\"501a33e8-cf0c-4cc1-9398-1e40bac6abd7\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"501a33e8-cf0c-4cc1-9398-1e40bac6abd7\">indexes:<\/span> they are lightweight, easily versionable addresses that point to<span id=\"b951bcce-b167-49ee-abd3-831068e3ae26\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"b951bcce-b167-49ee-abd3-831068e3ae26\"> <\/span>the actual data stored in a more suitable storage space (cloud, local remote storage).<\/p>\n<p>This means that *.dvc files have to be tracked by Git in order to track different versions of a data set. Conversely, if the .dvc file pointing to a data set version is not tracked by Git, that version can become inaccessible (it is not designed to be accessed without .dvc files), although the data will still exist in the storage. <div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'><span id=\"41fcce75-ee71-45b5-88d0-ec22d4e2a4a9\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"41fcce75-ee71-45b5-88d0-ec22d4e2a4a9\">A \u201cremote\u201d storage and a \u201cremote\u201d cache for a DVC repository are, respectively, a storage space and a cache directory that simply live outside of the Git repository. 
<\/span><span id=\"622033db-caec-4bb2-b8c1-724c8c6c1d78\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"622033db-caec-4bb2-b8c1-724c8c6c1d78\"><span id=\"41fcce75-ee71-45b5-88d0-ec22d4e2a4a9\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"41fcce75-ee71-45b5-88d0-ec22d4e2a4a9\">By default, if no remote cache directory is set up, DVC just locally saves cache files to the root of the git repository at .dvc\/cache\/. <\/span><\/span><\/div><\/div><\/p>\n<h2 data-hook=\"rcv-block15\"><span style=\"color: #242b57; font-size: x-large;\">Basic commands<\/span><\/h2>\n<p>Like Git, DVC is configurable (<span id=\"33fb6bea-f795-4fb6-affe-60dc61f56615\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"33fb6bea-f795-4fb6-affe-60dc61f56615\">remote storage<\/span>, scope) and has \u201cadd\u201d, \u201cpush\u201d, \u201cpull\u201d, \u201c<span id=\"34ff8269-9ebc-4a82-9618-cca02a1aff8b\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"34ff8269-9ebc-4a82-9618-cca02a1aff8b\">checkout<\/span>\u201d commands for managing your data files. DVC is compatible with all the main cloud providers: Google Cloud, Microsoft Azure and AWS S3, and it does not have any infrastructure requirements. 
<script src=\"https:\/\/gist.github.com\/zhangsamson\/802b69f31d79625bf2602e4f4928d073.js\"><\/script><\/p>\n<h2 data-hook=\"rcv-block15\"><span style=\"color: #242b57; font-size: x-large;\">How DVC manages data set versions and avoids duplication<\/span><\/h2>\n<p data-renderer-start-pos=\"9196\">The local DVC cache (<a class=\"css-bspq7p\" title=\"https:\/\/dvc.org\/doc\/user-guide\/large-dataset-optimization\" href=\"https:\/\/dvc.org\/doc\/user-guide\/large-dataset-optimization\" data-renderer-mark=\"true\">DVC Cache structure<\/a>) contains all the versioned data sets <strong data-renderer-mark=\"true\">without file duplicates between versions<\/strong>. This cache can be anywhere on the local system. The cached files are made available in the Git repository workspace through a user-specified file link type (copy, reflink, hardlink, symlink; see <a class=\"css-bspq7p\" title=\"https:\/\/dvc.org\/doc\/user-guide\/large-dataset-optimization\" href=\"https:\/\/dvc.org\/doc\/user-guide\/large-dataset-optimization\" data-renderer-mark=\"true\">dvc link types<\/a>) so that the project can access them.<\/p>\n<p data-renderer-start-pos=\"9572\">By default, DVC tries reflink first and falls back to copy, so on most file systems the copy strategy is what is actually used. 
For more details about the file link type, check out the dedicated section \u201cConfigure your DVC cache\u201d of this article.<\/p>\n<div id=\"attachment_3718\" style=\"width: 644px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-3718\" class=\"wp-image-3718 \" src=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220209-232820.png\" alt=\"\" width=\"634\" height=\"378\" srcset=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220209-232820.png 634w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220209-232820-480x286.png 480w\" sizes=\"(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 634px, 100vw\" \/><p id=\"caption-attachment-3718\" class=\"wp-caption-text\">Figure 2. DVC workflow, cache and storage<\/p><\/div>\n<p>&nbsp;<\/p>\n<p>In order to explain how DVC cache optimizes storage space by avoiding files duplication between different versions of a data set, let us look at an <span id=\"b9253296-609a-4fac-8db0-1ae65d386901\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"b9253296-609a-4fac-8db0-1ae65d386901\">example<\/span>:<\/p>\n<div id=\"attachment_3720\" style=\"width: 591px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-3720\" class=\"wp-image-3720 \" src=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220227-185137.png\" alt=\"\" width=\"581\" height=\"1223\" srcset=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220227-185137.png 581w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220227-185137-480x1010.png 480w\" sizes=\"(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 581px, 100vw\" \/><p id=\"caption-attachment-3720\" class=\"wp-caption-text\">Figure 3. 
DVC cache workflow<\/p><\/div>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify;\" data-renderer-start-pos=\"3\">Each \u201cdvc add\u201d command stores a new version of the data set in the DVC cache (cf. figure 3). Each file is <strong data-renderer-mark=\"true\">saved only once<\/strong> (for any version), and deduplication is ensured by comparing file checksums between versions. An internal database maps each file to each data version it belongs to.<\/p>\n<p data-renderer-start-pos=\"3\"><div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'>\u201cdvc add\u201d is roughly equivalent to \u201cgit add\u201d followed by \u201cgit commit\u201d in Git.<\/div><\/div><\/p>\n<p data-renderer-start-pos=\"3\"><div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'>Use the DVC garbage collector <span data-inline-card=\"true\" data-card-url=\"https:\/\/dvc.org\/doc\/command-reference\/gc\"><span class=\"loader-wrapper\"><a class=\"css-1xdzogm eeajecn0\" tabindex=\"0\" role=\"button\" href=\"https:\/\/dvc.org\/doc\/command-reference\/gc\" data-testid=\"inline-card-resolved-view\"><span class=\"css-1p7ax5 e158gagu2\"><span class=\"smart-link-title-wrapper css-0 e158gagu8\">gc<\/span><\/span><\/a><\/span> <\/span> to remove files untracked by DVC (from the current repository) that were pushed to the storage.<\/div><\/div><\/p>\n<p data-renderer-start-pos=\"3\"><div class='et-box et-warning'>\n\t\t\t\t\t<div class='et-box-content'>If you share a data set with multiple projects and you make custom modifications in your downstream projects, the <span data-inline-card=\"true\" data-card-url=\"https:\/\/dvc.org\/doc\/command-reference\/gc\"><a class=\"css-1xdzogm eeajecn0\" tabindex=\"0\" role=\"button\" href=\"https:\/\/dvc.org\/doc\/command-reference\/gc\" data-testid=\"inline-card-resolved-view\"><span class=\"css-1p7ax5 e158gagu2\"><span class=\"smart-link-title-wrapper css-0 e158gagu8\">gc<\/span><\/span><\/a><\/span> command, from a given downstream project, erases data 
that are not tracked by the latter even though they are tracked by other projects. <strong data-renderer-mark=\"true\">This operation is irreversible<\/strong>. It can be useful to create backups before running deletion commands.<\/div><\/div><\/p>\n<p>Even though DVC is built on top of Git, DVC does not have a history system like Git. DVC itself handles no explicit branching logic or commit dependencies. DVC only reasons on <span id=\"4c5d6172-62e1-457a-bd3f-c5c4baef6fea\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"4c5d6172-62e1-457a-bd3f-c5c4baef6fea\">file presence<\/span> and content, by checking hashes, to determine data versions. The dependency logic between data versions is handled by the Git history. This means that you can create different data set versions on d<span id=\"5ef00331-5fcf-4e02-a775-3b21d3da5e56\" class=\"inline-highlight\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"5ef00331-5fcf-4e02-a775-3b21d3da5e56\">ifferent Git branches and DVC <\/span>maps each file ever tracked in the repository to every commit or branch that tracks it.<\/p>\n<div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'>One consequence is that DVC can work stand-alone (without Git) for saving files, but without versioning capabilities.<\/div><\/div>\n<h2 data-hook=\"rcv-block15\"><span style=\"color: #242b57; font-size: x-large;\">DVC best practices<\/span><\/h2>\n<p data-renderer-start-pos=\"11467\">After some experimentation with DVC, here are a few good practices I think one should adopt when using it:<\/p>\n<p data-renderer-start-pos=\"11467\"><span id=\"13d48a87-233e-4f2f-9b46-60ed160a8f0c\" class=\"inline-highlight\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"13d48a87-233e-4f2f-9b46-60ed160a8f0c\">\u2022 Use DVC only for data-related tasks 
such as data set versioning, data processing routines. Not for logging <\/span><span id=\"6eb2277f-a473-4c67-acb7-0d94266e5c14\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"6eb2277f-a473-4c67-acb7-0d94266e5c14\"><span id=\"13d48a87-233e-4f2f-9b46-60ed160a8f0c\" class=\"inline-highlight\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"13d48a87-233e-4f2f-9b46-60ed160a8f0c\">experiment metrics<\/span><\/span><span id=\"13d48a87-233e-4f2f-9b46-60ed160a8f0c\" class=\"inline-highlight\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"13d48a87-233e-4f2f-9b46-60ed160a8f0c\"> and model weights. Even though the <\/span><span id=\"66efa751-3577-4a5d-8c75-67098623172b\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"66efa751-3577-4a5d-8c75-67098623172b\"><span id=\"13d48a87-233e-4f2f-9b46-60ed160a8f0c\" class=\"inline-highlight\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"13d48a87-233e-4f2f-9b46-60ed160a8f0c\">DVC documentation<\/span><\/span><span id=\"13d48a87-233e-4f2f-9b46-60ed160a8f0c\" class=\"inline-highlight\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"13d48a87-233e-4f2f-9b46-60ed160a8f0c\"> indicates that those features exist with its versioning capability, DVC is not designed to do experiment runs performance comparison, unlike MLflow, without many tricks and saving unnecessary files to the code base.<\/span><\/p>\n<p data-renderer-start-pos=\"11467\">\u2022 Use <span data-inline-card=\"true\" data-card-url=\"https:\/\/dvc.org\/doc\/use-cases\/data-registries\"><span class=\"loader-wrapper\"><a class=\"css-1xdzogm eeajecn0\" tabindex=\"0\" role=\"button\" 
href=\"https:\/\/dvc.org\/doc\/use-cases\/data-registries\" data-testid=\"inline-card-resolved-view\"><span class=\"css-1p7ax5 e158gagu2\"><span class=\"smart-link-title-wrapper css-0 e158gagu8\">Data Registry<\/span><\/span><\/a><\/span><\/span> whenever <span id=\"5bce8191-c628-49da-96f8-564e7de6b265\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"5bce8191-c628-49da-96f8-564e7de6b265\">possible<\/span> in order to centralize data sets that can be shared across different projects. A DVC data registry is basically a Git repository that contains only DVC files (no code) and can version as many data sets as your organization has. An example data registry setup is developed in the next section of this article.<\/p>\n<p data-renderer-start-pos=\"11467\">\u2022 Unless you explicitly want to share your project\u2019s DVC configuration, such as a remote storage URL for a data registry, <strong data-renderer-mark=\"true\">never<\/strong> use the Git-tracked project configuration (.dvc\/config). Prefer your project\u2019s private configuration <strong data-renderer-mark=\"true\">.dvc\/config.local<\/strong> instead, by passing the <strong data-renderer-mark=\"true\">--local<\/strong> argument to your configuration-modifying commands. Most of the time, your configuration depends on your local workspace (cache location\/type) and you might need to use secrets for cloud remote storage (Azure credentials, \u2026). 
There is no reason to use .dvc\/config for it.<\/p>\n<p data-renderer-start-pos=\"11467\"><script src=\"https:\/\/gist.github.com\/zhangsamson\/b9deff98a06417813ced42b7c8437364.js\"><\/script><\/p>\n<p data-renderer-start-pos=\"11467\"><span id=\"53bb61c9-b345-4a2a-90b4-c5f2cff01d63\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"53bb61c9-b345-4a2a-90b4-c5f2cff01d63\">\u2022 Write descriptive Git commit messages when versioning data sets, otherwise it can become hard to track meaningful changes in the data sets<\/span>. This applies to software engineering in general.<\/p>\n<p data-renderer-start-pos=\"11467\">\u2022 Configure the DVC repository cache instead of relying on the defaults whenever possible. Use an external cache if storage is limited in your Git repository workspace. If you do not need to edit your DVC-tracked files in place, change your cache link type from copy (default) to reflink, hardlink or symlink to save space (see <span data-inline-card=\"true\" data-card-url=\"https:\/\/dvc.org\/doc\/user-guide\/large-dataset-optimization\"><span class=\"loader-wrapper\"><a class=\"css-1xdzogm eeajecn0\" tabindex=\"0\" role=\"button\" href=\"https:\/\/dvc.org\/doc\/user-guide\/large-dataset-optimization\" data-testid=\"inline-card-resolved-view\"><span class=\"css-1p7ax5 e158gagu2\"><span class=\"smart-link-title-wrapper css-0 e158gagu8\">Large Dataset Optimization<\/span><\/span><\/a><\/span><\/span>). Most of the time, the best cache configuration is {reflink,hardlink,symlink} plus an external cache directory on the large disk, for a hardware setup that combines a small SSD with a large HDD\/SSD.<\/p>\n<p data-renderer-start-pos=\"11467\">\u2022 Use <a class=\"css-bspq7p\" title=\"https:\/\/dvc.org\/doc\/command-reference\/install\" href=\"https:\/\/dvc.org\/doc\/command-reference\/install\" data-renderer-mark=\"true\">DVC Git hooks<\/a> to automate common routines (post-checkout, pre-commit, pre-push). 
In a DVC repository, use the \u201cdvc install\u201d command to set up hooks.<\/p>\n<h2 data-hook=\"rcv-block15\"><span style=\"color: #242b57; font-size: x-large;\">Go further in setting up your DVC projects!<\/span><\/h2>\n<p>&nbsp;<\/p>\n<h2 id=\"Set-up-a-data-set-registry\" data-renderer-start-pos=\"13769\">Set up a data set registry<\/h2>\n<p>DVC is simple to use as it is a thin layer over Git repositories. You can directly use an existing Git repository, build a DVC repository on top of it and do data versioning for that Git project. That Git repository then becomes the main entry point for accessing data set versions. This can be enough for experimenting by yourself on a single project, but it quickly becomes inefficient when you want to share the data sets with other projects. In practice, data sets are often reused in multiple projects. This is why you should set up a data registry (cf. figure 4) whenever possible. <div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'>A data registry is composed of a DVC repository and a remote storage. You <strong data-renderer-mark=\"true\">need<\/strong> to set up a default remote storage for your data registry (cf. the \u201cdvc remote add -d\u201d command). 
If you do not set up your default remote storage from the start and only set it up in the later stages of your project, you may encounter issues when trying to check out DVC-tracked files from earlier Git commits in your project with the <a class=\"css-bspq7p\" title=\"https:\/\/dvc.org\/doc\/command-reference\/import\" href=\"https:\/\/dvc.org\/doc\/command-reference\/import\" data-renderer-mark=\"true\">dvc import<\/a>, <a class=\"css-bspq7p\" title=\"https:\/\/dvc.org\/doc\/command-reference\/update\" href=\"https:\/\/dvc.org\/doc\/command-reference\/update\" data-renderer-mark=\"true\">dvc update<\/a> and <a class=\"css-bspq7p\" title=\"https:\/\/dvc.org\/doc\/command-reference\/checkout\" href=\"https:\/\/dvc.org\/doc\/command-reference\/checkout\" data-renderer-mark=\"true\">dvc checkout<\/a> commands, as the .dvc\/config (project default config) would be empty in those commits.<\/div><\/div><\/p>\n<div id=\"attachment_3722\" style=\"width: 710px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-3722\" class=\"wp-image-3722 size-full\" src=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220130-202439.png\" alt=\"\" width=\"700\" height=\"569\" srcset=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220130-202439.png 700w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220130-202439-480x390.png 480w\" sizes=\"(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 700px, 100vw\" \/><p id=\"caption-attachment-3722\" class=\"wp-caption-text\">Figure 4. Data registry in DVC. 
Source: <span data-inline-card=\"true\" data-card-url=\"http:\/\/dvc.org\"><span class=\"loader-wrapper\"><a class=\"css-1xdzogm eeajecn0\" tabindex=\"0\" role=\"button\" href=\"http:\/\/dvc.org\/\" data-testid=\"inline-card-resolved-view\"><span class=\"css-1p7ax5 e158gagu2\"><span class=\"smart-link-title-wrapper css-0 e158gagu8\">Data Version Control \u00b7 DVC<\/span><\/span><\/a><\/span><\/span><\/p><\/div>\n<p>1 \u2022 Let\u2019s start from scratch. First, create a new Git repository and initialize a DVC repository on top of it with <span data-inline-card=\"true\" data-card-url=\"https:\/\/dvc.org\/doc\/command-reference\/init\"><span class=\"loader-wrapper\"><a class=\"css-1xdzogm eeajecn0\" tabindex=\"0\" role=\"button\" href=\"https:\/\/dvc.org\/doc\/command-reference\/init\" data-testid=\"inline-card-resolved-view\"><span class=\"css-1p7ax5 e158gagu2\"><span class=\"smart-link-title-wrapper css-0 e158gagu8\">init<\/span><\/span><\/a><\/span><\/span>. I recommend creating a new conda environment for it. 
\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 <script src=\"https:\/\/gist.github.com\/zhangsamson\/60ef3c774f988bc81d4a2287e23e8778.js\"><\/script><br \/>\n(Highly recommended) Configure the Git hooks for DVC with dvc install<br \/>\n<script src=\"https:\/\/gist.github.com\/zhangsamson\/7fda39f1b5a00cd9014e73dc132ff2de.js\"><\/script>\u00a0 \u00a0 \u00a0 \u00a0 2 \u2022 Set up the remote storage for your DVC repository with <span data-inline-card=\"true\" data-card-url=\"https:\/\/dvc.org\/doc\/command-reference\/remote\/add\"><span class=\"loader-wrapper\"><a class=\"css-1xdzogm eeajecn0\" tabindex=\"0\" role=\"button\" href=\"https:\/\/dvc.org\/doc\/command-reference\/remote\/add\" data-testid=\"inline-card-resolved-view\"><span class=\"css-1p7ax5 e158gagu2\"><span class=\"smart-link-title-wrapper css-0 e158gagu8\">dvc remote add<\/span><\/span><\/a><\/span><\/span>. It can be either local file storage or cloud storage, and the resulting configuration is committed to Git. 
<span id=\"9e07ddcc-b557-45e1-95b8-5b9921b74c1c\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"9e07ddcc-b557-45e1-95b8-5b9921b74c1c\"><span id=\"89421f45-eddc-4c05-90e3-04f9f1b2fe8e\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"89421f45-eddc-4c05-90e3-04f9f1b2fe8e\">(Optional) <\/span><\/span><span id=\"89421f45-eddc-4c05-90e3-04f9f1b2fe8e\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"89421f45-eddc-4c05-90e3-04f9f1b2fe8e\">Install the DVC dependencies for cloud remote storage if necessary (example for Azure):<\/span> \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 <script src=\"https:\/\/gist.github.com\/zhangsamson\/1202f9dd7838b28de11b0ff7e93b1d1f.js\"><\/script><\/p>\n<p>For local file storage:<br \/>\n<script src=\"https:\/\/gist.github.com\/zhangsamson\/009b5dae5c2962a96118d7dc095ca383.js\"><\/script> This modifies .dvc\/config in your repository, which holds your project-wide configuration. See the dvc config command reference for more configuration options. <script src=\"https:\/\/gist.github.com\/zhangsamson\/2779d65bd138076c96c75f337bffd8d5.js\"><\/script><\/p>\n<p><span id=\"f6acab40-5d26-4bb1-971b-03488d2df545\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"f6acab40-5d26-4bb1-971b-03488d2df545\">For Azure storage, assuming you have the necessary credentials, the following commands write the remote URL to your .dvc\/config file and the credentials to .dvc\/config.local, which is kept only on your machine. 
You can find other examples for setting up your Azure storage in <\/span><span data-inline-card=\"true\" data-card-url=\"https:\/\/dvc.org\/doc\/command-reference\/remote\/modify#example-some-azure-authentication-methods\"><span class=\"loader-wrapper\"><a class=\"css-1xdzogm eeajecn0\" tabindex=\"0\" role=\"button\" href=\"https:\/\/dvc.org\/doc\/command-reference\/remote\/modify#example-some-azure-authentication-methods\" data-testid=\"inline-card-resolved-view\"><span class=\"css-1p7ax5 e158gagu2\"><span class=\"smart-link-title-wrapper css-0 e158gagu8\">remote modify<\/span><\/span><\/a><\/span><\/span>, but using a SAS token is recommended.<\/p>\n<p><script src=\"https:\/\/gist.github.com\/zhangsamson\/381f71c8ac848067c62b78c68496620d.js\"><\/script> 3 \u2022 Commit your remote storage config <script src=\"https:\/\/gist.github.com\/zhangsamson\/8d4d868232a2d24f5c16814c16b70210.js\"><\/script><\/p>\n<div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'><span id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\">.dvc\/config is meant to be tracked by Git<\/span><span id=\"11627aa7-aca4-48ad-ab11-462565b04753\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"11627aa7-aca4-48ad-ab11-462565b04753\"><span id=\"36ac4cf3-4ea7-430a-93a3-52febf4d8203\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"36ac4cf3-4ea7-430a-93a3-52febf4d8203\">. You should only put configuration parameters that can be shared in it, such as cloud container URLs (no secrets, SSH keys, etc.). It is well suited to DVC data registries. 
In general, you should put machine-specific settings (custom cache location, cache link type, etc.) in <\/span><\/span><span id=\"46d72a39-6a3e-4e23-9981-a1dda07523ba\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"46d72a39-6a3e-4e23-9981-a1dda07523ba\">.dvc\/config.local<\/span>, which is not tracked by Git (it should be ignored with .gitignore). Pass the --local option to configuration-changing commands to write to it.<\/div><\/div>\n<p><span id=\"a69f32a6-d830-43ae-8aa1-05ccbb89f224\" class=\"inline-highlight\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"a69f32a6-d830-43ae-8aa1-05ccbb89f224\">4 \u2022 Download your first data set version to track with DVC. Note that we download the data from a public DVC data registry with the DVC CLI here, but you can retrieve data however you want<\/span>:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/zhangsamson\/d37813c5a544c3b81f1595f5d6a3ce78.js\"><\/script> 5 \u2022 Track your data set with DVC and commit the *.dvc files to Git to version your data set: <script src=\"https:\/\/gist.github.com\/zhangsamson\/306545b89bd5d44d0d6a77b308735da5.js\"><\/script><\/p>\n<p>Check the content of cats_vs_dogs.dvc. It contains the metadata DVC needs in order to track your data set in the remote: <script src=\"https:\/\/gist.github.com\/zhangsamson\/1b864ad4362b2a8b9e9aa70074ea3343.js\"><\/script> 6 \u2022\u00a0 Push your data set to the remote storage. At this point, you can already use this DVC repository in other projects. 
If you set up a cloud remote storage and push your Git repository to GitHub\/GitLab, you can even share your data registry with anybody: <script src=\"https:\/\/gist.github.com\/zhangsamson\/7efd9def34e944bec2fcf0666a33b0d6.js\"><\/script><br \/>\n7 \u2022 Modify your data set by adding new images and create a new data set version (like steps 3, 4 and 5):<br \/>\n<script src=\"https:\/\/gist.github.com\/zhangsamson\/5e04ca92810d950d97bd8f82b8f7bebd.js\"><\/script> If you check the status of your DVC repository with &#8220;dvc status&#8221;, you will be informed of the changes: <script src=\"https:\/\/gist.github.com\/zhangsamson\/2f81dfe89a52b00b49ae2378873045b1.js\"><\/script><br \/>\nSo add and commit the changes:<br \/>\n<script src=\"https:\/\/gist.github.com\/zhangsamson\/cdf6a72699e4ed1ac817a9ab5383301b.js\"><\/script> 8 \u2022 Your data set registry is set up. Import\/use data from a DVC data registry with the <span data-inline-card=\"true\" data-card-url=\"https:\/\/dvc.org\/doc\/command-reference\/get\"><span class=\"loader-wrapper\"><a class=\"css-1xdzogm eeajecn0\" tabindex=\"0\" role=\"button\" href=\"https:\/\/dvc.org\/doc\/command-reference\/get\" data-testid=\"inline-card-resolved-view\"><span class=\"css-1p7ax5 e158gagu2\"><span class=\"smart-link-title-wrapper css-0 e158gagu8\">get<\/span><\/span><\/a><\/span><\/span> and <span data-inline-card=\"true\" data-card-url=\"https:\/\/dvc.org\/doc\/command-reference\/import\"><span class=\"loader-wrapper\"><a class=\"css-1xdzogm eeajecn0\" tabindex=\"0\" role=\"button\" href=\"https:\/\/dvc.org\/doc\/command-reference\/import\" data-testid=\"inline-card-resolved-view\"><span class=\"css-1p7ax5 e158gagu2\"><span class=\"smart-link-title-wrapper css-0 e158gagu8\">import<\/span><\/span><\/a><\/span><\/span> commands. 
<script src=\"https:\/\/gist.github.com\/zhangsamson\/28a9def9907fee93883a196ee81800b2.js\"><\/script><div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'><span id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\">The \u201cdvc add\u201d command not only versions files with DVC, it also adds your DVC-tracked files to .gitignore files. Make sure that you do not track DVC-tracked files directly with Git.<\/div><\/div><\/span><\/p>\n<p><div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'><span id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\">Remember, DVC only knows which files to track, retrieve and push from the *.dvc files in your repository. Without arguments, \u201cdvc push\u201d only checks the *.dvc files of the current Git commit\/branch and pushes the corresponding data version. To push all the versions of a data set tracked by different Git commits in the project\u2019s history, use \u201cdvc push -AT\u201d to check the *.dvc file versions in all your commits and tags.<\/div><\/div><\/span><\/p>\n<h2 data-hook=\"rcv-block15\"><span style=\"color: #242b57; font-size: x-large;\">Configure your cache<\/span><\/h2>\n<p style=\"text-align: justify;\" data-renderer-start-pos=\"3\">The DVC cache is a content-addressable storage (by default in .dvc\/cache), which adds a layer of indirection between code and data (cf. DVC Cache structure). 
It is the DVC cache that stores the data sets in the local working environment.<\/p>\n<p><strong> Cache type<\/strong><\/p>\n<p>&nbsp;<\/p>\n<div id=\"attachment_3734\" style=\"width: 345px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-3734\" class=\"wp-image-3734 \" src=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/image-20220227-182610.png\" alt=\"\" width=\"335\" height=\"379\" \/><p id=\"caption-attachment-3734\" class=\"wp-caption-text\">Figure 5. The impact of using different file link types on local storage space used with a 100 GB data set<\/p><\/div>\n<p data-renderer-start-pos=\"21184\"><span id=\"bd1bfc12-fd79-4a68-9dfe-d61d5f78b3ed\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"bd1bfc12-fd79-4a68-9dfe-d61d5f78b3ed\"><span id=\"1b494a21-e210-4ff1-8616-c0d4e7732701\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"1b494a21-e210-4ff1-8616-c0d4e7732701\">Reflink<\/span>, hardlink and symlink file link types are particularly useful when you do not want to<span id=\"224e702e-a803-4d44-b990-b7c6ab7f63ba\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"224e702e-a803-4d44-b990-b7c6ab7f63ba\"> have the DVC project cache<\/span> in the same subdirectory as your source code or if you lack space on your <span id=\"5c11a30f-a68e-4fed-9c0c-5a7b5eb7d379\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"5c11a30f-a68e-4fed-9c0c-5a7b5eb7d379\">workspace partition<\/span> (cf. 
figure 5 and <\/span><span data-inline-card=\"true\" data-card-url=\"https:\/\/dvc.org\/doc\/user-guide\/large-dataset-optimization\"><span class=\"loader-wrapper\"><a class=\"css-1xdzogm eeajecn0\" tabindex=\"0\" role=\"button\" href=\"https:\/\/dvc.org\/doc\/user-guide\/large-dataset-optimization\" data-testid=\"inline-card-resolved-view\"><span class=\"css-1p7ax5 e158gagu2\"><span class=\"smart-link-title-wrapper css-0 e158gagu8\">Large Dataset Optimization<\/span><\/span><\/a><\/span><\/span>)<span id=\"bd1bfc12-fd79-4a68-9dfe-d61d5f78b3ed\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"bd1bfc12-fd79-4a68-9dfe-d61d5f78b3ed\">.<\/span> These file link types save space. Usually, you will not work in an environment that provides reflink. These link types avoid having your DVC-tracked data set <span id=\"ea28ef3b-f24a-47ed-8cc9-85b9c6ef84a7\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"ea28ef3b-f24a-47ed-8cc9-85b9c6ef84a7\">duplicated <\/span>both in your cache and in your local Git repository.<\/p>\n<p data-renderer-start-pos=\"21664\">Most of the time, for large data sets (&gt;1 GB), you will want to use reflink, hardlink or symlink, although in-place editing is not available with hardlink and symlink. Depending on the task you are doing on the data set, you might want to edit the data in place, for instance for manual data preprocessing or label fixes. In this case, an editable link type such as reflink or copy should be preferred over hardlink and symlink.<\/p>\n<p data-renderer-start-pos=\"21664\"><div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'>Most of the time, your system will not support reflink, because ext4, the usual default file system on Linux, does not provide it. Furthermore, hardlink and symlink are disabled by default in DVC. 
More often than not, you will enable symlink or hardlink manually for efficiency.<span id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\"><\/div><\/div><\/span><\/p>\n<p data-renderer-start-pos=\"21664\"><span id=\"50625cf7-3164-401d-8c3c-8f3b69bd2ebe\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"50625cf7-3164-401d-8c3c-8f3b69bd2ebe\">You can always switch between different cache types depending on the stage of your machine learning project<\/span>.<\/p>\n<p data-renderer-start-pos=\"21664\"><img class=\"aligncenter wp-image-3781 \" src=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-15.41.23.png\" alt=\"\" width=\"569\" height=\"545\" srcset=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-15.41.23.png 793w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-15.41.23-300x287.png 300w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-15.41.23-768x735.png 768w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-15.41.23-100x96.png 100w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-15.41.23-480x459.png 480w\" sizes=\"(max-width: 569px) 100vw, 569px\" \/><\/p>\n<p data-renderer-start-pos=\"21664\">For instance, if you need to manually edit your data in place, switch back to an editable link type <span data-inline-card=\"true\" data-card-url=\"https:\/\/dvc.org\/doc\/command-reference\/checkout\"><span class=\"loader-wrapper\"><a class=\"css-1xdzogm eeajecn0\" tabindex=\"0\" role=\"button\" 
href=\"https:\/\/dvc.org\/doc\/command-reference\/checkout\" data-testid=\"inline-card-resolved-view\"><span class=\"css-1p7ax5 e158gagu2\"><span class=\"smart-link-title-wrapper css-0 e158gagu8\">checkout<\/span><\/span><\/a><\/span><\/span><\/p>\n<p><script src=\"https:\/\/gist.github.com\/zhangsamson\/9cb5fd1ff1478865a3d3d824475bab84.js\"><\/script> When you no longer need to edit your data set in place manually, or wish to move on to other stages of your project (training, evaluation), just switch to one of the lighter link types to save space: <script src=\"https:\/\/gist.github.com\/zhangsamson\/0e809d47e37aa31bfbbe3d2a03535afc.js\"><\/script><\/p>\n<p><div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'>With hardlink and symlink, the files cannot safely be edited manually in place by default: the workspace file and the cached file share the same data, so editing one can corrupt the cache, and the whole local cache may then have to be erased. It is, however, possible to unlink the files before editing them, or simply to delete them. <span id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\"><\/div><\/div><\/span><br \/>\n<div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'>When using the hardlink or symlink cache type, you should not try to manually edit DVC-tracked files. That said, files generated through dvc.yaml and the dvc repro command will not corrupt the cache, as DVC automatically unlinks the files before overwriting them, if necessary. 
<span id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\"><\/div><\/div><\/span><\/p>\n<p><strong>Cache location<\/strong><\/p>\n<p data-renderer-start-pos=\"24318\">Most of the time, we run machine learning projects on a laptop or desktop with multiple physical disks and limited resources. Typically, an SSD holds your code for fast I\/O access, and an HDD with more storage space holds your data sets.<\/p>\n<p data-renderer-start-pos=\"24558\">By default, the cache is in .dvc\/cache at the root of your Git repository.<\/p>\n<p data-renderer-start-pos=\"24626\">If you have configured the copy cache type for your project, your cache can take up a large share of your disk. You might need to switch your cache type to a lighter one: reflink, hardlink or symlink.<\/p>\n<p data-renderer-start-pos=\"24824\">Or you can just move your cache to another partition that has more space:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/zhangsamson\/d27c2a8b9dc31757a2037209323ce16e.js\"><\/script><\/p>\n<p><strong>About data set merge conflicts with DVC<\/strong><\/p>\n<p>DVC tracks data set versions by leveraging Git\u2019s versioning functionality. DVC has no built-in conflict resolution capability, so it also relies on Git to resolve merge conflicts. As we know, only the *.dvc files are tracked with Git, which means that when there is a merge conflict, only these metafiles are compared. This is often insufficient, because the data set versions themselves are not compared. 
An example of a *.dvc file conflict when merging: <script src=\"https:\/\/gist.github.com\/zhangsamson\/cbd46c8a9fc7db8c5784e0c8d466a7a4.js\"><\/script><\/p>\n<p><strong> The 3 situations we will commonly face<\/strong><br \/>\nTo illustrate, let\u2019s say that 2 people, P1 and P2, are working together on a machine learning project with an image data set. 
P1 works on branch B1 and P2 works on branch B2. The initial data set data\/ (tracked by data.dvc) they are working on is the D1 version:<\/p>\n<p>\u2022 First situation: only one of P1 and P2 modifies the data set. P1 modifies the data set and creates a version D2 of the data set on branch B1. P2 finishes working on a new feature without modifying the data set D1. P2 needs to merge B1 into B2 and resolve the conflict caused by the differing data set versions. As only one of the branches modified the original D1 data set, P2 can just replace their version of data.dvc (in branch B2) with branch B1\u2019s version.<\/p>\n<p>\u2022 Second situation: both P1 and P2 only add non-overlapping images to the data set. P1 only adds new images to the data set D1 and creates D2 on B1. P2 also only adds new images to the data set D1 and creates D3 on B2. Furthermore, the image subsets they added are disjoint. In this case, whoever merges can use the Git merge driver provided by DVC (cf. DVC: Merge conflicts, append-only data set).<\/p>\n<p>\u2022 Third situation: both P1 and P2 modify the data set (removal, addition, modification). P1 modifies the data set and creates a version D2 of the data set in branch B1. P2 also modifies the data set D1 and creates a version D3 in branch B2. P2 needs to merge B1 into B2 and resolve the conflict between data set versions D2 and D3. Here, no assumption is made about the type of modifications made to the data set: there can be removals, additions and modifications to any file of the data set. If you want to actually merge all the modifications from both branches, this is the trickiest situation. Neither Git nor DVC can directly help. You have to merge the data sets manually.<\/p>\n<p><div class='et-box et-info'>\n\t\t\t\t\t<div class='et-box-content'>DVC is not designed to resolve data version conflicts when merging 2 branches that both perform removal or modification operations. 
For complex tasks such as data set cleansing and labelling with multiple people updating the data set at the same time, I highly recommend that you <strong data-renderer-mark=\"true\">do not use<\/strong> DVC\/Git and use <span id=\"6dd9cb4b-72d5-4c47-b212-2d77537a75af\" class=\"inline-highlight\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"6dd9cb4b-72d5-4c47-b212-2d77537a75af\">specialized third-party tools instead.<\/span><span id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"f0d2d88f-ec5a-412d-8b3e-0f02e5e12ec8\"><\/div><\/div><\/span><\/p>\n<p><strong>About hyperparameter tracking with DVC<\/strong><\/p>\n<p data-renderer-start-pos=\"27939\">DVC can also track metrics files and hyperparameters, but MLflow is better suited to the task. For instance, <span id=\"7119dd7b-12bb-4586-9655-1b4ceb767e94\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"7119dd7b-12bb-4586-9655-1b4ceb767e94\">DVC can track <\/span><span id=\"e1a2263e-a03a-4273-8489-0b8a5c2c0c1b\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"e1a2263e-a03a-4273-8489-0b8a5c2c0c1b\"><span id=\"7119dd7b-12bb-4586-9655-1b4ceb767e94\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"7119dd7b-12bb-4586-9655-1b4ceb767e94\">experiment results<\/span><\/span><span id=\"7119dd7b-12bb-4586-9655-1b4ceb767e94\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"7119dd7b-12bb-4586-9655-1b4ceb767e94\"> by committing metrics files (to the DVC and Git repos) and comparing different versions (different commits) of a metrics file using <\/span><a class=\"css-bspq7p\" title=\"https:\/\/dvc.org\/doc\/studio\/overview\" 
href=\"https:\/\/dvc.org\/doc\/studio\/overview\" data-renderer-mark=\"true\"><span id=\"7119dd7b-12bb-4586-9655-1b4ceb767e94\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"7119dd7b-12bb-4586-9655-1b4ceb767e94\">DVC Studio<\/span><\/a><span id=\"7119dd7b-12bb-4586-9655-1b4ceb767e94\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"7119dd7b-12bb-4586-9655-1b4ceb767e94\">.<\/span> Usually, one wants to keep the number of tools to a minimum and to avoid duplicating the results across several of them, such as a Git repository (DVC) and a remote server (MLflow).<\/p>\n<p data-renderer-start-pos=\"28413\">Furthermore, DVC and MLflow take different approaches to metrics versioning:<\/p>\n<ul>\n<li data-renderer-start-pos=\"28501\">DVC tracks experiment metrics in a commit made after model training\/results generation;<\/li>\n<li data-renderer-start-pos=\"28501\">MLflow tracks experiment results against the commit that was current when the models were trained.<\/li>\n<\/ul>\n<p data-renderer-start-pos=\"28680\"><span id=\"df87e081-c663-4bbd-ac3e-04003b9c6a2c\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"df87e081-c663-4bbd-ac3e-04003b9c6a2c\">DVC\u2019s approach is lighter than <\/span>MLflow\u2019s for metrics tracking, as DVC saves <span id=\"56efae73-4dd6-44ef-91f5-f6462b6aba12\" data-renderer-mark=\"true\" data-mark-type=\"annotation\" data-mark-annotation-type=\"inlineComment\" data-id=\"56efae73-4dd6-44ef-91f5-f6462b6aba12\">metric<\/span>s files directly to the Git repo, unlike MLflow, which saves metrics to a remote server. That being said, the two approaches conflict when used at the same time for metrics tracking. MLflow would track an experiment at commit N while DVC would track its results in commit N+1, which is not practical. 
MLflow has the advantage of an auto-logging feature that makes metrics tracking effortless and transparent (in addition to logging artifacts, among other features). With DVC, you have to handle the logging of metrics to files yourself, which is inconvenient.<\/p>\n<p data-renderer-start-pos=\"29352\">For the sake of simplicity, and because this series of articles is focused on experiment tracking using DVC and MLflow, I would recommend using MLflow over DVC for metrics tracking. However, you should know that it is technically possible to track experiment metrics with DVC (with some effort).<\/p>\n<h2 data-hook=\"rcv-block15\"><span style=\"color: #242b57; font-size: x-large;\">Conclusion<\/span><\/h2>\n<p data-renderer-start-pos=\"29662\">DVC is a lightweight file versioning tool, built on top of Git\u2019s versioning capabilities and designed for versioning data sets. It has an optimized cache system that avoids file duplication between different data set versions. Using a third-party tool like DVC makes it possible to decouple the raw data sets used for training machine learning models from the code, by committing to the Git repository only small metafiles that describe the tracked data sets. DVC can also be used for data preprocessing pipelines. 
Its data set registry functionality is particularly useful for managing data set sharing between different data science projects.<\/p>\n<p data-renderer-start-pos=\"30277\">A summary of DVC features, pros &amp; cons:<\/p>\n<p><img class=\"aligncenter wp-image-3749 \" src=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-14.30.18.png\" alt=\"\" width=\"811\" height=\"450\" srcset=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-14.30.18.png 1974w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-14.30.18-300x167.png 300w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-14.30.18-1024x569.png 1024w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-14.30.18-768x426.png 768w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-14.30.18-1536x853.png 1536w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-14.30.18-100x56.png 100w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-14.30.18-1080x600.png 1080w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-14.30.18-1280x711.png 1280w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-14.30.18-980x544.png 980w, https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/Capture-de\u0301cran-2022-05-25-a\u0300-14.30.18-480x267.png 480w\" sizes=\"(max-width: 811px) 100vw, 811px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p data-renderer-start-pos=\"14266\"><strong>This article belongs to a series of articles about MLOps tools and practices for data and model experiment tracking. 
The series comprises four articles: <\/strong><\/p>\n<p>PART 1 (<a href=\"https:\/\/dev.littlebigcode.fr\/mlops-why-data-model-experiment-tracking-is-important\/\">Click here<\/a>): Introduction to data &amp; model experiment tracking<\/p>\n<p>PART 2 (this article): MLOps: How DVC smartly manages your data sets for training your machine learning models on top of Git<\/p>\n<p>PART 3 (coming soon): MLOps: How MLflow effortlessly tracks your experiments and helps you compare them<\/p>\n<p>PART 4 (coming soon): Use case: Effortlessly track your model experiments with DVC and MLflow<\/p>\n<p><strong>Feel free to jump to the other articles if you are already familiar with the concepts!<\/strong><br \/>\n[\/et_pb_text][\/et_pb_column][\/et_pb_row][et_pb_row _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_button button_url=&#8221;https:\/\/dev.littlebigcode.fr\/ressources\/#blog&#8221; url_new_window=&#8221;on&#8221; button_text=&#8221;All our articles&#8221; button_alignment=&#8221;center&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; button_text_size=&#8221;15px&#8221; button_text_color=&#8221;#242B57&#8243; button_bg_color=&#8221;#FFFFFF&#8221; button_font=&#8221;Century Gothic Bold|700|||||||&#8221; button_use_icon=&#8221;on&#8221; button_icon=&#8221;&#xe035;||divi||400&#8243; button_icon_color=&#8221;#FCC002&#8243; button_on_hover=&#8221;off&#8221; global_colors_info=&#8221;{}&#8221; button_bg_color__hover=&#8221;#242B57&#8243; button_border_color__hover=&#8221;#242B57&#8243;][\/et_pb_button][\/et_pb_column][\/et_pb_row][\/et_pb_section]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>What are we talking about? 
DVC is an MLOps tool that works on top of Git repositories and has a command line interface and workflow similar to Git\u2019s. It is designed to tackle the challenge of data set traceability and reproducibility when training data-driven models.<\/p>\n","protected":false},"author":12,"featured_media":3885,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"on","_et_pb_old_content":"<p><strong>Training, coaching, physical and mental health, discipline, effort, perseverance, endurance, confidence\u2026 While these terms are very often and \u201cnaturally\u201d associated with elite sport, they are much less so with entrepreneurship. Yet entrepreneurs must also commit to a complete program in order to establish themselves, perform and last. What if entrepreneurship were an elite discipline too?<\/strong> Don\u2019t they say that talent does not wait for age? Swimming, tennis, football, Formula 1\u2026 Sport is full of examples that confirm this adage. Athletes now reach maturity earlier and earlier, and it is not unusual to see youngsters of barely 18 compete with their elders and even win. We have to accept it: experience is no longer a differentiator! The \u201cblame\u201d lies with science and technology, which have allowed \u201cjuniors\u201d to reach athletic maturity much faster, making up for their lack of experience.<\/p><h2>From sport to entrepreneurship, there is often only one step<\/h2><p>And the reverse is also true. 
Ainsi, il n\u2019est pas rare de voir des personnes plus exp\u00e9riment\u00e9es adopter les usages attribu\u00e9s habituellement aux digital natives, comme les r\u00e9seaux sociaux.<\/p><blockquote><p><em>Voil\u00e0 pourquoi, apr\u00e8s plusieurs ann\u00e9es de pr\u00e9sence sur LinkedIn, j\u2019ai enfin d\u00e9cid\u00e9 de me lancer dans la r\u00e9daction de mon premier article\u00a0! Il n\u2019est jamais trop tard\u2026<\/em><\/p><\/blockquote><p>Mais encore fallait-il trouver un sujet sur lequel je me sentais l\u00e9gitime et qui n\u2019avait pas ou peu \u00e9t\u00e9 trait\u00e9. J\u2019ai donc d\u00e9cid\u00e9 d\u2019aborder les similarit\u00e9s entre mon pass\u00e9 de sportif de haut niveau et mon \u00ab\u00a0job d\u2019entrepreneur\u00a0\u00bb avec un focus tout particulier sur les m\u00e9thodes que j\u2019ai \u00ab\u00a0transpos\u00e9es\u00a0\u00bb entre les deux activit\u00e9s.\u00a0<strong>L\u2019objectif est de partager mon retour d\u2019exp\u00e9rience et, je l\u2019esp\u00e8re, peut-\u00eatre de pouvoir aider de entrepreneur.es dans leur parcours.<\/strong> Cela fait maintenant plus de dix ans que je me suis lanc\u00e9 dans ma premi\u00e8re aventure entrepreneuriale avec la cr\u00e9ation d\u2019une plateforme de VTC lanc\u00e9e en parall\u00e8le de mon premier emploi. Ont suivi deux autres exp\u00e9riences, avec plus ou moins de r\u00e9ussite, mais toujours cette envie d\u2019\u00eatre le plus efficace possible, de durer dans l\u2019exercice malgr\u00e9 les ann\u00e9es et, surtout, de prendre du plaisir au quotidien dans mon job. Or dans le sport, impossible d\u2019\u00eatre efficace, de durer et de prendre du plaisir sans\u2026 entra\u00eenement\u00a0et sans une certaine hygi\u00e8ne de vie\u00a0! <strong>Apr\u00e8s le sport de haut niveau, voici le \u00ab\u00a0job\u00a0\u00bb de haut niveau dont l\u2019entrepreneuriat serait une discipline,<\/strong>\u00a0\u00e0 l\u2019instar du foot, du ski ou encore du cyclisme que je connais bien. 
Voil\u00e0 pourquoi, en tant qu\u2019entrepreneur, il me paraissait \u00e9vident, pour performer, d\u2019appliquer les fondamentaux que j\u2019avais appris dans le sport tout en les adaptant, les am\u00e9liorant et en \u00e9tudiant sans cesse de nouvelles approches en vue de tendre vers l\u2019am\u00e9lioration continue.<\/p><h2>Le code de l\u2019entrepreneur de haut niveau<\/h2><div data-hook=\"rcv-block15\"><p>Certes, il y a et il y aura toujours des personnes plus aptes, plus efficaces, plus matures ou encore plus intelligentes au m\u00eame \u00e2ge par rapport \u00e0 d\u2019autres. N\u00e9anmoins, je reste convaincu que, pour durer, le talent ne suffit pas et que le travail finit toujours par payer\u00a0!<\/p><blockquote><p>Comme dans le sport, l\u2019entrepreneur.e doit \u00e9galement se fixer des objectifs et analyser sa courbe de progression.\u00a0<strong>Ainsi\u00a0:<\/strong><\/p><\/blockquote><ul><li>Tout objectif doit \u00eatre mesurable et atteignable<\/li><li>Pour \u00e9valuer une situation, une progression ou autre, il doit mettre en place des KPI et donc des outils de mesure<\/li><li>Faire son bilan (semestriellement ou annuellement)<\/li><li>Avant d\u2019\u00eatre efficient, il faut \u00eatre efficace, c\u2019est-\u00e0-dire faire d\u2019abord les bonnes actions avant de les faire bien\u00a0!<\/li><\/ul><blockquote><p>\u00c0 l'image des basiques du sport de haut niveau, l'entrepreneuriat repose sur 4 \u00e9lements cl\u00e9s :<\/p><\/blockquote><ol><li>La pr\u00e9paration<\/li><li>L\u2019hygi\u00e8ne de vie<\/li><li>Le mental<\/li><li>L\u2019entourage<\/li><\/ol><h2>1\/Entra\u00eenez-vous<\/h2><p>C\u2019est certainement, le point qui semble le moins pertinent \u00e0 dupliquer lorsque l\u2019on est entrepreneur.e\u2026 Et pourtant \u00e7a n\u2019est pas si difficile. En effet, pour am\u00e9liorer vos performances sportives, vous devez vous entra\u00eener. Alors, pourquoi ne pas le faire dans votre profession\u00a0? 
En r\u00e9alit\u00e9, nous le faisons mais pas forc\u00e9ment sous la forme que l\u2019on imagine avec le sport.\u00a0<strong>Petit rappel concernant l\u2019entra\u00eenement qui se caract\u00e9rise par le fait de :<\/strong><\/p><ul><li>Habituer son corps \u00e0 certains efforts<\/li><li>Acqu\u00e9rir de nouvelles m\u00e9thodes et des automatismes<\/li><li>Bousculer son organisme pour l\u2019obliger \u00e0 progresser<\/li><\/ul><p>De m\u00eame, la pr\u00e9paration, pour s\u2019av\u00e9rer efficace, doit \u00eatre adapt\u00e9e en fonction\u00a0du sport, de votre \u00e9tat de forme et de vos objectifs. De ce fait, on comprend bien que pour un.e entrepreneur.e et\/ou chef.fe d\u2019entreprise, nous n\u2019attendons pas le m\u00eame type d\u2019exercice que pour un cycliste\u2026 \u00c0 titre d\u2019exemple, les d\u00e9veloppeurs ont compris que pour progresser, ils devaient effectuer une veille technologique, r\u00e9aliser des projets personnels, suivre des tutoriels, etc.\u00a0<strong>En bref\u00a0: s\u2019entra\u00eener\u00a0!<\/strong> Ainsi, pour faire \u00e9voluer sa soci\u00e9t\u00e9, un.e dirigeant.e d\u2019entreprise doit aussi \u00e9voluer en \u00e9largissant son champ de comp\u00e9tences, en particulier sur les domaines suivants :<\/p><ul><li>Le management<\/li><li>La communication<\/li><li>La strat\u00e9gie<\/li><li>La gestion<\/li><li>Le business<\/li><li>Les RH<\/li><li>L\u2019organisation\u2026<\/li><\/ul><p>Pour y parvenir, il faut alors se sensibiliser, se documenter et se former.\u00a0<strong>En un mot l\u00e0 encore\u00a0: s\u2019entra\u00eener\u00a0!<\/strong> Si cette prise de conscience est cl\u00e9 dans la vie d\u2019un.e entrepreneur.e, une fois que l\u2019on a compris l\u2019importance de ce point, que faire exactement\u00a0? Comment savoir quel type de\u00a0<em>training<\/em>\u00a0privil\u00e9gier et avec quel timing\u00a0? 
Tout simplement en appliquant une nouvelle fois les principes du sport, \u00e0 savoir\u00a0: 1.\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<strong>D\u00e9finir son \u00e9tat de forme<\/strong>\u00a0avec le fameux test d\u2019effort effectu\u00e9 une \u00e0 deux fois par an par les sportifs de haut niveau. Pour les entrepreneur.es, nous parlons alors de bilan ou d\u2019auto-\u00e9valuation nous permettant de conna\u00eetre pr\u00e9cis\u00e9ment les niveaux de comp\u00e9tences sur chacun des points cit\u00e9s plus haut (liste non exhaustive). 2.\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<strong>D\u00e9finir les objectifs<\/strong>\u00a0en termes de\u00a0performances (donn\u00e9es physiologiques) et de r\u00e9sultats (course). En effet, l\u2019objectif est d\u2019abord d\u2019am\u00e9liorer ses capacit\u00e9s pour viser de meilleurs r\u00e9sultats. Pour un.e dirigeant.e, nous parlons ici d\u2019objectifs personnels et d\u2019objectifs de soci\u00e9t\u00e9. 3.\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<strong>D\u00e9finir un plan d\u2019entra\u00eenement<\/strong>\u00a0associ\u00e9, ce qui \u00e9quivaut \u00e0 l\u2019agenda, \u00e0 la roadmap et\/ou au planning de formation pour un.e chef.fe d\u2019entreprise.<\/p><h3><strong>Quelques tips\u00a0<\/strong><\/h3><p>Concernant son \u00ab auto-\u00e9valuation \u00bb, il convient de construire une matrice d\u2019\u00e9valuation. Sur ce point, internet regorge de litt\u00e9rature et de documentations mais \u00e0 chacun de construire sa matrice et d\u2019apprendre \u00e0 s\u2019auto-\u00e9valuer, et\/ou de se r\u00e9f\u00e9rer \u00e0 une autre personne qui nous conna\u00eet bien (un associ\u00e9 par exemple). Le but est alors d\u2019obtenir une liste de crit\u00e8res (objectifs et\/ou subjectifs) en lien avec notre projet professionnel et personnel, et que l\u2019on pourra \u00e9valuer et comparer d\u2019une ann\u00e9e sur l\u2019autre. 
<strong>Cette \u00e9valuation, une fois effectu\u00e9e, permet de d\u00e9finir votre profil et de mettre en avant\u00a0:<\/strong><\/p><ul><li>Vos points forts<\/li><li>Vos points d\u2019am\u00e9lioration<\/li><li>Votre \u00e9volution par rapport \u00e0 l\u2019ann\u00e9e pr\u00e9c\u00e9dente<\/li><li>Et surtout de vous aider \u00e0 d\u00e9finir nos objectifs<\/li><\/ul><p>Une fois votre profil et vos points forts identifi\u00e9s, il est temps de d\u00e9finir vos objectifs\u00a0! Tout en vous montrant ambitieux, vous devez rester humble et en phase avec vos capacit\u00e9s afin qu\u2019ils soient atteignables et vous mettent en confiance. Id\u00e9alement, on d\u00e9finit des objectifs par trimestre et par ann\u00e9e. Ensuite, vous devez d\u00e9finir les m\u00e9thodes et les moyens n\u00e9cessaires pour y arriver. Mais est-ce possible\u00a0? Si non, quels sont les autres moyens\u00a0?<\/p><h3><strong>Quelques exemples d'objectifs<\/strong><\/h3><p><strong>C\u00f4t\u00e9 sportif, voici les objectifs qu\u2019il est possible de se fixer\u00a0:<\/strong><\/p><ul><li>Dans 6 mois, je souhaite courir le semi-marathon en 1h40<\/li><li>\u00a0Dans 1 an, je le cours en 1h30<\/li><li>Dans 3 ans,\u00a0en 1h20<\/li><\/ul><p>Pour y arriver, je vais devoir augmenter ma \u00ab\u00a0Vitesse Maximale A\u00e9robie VMA\u00a0\u00bb, c\u2019est-\u00e0-dire la vitesse de course \u00e0 laquelle j\u2019atteins ma consommation maximale d'oxyg\u00e8ne. Et pour y parvenir, je dois donc effectuer tel type d\u2019entra\u00eenement, tant de fois par semaine. <strong>Pour un.e entrepreneur.e, les objectifs peuvent ressembler \u00e0 ceux-ci\u00a0:<\/strong><\/p><ul><li>Dans 3 mois, je souhaite livrer la V1 de mon application<\/li><li>Dans 6 mois, je veux g\u00e9n\u00e9rer 1 M\u20ac de CA<\/li><li>Dans 1 an, je dois livrer la V2 et atteindre les 2 M\u20ac de CA<\/li><\/ul><p>Pour atteindre ce but je vais donc devoir recruter X collaborateurs et g\u00e9n\u00e9rer plus de leads business. 
Ainsi, il me faut renforcer mes \u00e9quipes RH et commerciales tout en augmentant ma visibilit\u00e9 gr\u00e2ce au marketing. En ai-je les moyens\u00a0? Non\u00a0! C\u2019est pourquoi, je vais devoir m\u2019atteler \u00e0 recruter et \u00e0 d\u00e9velopper le business moi-m\u00eame, etc. Pour conclure, vous d\u00e9finissez donc vous aussi votre plan d\u2019entra\u00eenement et\/ou votre agenda. Ce dernier peut-\u00eatre un planning hebdomadaire fig\u00e9 avec des cr\u00e9neaux pr\u00e9vus pour faire face aux impr\u00e9vus, associ\u00e9 \u00e0 un agenda plus macro avec les grandes \u00e9tapes.<\/p><\/div><div data-hook=\"rcv-block15\">\u00a0<\/div><div id=\"viewer-e19n7\" class=\"XzvDs _208Ie _1atvN _2QAo- _25MYV _2WrB- _1atvN public-DraftStyleDefault-block-depth0 public-DraftStyleDefault-text-ltr\">\u00a0<\/div>","_et_gb_content_width":""},"categories":[38],"tags":[46],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v19.7.1 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>MLOps : How DVC manages data sets for training your ML models on top of Git<\/title>\n<meta name=\"description\" content=\"What are we talking about ? DVC is a MLOps tool that works on top of Git repositories and has a similar command line interface and workflow to Git. It is designed to tackle the challenge of data sets traceability and reproducibility when training data-driven models.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"MLOps : How DVC manages data sets for training your ML models on top of Git\" \/>\n<meta property=\"og:description\" content=\"What are we talking about ? 
DVC is an MLOps tool that works on top of Git repositories and has a similar command line interface and workflow to Git. It is designed to tackle the challenge of data set traceability and reproducibility when training data-driven models.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/\" \/>\n<meta property=\"og:site_name\" content=\"LittleBigCode.fr\" \/>\n<meta property=\"article:published_time\" content=\"2022-05-25T11:04:35+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-07-04T21:27:39+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/1_8ZPoGDm8Oq172E8rlcriAA.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1400\" \/>\n\t<meta property=\"og:image:height\" content=\"787\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Samson ZHANG\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/\"},\"author\":{\"name\":\"Samson ZHANG\",\"@id\":\"https:\/\/dev.littlebigcode.fr\/#\/schema\/person\/4a7d3bfd7b4a0911435cbf144c1acca9\"},\"headline\":\"MLOps : How DVC smartly manages your data sets for training your machine learning models on top of 
Git\",\"datePublished\":\"2022-05-25T11:04:35+00:00\",\"dateModified\":\"2022-07-04T21:27:39+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/\"},\"wordCount\":4542,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/dev.littlebigcode.fr\/#organization\"},\"keywords\":[\"datascience\"],\"articleSection\":[\"Consulting article\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/\",\"url\":\"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/\",\"name\":\"MLOps : How DVC manages data sets for training your ML models on top of Git\",\"isPartOf\":{\"@id\":\"https:\/\/dev.littlebigcode.fr\/#website\"},\"datePublished\":\"2022-05-25T11:04:35+00:00\",\"dateModified\":\"2022-07-04T21:27:39+00:00\",\"description\":\"What are we talking about? DVC is an MLOps tool that works on top of Git repositories and has a similar command line interface and workflow to Git. 
It is designed to tackle the challenge of data set traceability and reproducibility when training data-driven models.\",\"breadcrumb\":{\"@id\":\"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/dev.littlebigcode.fr\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"MLOps : How DVC smartly manages your data sets for training your machine learning models on top of Git\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/dev.littlebigcode.fr\/#website\",\"url\":\"https:\/\/dev.littlebigcode.fr\/\",\"name\":\"LittleBigCode.fr\",\"description\":\"AI Solution Creator\",\"publisher\":{\"@id\":\"https:\/\/dev.littlebigcode.fr\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/dev.littlebigcode.fr\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/dev.littlebigcode.fr\/#organization\",\"name\":\"LittleBigCode\",\"url\":\"https:\/\/dev.littlebigcode.fr\/\",\"sameAs\":[\"https:\/\/www.linkedin.com\/company\/littlebigcode\/\",\"https:\/\/www.youtube.com\/channel\/UCTEax-7nR6n2zzgL4bz3fWQ\",\"https:\/\/medium.com\/hub-by-littlebigcode\"],\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/dev.littlebigcode.fr\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2021\/08\/Logo-LBC-AISC-format-carre\u0301.png\",\"contentUrl\":\"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2021\/08\/Logo-LBC-AISC-format-carre\u0301.png\",\"width\":768,\"height\":768,\"caption\":\"LittleBigCode\"},\"image\":{\"@id\":\"https:\/\/dev.littlebigcode.fr\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/dev.littlebigcode.fr\/#\/schema\/person\/4a7d3bfd7b4a0911435cbf144c1acca9\",\"name\":\"Samson ZHANG\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/dev.littlebigcode.fr\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/3ccd39671e91a53fb5ea812c4c864941?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/3ccd39671e91a53fb5ea812c4c864941?s=96&d=mm&r=g\",\"caption\":\"Samson ZHANG\"},\"url\":\"https:\/\/dev.littlebigcode.fr\/en\/author\/szhanglittlebigcode-fr\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"MLOps : How DVC manages data sets for training your ML models on top of Git","description":"What are we talking about? DVC is an MLOps tool that works on top of Git repositories and has a similar command line interface and workflow to Git. 
It is designed to tackle the challenge of data set traceability and reproducibility when training data-driven models.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/","og_locale":"en_US","og_type":"article","og_title":"MLOps : How DVC manages data sets for training your ML models on top of Git","og_description":"What are we talking about? DVC is an MLOps tool that works on top of Git repositories and has a similar command line interface and workflow to Git. It is designed to tackle the challenge of data set traceability and reproducibility when training data-driven models.","og_url":"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/","og_site_name":"LittleBigCode.fr","article_published_time":"2022-05-25T11:04:35+00:00","article_modified_time":"2022-07-04T21:27:39+00:00","og_image":[{"width":1400,"height":787,"url":"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2022\/05\/1_8ZPoGDm8Oq172E8rlcriAA.png","type":"image\/png"}],"author":"Samson ZHANG","twitter_card":"summary_large_image","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/#article","isPartOf":{"@id":"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/"},"author":{"name":"Samson ZHANG","@id":"https:\/\/dev.littlebigcode.fr\/#\/schema\/person\/4a7d3bfd7b4a0911435cbf144c1acca9"},"headline":"MLOps : How DVC smartly manages your data sets for training your machine learning models on top of 
Git","datePublished":"2022-05-25T11:04:35+00:00","dateModified":"2022-07-04T21:27:39+00:00","mainEntityOfPage":{"@id":"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/"},"wordCount":4542,"commentCount":0,"publisher":{"@id":"https:\/\/dev.littlebigcode.fr\/#organization"},"keywords":["datascience"],"articleSection":["Consulting article"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/","url":"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/","name":"MLOps : How DVC manages data sets for training your ML models on top of Git","isPartOf":{"@id":"https:\/\/dev.littlebigcode.fr\/#website"},"datePublished":"2022-05-25T11:04:35+00:00","dateModified":"2022-07-04T21:27:39+00:00","description":"What are we talking about? DVC is an MLOps tool that works on top of Git repositories and has a similar command line interface and workflow to Git. 
It is designed to tackle the challenge of data set traceability and reproducibility when training data-driven models.","breadcrumb":{"@id":"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/dev.littlebigcode.fr\/en\/how-dvc-manages-data-sets-training-ml-models-git\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/dev.littlebigcode.fr\/en\/"},{"@type":"ListItem","position":2,"name":"MLOps : How DVC smartly manages your data sets for training your machine learning models on top of Git"}]},{"@type":"WebSite","@id":"https:\/\/dev.littlebigcode.fr\/#website","url":"https:\/\/dev.littlebigcode.fr\/","name":"LittleBigCode.fr","description":"AI Solution Creator","publisher":{"@id":"https:\/\/dev.littlebigcode.fr\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/dev.littlebigcode.fr\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/dev.littlebigcode.fr\/#organization","name":"LittleBigCode","url":"https:\/\/dev.littlebigcode.fr\/","sameAs":["https:\/\/www.linkedin.com\/company\/littlebigcode\/","https:\/\/www.youtube.com\/channel\/UCTEax-7nR6n2zzgL4bz3fWQ","https:\/\/medium.com\/hub-by-littlebigcode"],"logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/dev.littlebigcode.fr\/#\/schema\/logo\/image\/","url":"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2021\/08\/Logo-LBC-AISC-format-carre\u0301.png","contentUrl":"https:\/\/dev.littlebigcode.fr\/wp-content\/uploads\/2021\/08\/Logo-LBC-AISC-format-carre\u0301.png","width":768,"height":768,"caption":"LittleBigCode"},"image":{"@id":"https:\/\/dev.littlebigcode.fr\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/dev.littlebigcode.fr\/#\/schema\/person\/4a7d3bfd7b4a0911435cbf144c1acca9","name":"Samson ZHANG","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/dev.littlebigcode.fr\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/3ccd39671e91a53fb5ea812c4c864941?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/3ccd39671e91a53fb5ea812c4c864941?s=96&d=mm&r=g","caption":"Samson 
ZHANG"},"url":"https:\/\/dev.littlebigcode.fr\/en\/author\/szhanglittlebigcode-fr\/"}]}},"_links":{"self":[{"href":"https:\/\/dev.littlebigcode.fr\/en\/wp-json\/wp\/v2\/posts\/3886"}],"collection":[{"href":"https:\/\/dev.littlebigcode.fr\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dev.littlebigcode.fr\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dev.littlebigcode.fr\/en\/wp-json\/wp\/v2\/users\/12"}],"replies":[{"embeddable":true,"href":"https:\/\/dev.littlebigcode.fr\/en\/wp-json\/wp\/v2\/comments?post=3886"}],"version-history":[{"count":0,"href":"https:\/\/dev.littlebigcode.fr\/en\/wp-json\/wp\/v2\/posts\/3886\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dev.littlebigcode.fr\/en\/wp-json\/wp\/v2\/media\/3885"}],"wp:attachment":[{"href":"https:\/\/dev.littlebigcode.fr\/en\/wp-json\/wp\/v2\/media?parent=3886"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dev.littlebigcode.fr\/en\/wp-json\/wp\/v2\/categories?post=3886"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dev.littlebigcode.fr\/en\/wp-json\/wp\/v2\/tags?post=3886"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}