How to create a reproducible pipeline with Moja Global Dataset.#
To add datasets to the Land sector repository, we start by cloning this repository. Check out this article to understand how to clone a repository on GitHub.
Next, we go through the library and find the raw dataset that we want to
process, in this case, the HarmonizedWorldSoilDatabase
dataset. It
is important to note that the HarmonizedWorldSoilDatabase
is a test
dataset. Each dataset will have specific transformation steps to prepare
analysis, but each step can be run using our DVC pipelines. Each
pipeline will follow the broad steps of download, extract and transform,
then save the results on the remote storage.
In the Land_Sector_Datasets
repository we just cloned, navigate to
the bilToTif.py
file, which is nested in the Data/Soil
directory.
This Data/Soil/bilToTif.py
file contains a BILtoTIF
function
that takes the raw HarmonizedWorldSoilDatabase
dataset as a
parameter and then converts the dataset from a .bil
file to a
.tif
file. The BILtoTIF
function then saves the .tif
file.
def BILtoTIF(inBilPath,outTifPath):
inBil = gdal.Open(inBilPath)
driver = gdal.GetDriverByName('GTiff')
outTif = driver.CreateCopy(outTifPath, inBil, 0)
inBil = None
outTif = None
Next, in the Data/Soil/bilToTif.py
file, we have a
restructureTIF
function that restructures the .tif
dataset using
the options specified in the restructureTIF
.
def restructureTIF(in_tif, out_tif):
options = gdal.WarpOptions(
creationOptions=["COMPRESS=DEFLATE","PREDICTOR=2","ZLEVEL=9"],
dstSRS="EPSG:4326",
format="GTiff",
multithread=True,
xRes=0.005, yRes=0.005,
resampleAlg="near",
)
gdal.Warp(destNameOrDestDS=out_tif, srcDSOrSrcDSTab=in_tif, options=options)
if __name__ == "__main__":
in_src = os.path.join("HWSD_RASTER","hwsd.bil")
out_src = os.path.join("HWSD_VECTOR","HarmonizedWorldSoilDatabase_RAW.tif")
BILtoTIF(in_src,out_src)
if os.path.isfile(out_src):
restructuredTifPath = os.path.join("HWSD_VECTOR","HarmonizedWorldSoilDatabase_RESTRUCTURED.tif")
restructureTIF(in_tif=out_src, out_tif=restructuredTifPath)
We can see the different options specified in the code block above.
The project’s root directory has a dvc.yaml
file. This dvc.yaml
file contains an established DVC pipeline.
vars:
- python: C:\Develop\anaconda\envs\gdal\python.exe
stages:
HarmonizedWorldSoilDatabase:
cmd: python bilToTif.py
wdir: Data/Soil
deps:
- HWSD_RASTER\hwsd.bil
- bilToTif.py
outs:
- HWSD_VECTOR\HarmonizedWorldSoilDatabase_RAW.tif
- HWSD_VECTOR\HarmonizedWorldSoilDatabase_RESTRUCTURED.tif
The DVC pipeline does the following:
* Runs the bilToTif
script
* Pushes the processed data to a remote storage
In the root directory, there is a .github
folder that contains a
workflows
folder. This .github/workflows
directory contains a
health-check.yaml
file.
The github/workflows/health-check.yaml
file holds the GitHub Actions
responsible for recreating this DVC pipeline.
name: dvc-report
on: [push]
jobs:
run:
runs-on: [ubuntu-latest]
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
with:
python-version: '3.x'
- uses: iterative/setup-dvc@v1
- uses: iterative/setup-cml@v1
- name: add dvc
env:
repo_token: ${{ secrets.GITHUB_TOKEN }}
run: |
echo "# DVC REPORT" > report.md
echo "## Files and Directories currently tracked" >> report.md
dvc list -R --dvc-only . >> report.md
cml-send-comment report.md
It is important to note that GitHub Actions only runs when there is a change in the python script, producing a different data output.