How to create a reproducible pipeline with Moja Global Dataset
To add datasets to the Land sector repository, we start by cloning this repository. Check out this article to understand how to clone a repository on GitHub.
Next, we go through the library and find the raw dataset that we want to
process, in this case, the HarmonizedWorldSoilDatabase dataset. It
is important to note that the HarmonizedWorldSoilDatabase is a test
dataset. Each dataset will have specific transformation steps to prepare
it for analysis, but each step can be run using our DVC pipelines. Each
pipeline follows the same broad steps: download, extract, and transform,
then save the results to remote storage.
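The broad steps can be sketched in plain Python. This is only an illustrative skeleton, not code from the repository: the function names `download`, `extract`, `transform`, and `run_pipeline` are hypothetical, a local file copy stands in for the real download, and the transform is a placeholder where a dataset-specific script such as bilToTif.py would run. Saving to remote storage is left to DVC.

```python
import shutil
import zipfile
from pathlib import Path


def download(source: Path, workdir: Path) -> Path:
    """Fetch the raw archive. A local copy stands in for a real download."""
    workdir.mkdir(parents=True, exist_ok=True)
    target = workdir / source.name
    shutil.copy(source, target)
    return target


def extract(archive: Path, workdir: Path) -> Path:
    """Unpack the zip archive and return the directory with its contents."""
    out_dir = workdir / "extracted"
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
    return out_dir


def transform(raw_dir: Path, workdir: Path) -> Path:
    """Placeholder transform: a real stage would call its GDAL script here."""
    out_dir = workdir / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(raw_dir.iterdir()):
        if path.is_file():
            shutil.copy(path, out_dir / path.name)
    return out_dir


def run_pipeline(source: Path, workdir: Path) -> Path:
    """Run download, extract, and transform in order; return the output dir."""
    archive = download(source, workdir)
    raw_dir = extract(archive, workdir)
    return transform(raw_dir, workdir)
```

Keeping each step as its own function mirrors how a DVC stage declares its inputs and outputs separately, so a single step can be rerun without repeating the others.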
In the Land_Sector_Datasets repository we just cloned, navigate to
the bilToTif.py file, which is nested in the Data/Soil
directory.
This Data/Soil/bilToTif.py file contains a BILtoTIF function
that takes the path to the raw HarmonizedWorldSoilDatabase dataset
and an output path as parameters. It converts the dataset from a
.bil file to a .tif file and saves the result at the output path.
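import os

from osgeo import gdal


def BILtoTIF(inBilPath, outTifPath):
    # Open the raw .bil raster and copy it into a GeoTIFF
    inBil = gdal.Open(inBilPath)
    driver = gdal.GetDriverByName('GTiff')
    outTif = driver.CreateCopy(outTifPath, inBil, 0)
    # Drop the references so GDAL flushes and closes both files
    inBil = None
    outTif = None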
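Next, in the Data/Soil/bilToTif.py file, we have a
restructureTIF function that restructures the .tif dataset using
the warp options set inside the function: DEFLATE compression,
reprojection to EPSG:4326, a 0.005 degree pixel size, and
nearest-neighbour resampling.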
def restructureTIF(in_tif, out_tif):
    # Reproject to EPSG:4326 at 0.005 degree resolution, DEFLATE-compressed
    options = gdal.WarpOptions(
        creationOptions=["COMPRESS=DEFLATE", "PREDICTOR=2", "ZLEVEL=9"],
        dstSRS="EPSG:4326",
        format="GTiff",
        multithread=True,
        xRes=0.005, yRes=0.005,
        resampleAlg="near",
    )
    gdal.Warp(destNameOrDestDS=out_tif, srcDSOrSrcDSTab=in_tif, options=options)


if __name__ == "__main__":
    in_src = os.path.join("HWSD_RASTER", "hwsd.bil")
    out_src = os.path.join("HWSD_VECTOR", "HarmonizedWorldSoilDatabase_RAW.tif")
    BILtoTIF(in_src, out_src)
    # Restructure only if the conversion actually produced the raw .tif
    if os.path.isfile(out_src):
        restructuredTifPath = os.path.join("HWSD_VECTOR", "HarmonizedWorldSoilDatabase_RESTRUCTURED.tif")
        restructureTIF(in_tif=out_src, out_tif=restructuredTifPath)
We can see the different options specified in the code block above.
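To get a feel for what the xRes=0.005 and yRes=0.005 options imply, the short calculation below estimates the size of the output grid. It assumes a full global extent of -180 to 180 degrees longitude and -90 to 90 degrees latitude, which is an assumption for illustration and not something the script states. The helper name `grid_size` is hypothetical, and GDAL's own rounding of the extent may differ slightly.

```python
def grid_size(x_min, x_max, y_min, y_max, res):
    """Approximate number of columns and rows for a given extent and pixel size."""
    cols = round((x_max - x_min) / res)
    rows = round((y_max - y_min) / res)
    return cols, rows


# Assumed global extent in degrees (EPSG:4326), pixel size taken from the script
cols, rows = grid_size(-180.0, 180.0, -90.0, 90.0, 0.005)
print(cols, rows)    # 72000 36000
print(cols * rows)   # 2592000000
```

A grid of roughly 2.6 billion pixels helps explain why the script sets the COMPRESS=DEFLATE, PREDICTOR=2, and ZLEVEL=9 creation options to keep the output file manageable.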
The project’s root directory has a dvc.yaml file. This dvc.yaml
file contains an established DVC pipeline.
vars:
  - python: C:\Develop\anaconda\envs\gdal\python.exe
stages:
  HarmonizedWorldSoilDatabase:
    cmd: python bilToTif.py
    wdir: Data/Soil
    deps:
      - HWSD_RASTER\hwsd.bil
      - bilToTif.py
    outs:
      - HWSD_VECTOR\HarmonizedWorldSoilDatabase_RAW.tif
      - HWSD_VECTOR\HarmonizedWorldSoilDatabase_RESTRUCTURED.tif
The DVC pipeline does the following:
* Runs the bilToTif.py script from the Data/Soil directory
* Tracks the two output .tif files so that the processed data can be pushed to remote storage with dvc push
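In practice these two steps correspond to the DVC commands `dvc repro` and `dvc push`. The sketch below wraps them in Python; the helper names `pipeline_commands` and `run` are hypothetical, and the script defaults to a dry run that only prints the commands, so it can be tried without DVC or a configured remote.

```python
import subprocess

# Stage name as it appears in dvc.yaml
STAGE = "HarmonizedWorldSoilDatabase"


def pipeline_commands(stage=STAGE):
    """Return the DVC commands that reproduce a stage and upload its outputs."""
    return [
        ["dvc", "repro", stage],  # re-run the stage if its dependencies changed
        ["dvc", "push"],          # upload tracked outputs to the default remote
    ]


def run(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(pipeline_commands(), dry_run=True)
```

Calling `run(pipeline_commands(), dry_run=False)` from the repository root would execute both commands, assuming DVC is installed and a remote has been configured.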
In the root directory, there is a .github folder that contains a
workflows folder. This .github/workflows directory contains a
health-check.yaml file.
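The .github/workflows/health-check.yaml file holds the GitHub Actions
workflow that accompanies this DVC pipeline. As written, it sets up DVC
and CML, lists the files and directories currently tracked by DVC, and
posts that list as a comment.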
name: dvc-report
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - uses: iterative/setup-dvc@v1
      - uses: iterative/setup-cml@v1
      - name: add dvc
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "# DVC REPORT" > report.md
          echo "## Files and Directories currently tracked" >> report.md
          dvc list -R --dvc-only . >> report.md
          cml-send-comment report.md
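It is important to note that the workflow itself is triggered on every push (on: [push]). The DVC stage, however, is only re-executed when one of its dependencies changes, for example when the Python script is edited and would produce a different data output.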
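The idea of re-running a stage only when something it depends on has changed can be illustrated with file hashes. The sketch below is a simplified stand-in and not DVC's actual implementation: the helper names and the JSON lock file are hypothetical, whereas DVC records its hashes in its own dvc.lock file.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_hashes(deps):
    """Map each dependency path to its current hash."""
    return {str(p): file_md5(p) for p in deps}


def stage_changed(deps, lock_path):
    """True when any dependency hash differs from the recorded one."""
    lock_path = Path(lock_path)
    recorded = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    return current_hashes(deps) != recorded


def record_hashes(deps, lock_path):
    """Store the current hashes after a successful run."""
    Path(lock_path).write_text(json.dumps(current_hashes(deps), indent=2))
```

A runner built on this would call `stage_changed` with the stage's dependencies, such as bilToTif.py and the input .bil file, execute the stage only when it returns True, and then call `record_hashes`.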