Kedro: Software Engineering Principles for Data Science

Prathit Pannase
3 min read · Oct 27, 2020

Kedro, in simple terms, is a tool that applies software engineering principles to data science and data engineering work, making it easier to deploy machine learning models in production. It was developed by QuantumBlack, McKinsey's analytics firm, and was later open-sourced.

Why do we need Kedro?

Imagine you are writing a piece of code that performs some data analysis, or that uses machine learning to provide insights about a particular process within the company. You share the results with your manager and the topic ends there. A few months later, the manager asks you to run the same analysis again, or perhaps to add a new parameter. The problem is that you hardcoded most of the data file paths, the environment you used to run the model no longer works, or you have since overwritten your code without any version control. Many such issues can accumulate over a longer period of time.

Most data scientists today focus on machine learning or statistics, but very few focus on code quality or on applying software engineering principles while developing their solutions.

How Kedro helps with Production Level Machine Learning

  • It provides an easy-to-use project template, allowing all collaborators to quickly understand the project
  • It integrates easily with Docker
  • It promotes reusable analytics code, so that you don’t have to start from scratch every time

Quote by QuantumBlack: The only useful data science code is production-level.

Kedro in terms of Movie Recommendation System

To try out Kedro, I used the following documentation. It provides a step-by-step process for installing Kedro and setting it up for your particular project.

The way Kedro helped me with the Movie Streaming Data project is that it sets up a well-defined structure. As the image below shows, all the datasets go into the data folder, in keeping with the software engineering principle of not hardcoding your data paths.
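For example, instead of hardcoding a CSV path inside a script, a Kedro project declares each dataset once in the data catalog, and code then refers to it by name. A minimal sketch of such an entry (the dataset name and file path here are illustrative, not from my actual project):

```yaml
# conf/base/catalog.yml -- dataset name and path are illustrative
movie_ratings:
  type: pandas.CSVDataSet
  filepath: data/01_raw/movie_ratings.csv
```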

As shown below, it also helps to modularize the data engineering and data science code into separate pipelines, while providing a notebooks folder to store all your notebooks.
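Conceptually, each pipeline is a sequence of small, pure functions (nodes) whose outputs feed the next step. Here is a plain-Python sketch of that idea; a real project would wire these up as `kedro.pipeline.node` objects via the data catalog, and the function names below are illustrative:

```python
# Conceptual sketch of Kedro's node/pipeline idea in plain Python.

def clean_ratings(raw_ratings):
    """Data engineering step: drop records with missing ratings."""
    return [r for r in raw_ratings if r.get("rating") is not None]

def average_rating(ratings):
    """Data science step: compute the mean rating."""
    return sum(r["rating"] for r in ratings) / len(ratings)

def run_pipeline(steps, data):
    """Run each step on the output of the previous one."""
    for step in steps:
        data = step(data)
    return data

raw = [{"rating": 4}, {"rating": None}, {"rating": 5}]
print(run_pipeline([clean_ratings, average_rating], raw))  # 4.5
```

Keeping the data engineering and data science steps as separate functions like this is what makes each of them reusable and independently testable.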

This is how I utilized Kedro to adopt a more software-engineering-oriented framework for the movie streaming project.

Strengths of Kedro

  • Support for a wide range of data formats
  • Nested pipelines (one pipeline can be used as a sub-pipeline of another)
  • Built-in dataset connectors for CSV, YAML, Spark, SQL, and public cloud storage (AWS S3, GCP GCS)

Limitations of Kedro

  • No automatic support for resuming a pipeline from intermediate data files or databases
  • The user interface does not provide a progress-monitoring feature
  • Many unused package dependencies are included in requirements.txt
