Experimental: Define data catalog using Python

Motivation

As you may know, Kedro allows users to define datasets in their project in a data catalog using YAML syntax. A dataset definition in the catalog looks something like this:

bikes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/bikes.csv

Having a central repository for all of your datasets provides full visibility into the data landscape of your project. Furthermore, the YAML syntax is beginner-friendly and can be useful as a communication tool between your data engineers, data scientists and non-technical stakeholders.

On the other hand, there are a number of shortcomings as well:

  • No native IDE support for the YAML syntax. Last year, one of our long-time contributors, @mzjp2, added JSONSchema support to the catalog, which provides autocompletion and validation in popular IDEs such as VSCode and PyCharm. However, the schema still needs to be manually updated with every Kedro release.
  • A huge catalog is hard to grok in YAML. A lot of Kedro projects deal with hundreds of datasets, and managing a thousand-line YAML file is not fun. There are ways to break the YAML up, e.g. using a modular pipeline structure in the conf/ directory (see the sketch after this list; I might write more about this in the future), but complicated YAML is still very error-prone.
  • You can't introduce logic in the YAML file. For example, if you are dealing with datasets from 20 different countries, each of them needs its own entry in the catalog, which is very repetitive work:
uk_bikes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/uk/bikes.csv

us_bikes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/us/bikes.csv

# repeat this for all other countries
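
(Coming back to the second point: the catalog is loaded with patterns like catalog* and, if I remember correctly, catalog*/**, so one common workaround for a huge catalog is to split it into a directory of smaller YAML files, roughly like the sketch below. The file names are purely illustrative.)

conf/base/catalog/
  data_engineering.yml   # datasets used by the data_engineering pipeline
  data_science.yml       # datasets used by the data_science pipeline
  reporting.yml          # datasets used by the reporting pipeline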

The idea

Having said all that, wouldn't it be cool to be able to fall back to a Pythonic syntax to declare your data catalog? IDE support, nested structure, dynamic scripting, etc. would all come out of the box for free.

Note: there is a typo in the screencast above. The filepath should be data/01_raw/iris.csv. I'm just too lazy to re-record it

Well, as it turns out, this is relatively easy to enable with less than 20 lines of code!

Implement a Pythonic DataCatalog

In the demo above, there is a catalog.py file in your project directory. This file contains dataset definitions as classes. For example:

from kedro.extras.datasets import pandas


class ExampleIrisData:
    name = "example_iris_data"
    type = pandas.CSVDataSet
    filepath = "data/01_raw/iris.csv"

CATALOG = [ExampleIrisData]

New Pythonic catalog definition

example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv

Current YAML catalog definition

As you can see, the Pythonic catalog looks almost exactly the same as the current YAML syntax, with the exception of the additional name property. To make the new catalog work, you simply need to write a custom ConfigLoader that parses the CATALOG definition in the catalog.py file in addition to parsing the YAML configuration, and merges them together. Below is the full implementation of this ConfigLoader:

import inspect
from typing import Any, Dict
from kedro.config.templated_config import TemplatedConfigLoader
from demo_kedro_pythonic_catalog.catalog import CATALOG


def _get_pythonic_catalog_config() -> Dict[str, Any]:
    """Parse datasets defined as Pythonic classes from catalog.py"""
    catalog_config = {}
    for dataset_definition_class in CATALOG:
        dataset_conf = {}
        dataset_name = getattr(dataset_definition_class, "name")
        for field_name, field_value in inspect.getmembers(dataset_definition_class):
            # ignore all functions, methods and magic members of the class
            if (
                inspect.isroutine(field_value)
                or field_name.startswith("__")
                or field_name == "name"
            ):
                continue

            # turn the dataset type into a string
            if field_name == "type":
                dataset_conf[
                    field_name
                ] = f"{field_value.__module__}.{field_value.__name__}"
            else:
                dataset_conf[field_name] = field_value

        catalog_config[dataset_name] = dataset_conf
    return catalog_config


class PythonicCatalogConfigLoader(TemplatedConfigLoader):
    """Custom ConfigLoader to load datasets defined as Pythonic classes from catalog.py"""

    def get(self, *patterns: str) -> Dict[str, Any]:
        """
        Overwrite the `get` method to read catalog configuration
        from the catalog.py module and merge it with the default YAML configuration.
        """
        # When parsing catalog configuration, parse the additional Pythonic
        # catalog configuration and merge it with the default.
        if "catalog*" in patterns:
            pythonic_catalog_config = _get_pythonic_catalog_config()
        else:
            pythonic_catalog_config = {}

        default_config = super().get(*patterns)
        default_config.update(pythonic_catalog_config)
        return default_config

The main idea behind this implementation is simple: when loading the catalog configuration, the loader reads the imported CATALOG, converts every dataset class it contains into a dict, and merges the result with the default configuration from the conf/ directory. Hopefully, the conversion logic is self-explanatory. You can register this ConfigLoader with the project as usual through the register_config_loader hook in hooks.py:

class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
        return PythonicCatalogConfigLoader(conf_paths)
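
If your project does not already register ProjectHooks, a minimal sketch of the registration (assuming the standard 0.17.x project template, where hooks are declared in settings.py) could look like this:

# settings.py (assuming the standard 0.17.x project template)
from demo_kedro_pythonic_catalog.hooks import ProjectHooks

HOOKS = (ProjectHooks(),)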

To confirm that it works, we can list the catalog for the __default__ pipeline and should see example_iris_data in the output:

$ kedro catalog list --pipeline="__default__"

DataSets in '__default__' pipeline:
  Datasets mentioned in pipeline:
    CSVDataSet:
    - example_iris_data
    DefaultDataSet:
    - example_model
    - example_test_x
    - example_train_y
    - example_train_x
    - example_test_y
    - example_predictions

Demo of benefits

As mentioned in the Motivation section, one of the problems of the YAML syntax is that the catalog definition can't be scripted. Now that we have the full weight of Python behind our catalog definition, it's very easy to create catalog entries dynamically. For example, let's create a dataset for each country in a given list:

for country in ["uk", "us", "au", "fr", "es"]:
    dataset_class_name = f"{country.upper()}ExampleIrisData"
    dataset_name = f"{country}_example_iris_data"
    dataset_def = type(
        dataset_class_name,
        (),
        {
            "name": dataset_name,
            "type": pandas.CSVDataSet,
            "filepath": f"data/01_raw/{country}/iris.csv",
        },
    )
    CATALOG.append(dataset_def)

Now when listing the catalog again, you should see these datasets automagically appear:

$ kedro catalog list --pipeline="__default__"

Datasets mentioned in pipeline:
    CSVDataSet:
    - example_iris_data
    DefaultDataSet:
    - example_predictions
    - example_model
    - example_train_y
    - example_test_y
    - example_test_x
    - example_train_x
  Datasets not mentioned in pipeline:
    CSVDataSet:
    - us_example_iris_data
    - au_example_iris_data
    - uk_example_iris_data
    - es_example_iris_data
    - fr_example_iris_data

Another benefit of having a catalog defined as code, as opposed to configuration, is that you get to package it with your project using kedro package.
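
As a quick sketch of what that means in practice (assuming the default 0.17.x project layout):

$ kedro package
# builds a .egg and .whl under src/dist/; catalog.py ships inside the
# demo_kedro_pythonic_catalog package alongside the pipeline code, while
# the YAML files under conf/ are not part of the build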

The full working Kedro project with this Pythonic catalog implementation can be found here.

Final thoughts

I hope you find this experiment useful. A keen reader might have realised that there is no support for environments yet, i.e. we can't create an environment-specific catalog.py because there is no env exposed to the register_config_loader hook, although we can certainly hack around it with an environment variable (see the sketch below). This is a pressing problem for a few users, so I think the team might fix it soon (maybe even in the coming 0.17.1 release). When that happens, I will package this up into a plugin.
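
For the curious, one (entirely unofficial) way to hack it would be to swap the CATALOG import at the top of the ConfigLoader module based on an environment variable; the variable name and the catalog_prod module below are purely hypothetical:

import os

# Hypothetical workaround until `env` is exposed to the hook: choose which
# catalog module to import based on an environment variable. Both the
# variable name and the catalog_prod module are made up for this sketch.
if os.environ.get("KEDRO_PYTHONIC_CATALOG_ENV") == "prod":
    from demo_kedro_pythonic_catalog.catalog_prod import CATALOG
else:
    from demo_kedro_pythonic_catalog.catalog import CATALOG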

Another thought: in this tutorial, we are essentially rolling our own serialisation mechanism for Python classes to keep things simple, but you can certainly use another serialisation tool such as pydantic as well.
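
For example, a minimal sketch of what a pydantic-based definition could look like (the IrisDataset model name is just illustrative, and .dict() would replace the inspect-based conversion above):

from kedro.extras.datasets import pandas
from pydantic import BaseModel


class IrisDataset(BaseModel):
    name: str = "example_iris_data"
    type: str = f"{pandas.CSVDataSet.__module__}.{pandas.CSVDataSet.__name__}"
    filepath: str = "data/01_raw/iris.csv"


# pydantic takes care of the class-to-dict conversion that the
# inspect-based helper above does by hand
dataset = IrisDataset()
catalog_config = {dataset.name: dataset.dict(exclude={"name"})}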

Finally, if you have any feedback for this experiment, please let me know by sending an email to kedrozerotohero@gmail.com or head over to https://discourse.kedro.community/ as a bunch of us hang out there :)