How to inject secrets into your Kedro configuration

Introduction

One of the most common questions I receive when supporting a Kedro project is:

How do I inject secrets into Kedro configuration, e.g. how do I define a database connection string containing a username and password for a pandas.SQLTableDataSet?

This is a good question because you are not supposed to hard-code secrets as plaintext and check them into version control. If your infrastructure is set up correctly, secrets will be distributed to your machine or container at runtime in some convenient format, such as environment variables or a file in a mounted volume. In this tutorial, we will look at how to form a database connection string from database credentials (one common type of secret) stored in environment variables, and inject it into a pandas.SQLTableDataSet.

Prerequisite

Let's spin up a database with some test data to play with. The easiest way to do this is with Docker. For example, the following command starts a Postgres instance with kedro as the username, mysecretpassword as the password and test-kedro as the database name.

docker run --name test-kedro-postgres \
  -e POSTGRES_USER=kedro \
  -e POSTGRES_PASSWORD=mysecretpassword \
  -e POSTGRES_DB=test-kedro \
  -p 5432:5432 \
  -d postgres

Now let's add some test data to the database. Create a file called preload.sql with the following content (copied from codefresh.io):

CREATE TABLE link (
   ID serial PRIMARY KEY,
   url VARCHAR (255) NOT NULL,
   name VARCHAR (255) NOT NULL,
   description VARCHAR (255),
   rel VARCHAR (50)
);

INSERT INTO link (url, name)
VALUES
 ('http://www.google.com','Google'),
 ('http://www.azure.microsoft.com','Azure'),
 ('http://www.codefresh.io','Codefresh');

Then load it into the database using psql:

psql -h localhost -U kedro test-kedro < preload.sql

Note that when prompted for a password, you should enter mysecretpassword as defined in the earlier step. Now we are ready to write a Kedro DataCatalog entry to read this test data with pandas.SQLTableDataSet.

Defining the dataset

In the conf/base/catalog.yml file, add the following dataset definition:

test_kedro_postgres_dataset:
  type: pandas.SQLTableDataSet
  table_name: link
  credentials: test_kedro_postgres_credentials

This tells Kedro to use pandas.SQLTableDataSet to read data from the link table. To define the credentials, edit the credentials file at conf/local/credentials.yml with the following content:

test_kedro_postgres_credentials:
  con: "postgresql://kedro:mysecretpassword@localhost/test-kedro"
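A caveat on the connection string above: if your password ever contains characters such as @ or /, embedding it verbatim makes the URL ambiguous. A minimal sketch of building a safely escaped string using the standard library's urllib.parse.quote_plus (the helper name build_pg_url is made up for illustration; it is not part of Kedro):

```python
from urllib.parse import quote_plus


def build_pg_url(user: str, password: str, host: str, dbname: str) -> str:
    # quote_plus percent-encodes characters like @ or / that would
    # otherwise be parsed as part of the URL structure
    return f"postgresql://{user}:{quote_plus(password)}@{host}/{dbname}"


print(build_pg_url("kedro", "mysecretpassword", "localhost", "test-kedro"))
# postgresql://kedro:mysecretpassword@localhost/test-kedro
```

With our simple password the output is unchanged, but a password like p@ss/word would be encoded as p%40ss%2Fword, keeping the URL parseable.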

This gives Kedro the value of test_kedro_postgres_credentials, which we referenced earlier in the dataset definition. Don't worry that we are hard-coding the connection string in the credentials file for now; we will fix it in the next section. For now, let's launch a Kedro IPython session with the kedro ipython command to test out our dataset:

In [1]: catalog.load("test_kedro_postgres_dataset")
2021-01-28 11:41:29,200 - kedro.io.data_catalog - INFO - Loading data from `test_kedro_postgres_dataset` (SQLTableDataSet)...
Out[1]:
   id                             url       name description   rel
0   1           http://www.google.com     Google        None  None
1   2  http://www.azure.microsoft.com      Azure        None  None
2   3         http://www.codefresh.io  Codefresh        None  None

This tells us that our dataset definition is correct. Now is the time to fix the hard-coded credentials problem.

Use TemplatedConfigLoader to remove the hard-coded secrets from the credentials file

TemplatedConfigLoader is a mechanism in Kedro that lets you specify your configuration with placeholders that are interpolated with the correct values at runtime. For example, let's replace the credentials definition in conf/local/credentials.yml with:

test_kedro_postgres_credentials:
  con: "postgresql://${DB_USERNAME}:${DB_PASSWORD}@localhost/${DB_NAME}"

This tells Kedro that we will supply the values of DB_USERNAME, DB_PASSWORD and DB_NAME for this connection string at runtime.
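To build intuition for what this substitution does, the ${...} placeholder syntax can be mimicked with the standard library's string.Template. This is just an illustration of the substitution semantics, not Kedro's actual implementation:

```python
from string import Template

template = "postgresql://${DB_USERNAME}:${DB_PASSWORD}@localhost/${DB_NAME}"
values = {
    "DB_USERNAME": "kedro",
    "DB_PASSWORD": "mysecretpassword",
    "DB_NAME": "test-kedro",
}

# substitute() replaces each ${NAME} placeholder with its value
# and raises KeyError if any placeholder has no value
print(Template(template).substitute(values))
# postgresql://kedro:mysecretpassword@localhost/test-kedro
```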

Next, populate some environment variables with the correct values for these secrets:

export ENV_DB_USERNAME=kedro
export ENV_DB_PASSWORD=mysecretpassword
export ENV_DB_NAME=test-kedro

Finally, in the register_config_loader method in hooks.py where you would normally register a ConfigLoader, change the definition to use a TemplatedConfigLoader instead:

import os
from typing import Iterable

from kedro.config import TemplatedConfigLoader
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str]) -> TemplatedConfigLoader:
        return TemplatedConfigLoader(
            conf_paths,
            globals_dict={
                "DB_USERNAME": os.environ.get("ENV_DB_USERNAME"),
                "DB_PASSWORD": os.environ.get("ENV_DB_PASSWORD"),
                "DB_NAME": os.environ.get("ENV_DB_NAME"),
            },
        )

Voila! This reads the values of your secrets from environment variables, creates a TemplatedConfigLoader with these values and injects them into the templated credentials in the conf/local/credentials.yml file. If you launch a Kedro IPython session, you can load the data as before with catalog.load("test_kedro_postgres_dataset"), but this time there is no hard-coded connection string in your credentials configuration.
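One caveat: os.environ.get returns None when a variable is missing, which would silently produce a connection string like postgresql://None:None@.... A small, hypothetical helper (the name require_env is my own, not a Kedro API) that fails fast with a clear error instead:

```python
import os


def require_env(name: str) -> str:
    # Fail at startup with a clear message rather than letting a
    # missing variable turn into the string "None" inside the URL.
    value = os.environ.get(name)
    if value is None:
        raise KeyError(f"required environment variable {name} is not set")
    return value


os.environ["ENV_DB_USERNAME"] = "kedro"  # simulate the earlier export
print(require_env("ENV_DB_USERNAME"))
# kedro
```

You could call require_env instead of os.environ.get inside the globals_dict above, so that a misconfigured deployment fails immediately at config-loading time rather than at the first database query.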