Extend dbt#

As noted in data build tool (dbt), you can install your own custom dbt package to apply your own transforms to the event data in Aspects.

This guide demonstrates how to create and use a custom dbt package in Aspects by building the sample-aspects-dbt repo.

Step 1. Install dbt#

The easiest way to install dbt is to use pip in a python3 virtual environment.

# Create and activate a python3 virtual environment
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip

Check the aspects-dbt requirements.txt for the specific package versions used by Aspects. Both dbt-core and dbt-clickhouse are required.

# Install the dbt package versions used by Aspects
pip install dbt-clickhouse==x.x.x dbt-core==x.x.x

See Install dbt for more ways to install dbt.

Step 2. Set up new dbt package#

  1. Create a new repository for your custom dbt package, and clone it to your local machine.

  2. In the root of your local repository, create a new dbt package by following the prompts given by the dbt init tool. See About dbt init for more options.

    # Use the profile name from your ``dbt/profiles.yml`` file or 'aspects'
    dbt init --profile aspects
    
    # Enter a name for your project (letters, digits, underscore):
    sample_aspects_dbt
    
    ls sample_aspects_dbt
    # analyses/  dbt_project.yml  macros/  models/  README.md  seeds/  snapshots/  tests/
    

    Attention

    Possible error when running dbt init:

    No sample profile found for clickhouse
    

    Make sure the profile matches your local dbt/profiles.yml profile name.

    # This setting configures which "profile" dbt uses for this project.
    profile: "aspects"
    
  3. In dbt_project.yml, set the location for compiled SQL to match the aspects target path:

    # directory which will store compiled SQL files
    target-path: "target"
    
  4. Create a requirements.txt file at the top level of your repository with the same version of dbt-core as in Step 1. This is needed when building the aspects image.

    # requirements.txt
    dbt-core~=x.x.x
    

    Note

    If you need any python dependencies beyond what is provided by aspects-dbt, add these to the requirements file.

Step 4. Test dbt connection#

Before adding your custom transforms, it’s a good idea to test that your dbt package can connect to the Clickhouse database and run the base transforms from aspects-dbt.

  • Run dbt debug to test the connection to your database and the validity of your dbt project.

  • Run dbt run to run the base transforms from aspects-dbt.

    • You may need to run dbt run --full-refresh if the previous step fails.

Attention

Possible error when running dbt commands:

Compilation Error
dbt found 1 package(s) specified in packages.yml, but only 0 package(s) installed in dbt_packages.
Run "dbt deps" to install package dependencies.

Run dbt deps to install the dependencies for your package, including aspects-dbt.

Step 5. Create your custom transforms#

Here is where you will need an understanding of dbt, Clickhouse, Aspects’ data schemas, and the specific transforms you want to create.

Tip

You can use Aspects to debug your custom SQL:

  1. Login to Superset as an Open edX superuser.

  2. Using the menus at the top of the page, navigate to the “SQL -> SQL Lab” UI.

  3. Browse the schemas and run read-only SQL queries on your data.

For this tutorial, we created two new models - course_enrollments and learner_responses and which will be materialized by dbt into a view and materialized view in Clickhouse. More information on materialized views here.

Step 6. Add dbt tests#

Writing tests for your transforms is important to validate and document your intended changes, and guard against data edge cases and regressions from future code changes.

There are two types of dbt tests; data tests and unit tests. Run dbt test to run all tests for your package.

Data tests#

Data tests can be defined in the schema.yml file for each model, and are used to validate properties of the data such as types, accepted values, uniqueness, and relationships between tables.

- name: enrollment_mode
  description: "The enrollment mode of the user"
  data_tests:
    - accepted_values:
        values: ["audit", "honor", "verified"]

Data tests can also be defined in a SQL file, where the goal of the SQL statement is to return zero records.

select count(*) as num_rows
from {{ ref('enrollment_mode') }}
group by org, course_key, enrollment_mode, course_name, course_run
having num_rows > 1

Unit tests#

Unit tests are used to validate the logic of your dbt models in isolation from the underlying data. They are defined in a unit_tests.yaml file within the models directory.

Step 7. Install and use your dbt package#

Once you’ve pushed all the changes to your custom dbt package repo, you’re ready to use it.

Use tutor config save to update the following Tutor variables to use your custom package instead of the Aspects default.

  • DBT_REPOSITORY: The git repository URL of your custom dbt package.

    (Default: https://github.com/openedx/aspects-dbt)

  • DBT_BRANCH: The hash/branch/tag of your custom dbt package that you wish to use.

    (Default: the latest tagged version of aspects-dbt)

Optional
  • EXTRA_DBT_PACKAGES: Add any python packages that your dbt project requires here.

  • DBT_PROFILE_*: variables used in the Aspects dbt/profiles.yml file, including several Clickhouse connection settings.

  • DBT_SSH_KEY: The private SSH key to use when cloning the dbt project. Only necessary if you are using a private repository.

Once your package is configured in Tutor, you can run dbt commands directly on your deployment.

# Build and test your package
tutor dev do dbt -c "build"

# Deploy your customizations
tutor dev do dbt -c "run"

# Run data tests on the data
tutor dev do dbt -c "test"

# Run unit tests on the data
tutor dev do dbt -c "test --selector unit_tests"

# To push your new transformations to Superset SQL Lab
tutor dev do import-assets

See dbt commands for a full list of available commands.

Troubleshooting#

  • Tutor dbt commands may need to be run with --only_changed False to force a full dbt run if you have made changes to your dbt package that are not being picked up.

  • Don’t forget to push your changes to your repo before running the tutor dbt command: it fetches a clean copy of your configured package repo + branch each time it runs.

Next Steps#

With your custom dbt package, you can create new charts and dashboards in Superset using the new models created by your dbt transforms. See Add extra Superset assets for more information.

References#