Analytics Engineering on the Lakehouse using DBT & Databricks (part-1 )

Anuj Sen
5 min readMar 26, 2023

Introduction :

Lakehouse is a data management system that merges the advantages of both data lakes and data warehouses. It facilitates the processing of both structured and unstructured data on a single platform, allowing faster analysis and insights. Data is first ingested into a data lake, where it’s stored in its raw format. It’s then transformed and structured in real-time using a processing layer such as Apache Spark, and stored in a data warehouse that can be easily queried using SQL. The lakehouse architecture is increasingly popular due to its cost-effectiveness and flexibility in handling large volumes of data.

Databricks provides lakehouse features by combining data lake and data warehousing capabilities on a unified platform that uses Apache Spark and Delta Lake. Delta Lake offers ACID transactions, schema enforcement, and indexing, making it suitable for managing both structured and unstructured data in a lakehouse architecture. Databricks also provides Spark SQL and Structured Streaming tools for querying data and real-time processing of data streams. By offering a complete solution for data processing, Databricks enables organizations to build cost-effective and flexible lakehouse architectures that can handle large volumes of data and support real-time analytics and machine learning.

The Databricks Lakehouse Platform greatly simplifies data architectures by
combining the data management and performance typically found in data
warehouses with the low-cost, flexible object stores offered by data lakes.

DBT: it is a configuration driven framework for data transformation. Dbt gives a structured approach for analytics combining software engineering best practices like version controlling , Testing , CICD. Transformations described in form of standard SQL and dbt handles the rest.

DBT Feature :

  1. Configuration Driven

One of the key features of dbt is its configuration-driven approach to data transformation and processing.

Users can define their data pipeline using YAML configuration files, which makes it easy to set up, configure, and maintain their data pipeline.

This approach allows users to create and manage complex data pipelines with ease and can help teams maintain consistency across projects.

2. Integration with SQL, Jinja, and Python

Another important feature of dbt is its integration with SQL, Jinja, and Python.

Users can use SQL to define data transformations.

Jinja to create reusable templates and macros, and Python to extend the functionality of dbt.

This integration allows users to create complex and flexible data pipelines that can handle a variety of data sources and formats

3. Modularity and incrementality

dbt also provides modularity and incrementality, allowing users to break down their data pipeline into smaller, more manageable parts.

This approach makes it easier to test and validate data transformations, maintain the pipeline over time, and troubleshoot issues when they arise.

Incrementality allows users to update only the parts of the pipeline that have changed, reducing the time and effort needed to update the entire pipeline

Responsibility Matrix in Lake house

In this article we will go through step by step to perform analytics on Databricks platform using dbt.dbt comes with 2 offerring

  1. opensource dbt core
  2. dbt cloud a managed service

in this article we limit our scope for dbt core.

Set up : this article is usages mac as os, visual studio code as IDE. python , git and Databricks are prerequisites for this article

DBT Project and Env setup (one time at project level )

  1. create an empty folder and open that in vs code.
  2. Initialise and activate virtual env
python3.9 -m venv dbt_databricks
source dbt_databricks/bin/activate

3. install dbt-databricks

pip install dbt-databricks

4. confirm dbt installed successfully

dbt - version

5. init dbt project

dbt init

Note : dbt init will ask following details [more]

  • name of project (in this article we used dbtdemo)
  • database type (1 for databricks)
  • host :(host of databricks workspace we will use from env variable )
  • http_path: http path provided by cluster or sql server
  • token: token to authenticate datbricks workspace
  • Catalog : name of unity cataloge if enabled else not required
  • schema : name of default schema

above details will be saved in user home dir {user home}/.dbt/profiles.yml we will override the details for better approach

after above steps project structure will look like

6. using custom profile dir create profiles dir inside the dbtdemo project and add profiles.yml file. update required details and notice the host http_path and token value. we will provide these value from env variable .

7. Providing env value

export DBT_DW_HOST=adb-XXXXXXXXXXX.xx.azuredatabricks.net
export DBT_DW_PATH= sql/protocolv1/o/xxxxxxxxxxx/xxxx–xxx-xxx
export DBT_DW_TOKEN=dapxxxxxxxxxxxxxxxx-x

8. Connection Test

cd dbtdemo
dbt debug - profiles-dir ./profiles/

9. install dbt packages

A dbt package is a collection of dbt reusable code that are published and shared through the dbt hub. Packages can be created and shared by individual users or companies and can include custom macros, models, and other functionality.

we will use dbt package to define external table source on azure data lake storage. and dbt utils package to use some utility functions.

we will add packages.yml with package name and run following command

dbt deps

Now we have set up the dbt environment, In next part we will develop our first pipeline using above set up .

Note : dbt cloud can save you from doing all above hard work and provides end to end integrated environment .

--

--

Anuj Sen

Software Engineer | Big data | Student and Mentor