Integrating Databricks Azure Data Platform with Terraform: A Step-by-Step Guide
Why We Need a Data Platform
A data platform is a foundational block in every data-driven organisation. As the name indicates, it is a place that facilitates all data activity for various users such as data engineers, data analysts, and data scientists. A data platform can be implemented on on-premises, cloud, or hybrid infrastructure. AWS, Azure, and GCP are the major cloud providers in the market, and an organisation may use a single cloud or multiple clouds as per its needs. In this article we will see how we created a data platform for our internal analytics team.
From Gartner: "Most organisations adopt a multi-cloud strategy out of a desire to avoid vendor lock-in or to take advantage of best-of-breed solutions."
As a data platform involves many services and complex settings, including networking and resource access, it is good practice to use Infrastructure as Code (IaC) to build and maintain it.
Infrastructure As Code
Infrastructure as Code (IaC) is a method of managing and provisioning infrastructure using code rather than manual procedures.
By employing IaC, infrastructure specifications are defined in configuration files, which simplifies configuration modifications and sharing. This approach guarantees consistent environment provisioning every time. Documenting and encoding configuration requirements with IaC facilitates configuration management and prevents undocumented and arbitrary configuration alterations.
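As a minimal illustration of an infrastructure specification defined in code, the Terraform snippet below (the resource name and location are our own, purely illustrative) declares a single Azure resource group in a plain-text file that can be version-controlled and reviewed like any other code:

```hcl
# Infrastructure described declaratively: applying this file
# creates (or updates) the resource group to match the spec.
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "demo" {
  name     = "rg-iac-demo"   # illustrative name
  location = "westeurope"
}
```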
- Consistency: IaC lets you define infrastructure in code, version-control it, test it, and deploy it consistently across environments, avoiding errors from manual configuration and ensuring uniformity.
- Automation: IaC automates the provisioning, configuration, and management of infrastructure, enabling quick spin-up of instances, resource scaling, and updates without human intervention.
- Reusability: IaC enables defining infrastructure components as reusable modules, promoting collaboration and consistency and reducing duplication of effort across teams and projects.
- Scalability: IaC enables scalable and elastic infrastructure, allowing swift and automated resource scaling on demand and facilitating a prompt, efficient response to workload changes without manual intervention.
- Speed: IaC enables faster infrastructure deployment by automating manual provisioning and configuration tasks, leading to quicker delivery of new features and services to customers.
- Collaboration: IaC facilitates team collaboration, reducing silos and improving communication through shared infrastructure configurations, resulting in more effective and efficient infrastructure management.
IaC and DevOps:
Infrastructure as Code (IaC) is an essential part of implementing DevOps practices and continuous integration/continuous delivery (CI/CD) for streamlined application deployment. With IaC, provisioning work is taken away from developers, enabling them to execute a script to have their infrastructure ready to go. This approach ensures that application deployments aren’t held up waiting for infrastructure and sysadmins aren’t managing time-consuming manual processes.
In CI/CD, automation and continuous monitoring are critical for success throughout the application lifecycle, from integration and testing to delivery and deployment. However, to achieve automation, the environment needs to be consistent. Automating application deployments is ineffective when development and operations teams use different approaches.
IaC also removes the need to maintain individual deployment environments with unique configurations that cannot be reproduced automatically. It ensures that the production environment remains consistent and avoids configuration drift.
DevOps best practices are also applied to infrastructure in IaC, allowing infrastructure to go through the same CI/CD pipeline as an application does during software development. This approach applies the same testing and version control to the infrastructure code, enabling streamlined infrastructure management.
Infrastructure As Code with Azure:
Azure natively provides ARM templates and Azure Bicep to support IaC for creating and managing Azure resources. ARM uses a JSON-based configuration that can be parametrised and reused, while Azure Bicep uses a declarative syntax that you treat like application code.
Bicep and ARM are best for an Azure-only data platform, but for multi-cloud we might need to look at other IaC solutions such as Ansible, Chef, or Terraform. In this article we will see how we used Terraform to set up our Azure data platform.
Terraform
Terraform provides Infrastructure as Code capabilities to build and manage cloud resources. It is open source and supports multiple clouds. For this article we assume the reader has a basic understanding of Terraform.
How We Set Up the Azure Data Platform with Terraform
At Cuelebre we internally work on various AI and analytics projects, which often requires adding new services and granting access to developers. Our platform team, which has limited resources, came up with an IaC solution to build and manage the platform.
Our Data & AI team uses an Azure data stack built on various analytics services, as shown in the figure below:
- Azure Data Lake Storage (Gen2): low-cost hierarchical object storage.
- Azure Data Factory: used for data ingestion and orchestration.
- Databricks: Spark-based modern compute supporting the lakehouse architecture.
- Unity Catalog: governance solution for data and AI assets.
- Azure Key Vault: Azure-based secret store for sensitive data.
Other than the primary services, the team also uses Azure Kubernetes Service clusters, Azure Container Registry, VMs, etc. In this article we will focus on the primary resources.
Dev, Test, and Prod environments need to be created with Azure Active Directory and RBAC-based access control.
We have set up the following modules to organise our IaC code:
- Core: reusable Terraform modules used by the env-specific repos.
- Env-specific repos: dev, test, and prod repos for env-specific implementation.
- Resource Access Manager: access configuration to manage users, groups, and resource access.
Naming Convention: consistent naming is a best practice for managing resources at scale. Microsoft provides Cloud Adoption Framework guidelines for naming Azure resources; you can find more details in that documentation.
{Org}-{Resource type}-{application}-{Environment}-{Instance}
Example: following this pattern, a (hypothetical) Data Factory instance for an analytics application in dev could be named DGB-ADF-ANALYTICS-DEV-001.
The following diagram shows how the naming looks for resources across all environments.
Setting up Terraform for Azure Infra:
Authentication: Terraform supports a number of methods for authenticating to Azure, such as the Azure CLI, a service principal, and managed identity. This article uses the recommended approach, i.e. a service principal with a client secret.
Remote State : Terraform must store state about your managed infrastructure and configuration. This state is used by Terraform to map real world resources to your configuration, keep track of metadata, and to improve performance for large infrastructures.
Terraform writes the state data to a remote data store, in this case ADLS storage, which is useful when multiple team members work on the same code.
Configure the remote state store: Azure ADLS provides an access key that can be used to authenticate and authorise Terraform when it writes state information to the remote storage.
We have used a common script built on the Azure CLI to create the resources not managed by Terraform. ResourceGenerator.sh does the following (a sketch follows the list):
- Create the resource group for the required environment.
- Create the ADLS Gen2 storage account and an environment-specific container for the Terraform remote state store.
- Create the key vault that stores the service principal secret and the state store key.
- Create the service principals required for the various environments.
- Store the service principal secrets in Azure Key Vault.
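A minimal sketch of what ResourceGenerator.sh could look like for the dev environment; the actual script is not reproduced here, and the names, location, and subscription placeholder below are illustrative:

```bash
#!/usr/bin/env bash
set -euo pipefail

ENV="DEV"                  # target environment
RG="DGB-RG-${ENV}-001"     # resource group, following the naming convention above
SA="dgbstdev001"           # ADLS Gen2 account for remote state (hypothetical)
KV="DGB-KV-${ENV}-001"     # key vault for secrets (hypothetical)

# 1. Resource group for the environment
az group create --name "$RG" --location westeurope

# 2. ADLS Gen2 storage account and env-specific container for Terraform state
az storage account create --name "$SA" --resource-group "$RG" \
  --kind StorageV2 --sku Standard_LRS --hns true
az storage container create --name "tfstate-${ENV,,}" --account-name "$SA"

# 3. Key vault for the service principal secret and state store key
az keyvault create --name "$KV" --resource-group "$RG"

# 4. Service principal for Terraform, scoped to the resource group
SP_SECRET=$(az ad sp create-for-rbac --name "DGB-SP-${ENV}-001" \
  --role Contributor \
  --scopes "/subscriptions/<subscription-id>/resourceGroups/${RG}" \
  --query password --output tsv)

# 5. Store the service principal secret in the key vault
az keyvault secret set --vault-name "$KV" \
  --name "sp-client-secret" --value "$SP_SECRET"
```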
Once we have created all the resources required to set up Terraform, let us see how to configure Terraform using them.
- Authentication: azurerm and azuread need to be configured with the service principal details. We achieve this using Terraform variables; since these are sensitive values, we supply them via environment variables rather than committing them to code. Passing values to Terraform variables can look like the sketch below.
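A minimal sketch of how the two providers could be wired to the service principal details through variables; the variable names here are our own choice, not prescribed by Terraform:

```hcl
variable "client_id" { type = string }
variable "tenant_id" { type = string }
variable "subscription_id" { type = string }
variable "client_secret" {
  type      = string
  sensitive = true
}

# azurerm manages Azure resources; azuread manages AAD objects.
provider "azurerm" {
  features {}
  client_id       = var.client_id
  client_secret   = var.client_secret
  tenant_id       = var.tenant_id
  subscription_id = var.subscription_id
}

provider "azuread" {
  client_id     = var.client_id
  client_secret = var.client_secret
  tenant_id     = var.tenant_id
}
```

The values themselves can then be supplied through TF_VAR_ environment variables (explained below) so they never land in version control.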
- Configuring the remote backend: we have to provide the storage account name, container name, and key (the blob path) to configure the remote state store, with the access key supplied through the TF_VAR_access_key environment variable.
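A sketch of the backend block, with hypothetical storage names. Note that Terraform backend blocks cannot interpolate variables, so the access key itself has to arrive out of band, for example via the ARM_ACCESS_KEY environment variable or a -backend-config argument to terraform init:

```hcl
terraform {
  backend "azurerm" {
    storage_account_name = "dgbstdev001"            # hypothetical account
    container_name       = "tfstate-dev"            # env-specific container
    key                  = "dev.terraform.tfstate"  # blob path for the state file
    # The access key is deliberately not hard-coded here; supply it via
    # ARM_ACCESS_KEY or `terraform init -backend-config="access_key=..."`.
  }
}
```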
The TF_VAR_ prefix is used to expose environment variables to Terraform as input variables. For example, if you need to use myvar in Terraform, you can export TF_VAR_myvar in your shell and myvar becomes available to Terraform (it must still be declared as a variable).
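For example (the variable name is arbitrary):

```bash
# Make the value visible to Terraform as var.myvar.
# The variable must still be declared in variables.tf:
#   variable "myvar" { type = string }
export TF_VAR_myvar="some-value"
terraform plan   # the configuration can now reference var.myvar
```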
Building the Data Platform:
platform_dev is the repo that contains the code responsible for creating/provisioning resources in the Azure cloud. The Terraform project consists of the following files:
- global.tf: Provider and backend config.
- dev_terraform.tfvars: values for the variables.
- variables.tf: defines the variables used in the project.
- main.tf: resource provisioning logic for the data lake.
main.tf creates the following resources on Azure (a sketch follows the list):
- Azure Data Lake Storage (Gen2)
- Azure Data Factory
- Databricks
- Azure Key Vault
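A trimmed sketch of what the resource blocks in main.tf could look like; the resource group is looked up rather than created (it comes from ResourceGenerator.sh), and all names below are illustrative:

```hcl
# Resource group created outside Terraform by ResourceGenerator.sh
data "azurerm_resource_group" "dev" {
  name = "DGB-RG-DEV-001"
}

# ADLS Gen2: a storage account with the hierarchical namespace enabled
resource "azurerm_storage_account" "datalake" {
  name                     = "dgbdlsdev001"
  resource_group_name      = data.azurerm_resource_group.dev.name
  location                 = data.azurerm_resource_group.dev.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true
}

resource "azurerm_data_factory" "adf" {
  name                = "DGB-ADF-DEV-001"
  resource_group_name = data.azurerm_resource_group.dev.name
  location            = data.azurerm_resource_group.dev.location
}

resource "azurerm_databricks_workspace" "dbx" {
  name                = "DGB-DBW-DEV-001"
  resource_group_name = data.azurerm_resource_group.dev.name
  location            = data.azurerm_resource_group.dev.location
  sku                 = "premium"   # premium tier is required for Unity Catalog
}

resource "azurerm_key_vault" "kv" {
  name                = "DGB-KV-DEV-002"
  resource_group_name = data.azurerm_resource_group.dev.name
  location            = data.azurerm_resource_group.dev.location
  tenant_id           = var.tenant_id
  sku_name            = "standard"
}
```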
Organising Code in Git Repos
We have used environment-specific repositories, as each environment can have its own build-and-manage frequency and requirements. A setup could also use a single repo and handle all environments with env-specific variables. We keep our admin scripts and other utilities in platform_core. platform_tfmodules contains the reusable modules and is imported into the other Terraform repositories. platform_dev contains the Terraform code for creating development resources. We also have one more repository, platform_ad, for assigning access to users by mapping them to Azure AD groups.
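For instance, platform_dev might pull a module from platform_tfmodules like this (the git URL, module path, and module inputs are placeholders, not the real repository):

```hcl
module "datalake" {
  # "//datalake" selects a subdirectory of the modules repo;
  # "ref" pins the module to a tag for reproducible builds.
  source = "git::https://example.com/cuelebre/platform_tfmodules.git//datalake?ref=v1.0.0"

  environment         = "dev"
  resource_group_name = data.azurerm_resource_group.dev.name
}
```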
Here we discuss only the creation of the dev environment, to keep this article in scope. Below are the repos with their respective service principals (used in authentication).
*Test and Prod environments are not included in this article or in the code.
Access Control for Terraform and the Data Lake Platform:
The following figure shows how access for the overall platform looks.
Service principals:
- As mentioned earlier, we have the service principal DGB-SP-{ENV}-001, which is used for Terraform authentication.
- The DGB-AD-PROD-001 service principal is used for managing resource access.
Azure Active Directory roles:
- SUP-ADMINS: has access at the subscription level; a very restricted group.
- DGB-AAD-DL-ADMIN-{ENV}: users in this group can access and manage the resources within the resource group, for example DEV-Admin and Test-Admin.
- DGB-AAD-DL-USER-{ENV}: users in this group can access the resources but cannot create any.
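A sketch of how platform_ad could map one of these AAD groups to a role on the dev resource group; the role name below is an assumption, as the article does not specify which built-in roles are used:

```hcl
# Resource group the assignment is scoped to
data "azurerm_resource_group" "dev" {
  name = "DGB-RG-DEV-001"
}

# Look up the existing AAD group by its display name
data "azuread_group" "dl_admin_dev" {
  display_name = "DGB-AAD-DL-ADMIN-DEV"
}

# Grant the group management rights scoped to the dev resource group
resource "azurerm_role_assignment" "dl_admin_dev" {
  scope                = data.azurerm_resource_group.dev.id
  role_definition_name = "Contributor"   # assumed built-in role
  principal_id         = data.azuread_group.dl_admin_dev.object_id
}
```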
All together:
With the full setup, including the Test and Prod environments, our data analytics platform looks like this.
Code Contributor: Mohit Dungarani