Every organisation uses data products to gain business insights and support decision making. A data product can only succeed if it delivers the right data to the right decision maker at the right time. Providing relevant data at the desired time is quite challenging.
The data recency of any data product relies on its data pipeline. The pipeline is responsible for collecting data from various upstream systems, standardising it, and storing it in the data platform for business consumption, such as building ML models and analytical reports.
Pipeline robustness depends on various attributes, some of which relate to the source system and are outside the control of data engineers. Identifying such attributes and acknowledging them before setting up the pipeline makes data pipelines more stable.
At Cuelebre, while working with our clients, we focus on identifying such attributes and try to capture them as metrics to calculate a confidence index.
Here are some of the attributes and best practices we follow to keep external impact to a minimum:
1. Contracts and Communication:
The source system team and the data team should take a collaborative approach when integrating any pipeline.
* During source onboarding, the data team and source stakeholders should interact frequently to share knowledge about how the data is generated and how it is going to be consumed.
* Initial clarification can save hours of rework and help to implement the correct solution.
* A data contract should be drafted before starting any implementation and agreed upon by the stakeholders.
* The data contract should contain all relevant information, such as stakeholder contacts, source/target system technical and operational team contacts, escalation matrix, integration mechanism, and data generation frequency.
* Data contracts should be stored in a centralised place such as Confluence and be accessible to the respective teams.
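As an illustration, the contract fields listed above could be captured in a small machine-readable structure alongside the Confluence page. This is a minimal sketch, not a prescribed format: the class names, field names, and the `missing_fields` helper are all hypothetical and would need to be adapted to what the teams actually agree on.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class Contact:
    """One person or team named in the contract (hypothetical shape)."""
    name: str
    email: str
    role: str  # e.g. "stakeholder", "technical", "operational"


@dataclass
class DataContract:
    """Hypothetical container for the items a data contract should list."""
    source_system: str
    target_system: str
    contacts: list[Contact]
    escalation_matrix: list[str]  # ordered escalation levels, first contact first
    integration_mechanism: str    # e.g. "SFTP file drop", "REST API"
    generation_frequency: str     # e.g. "daily", "hourly"

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    draft = DataContract(
        source_system="crm",
        target_system="data-platform",
        contacts=[Contact("A. Example", "a.example@example.com", "stakeholder")],
        escalation_matrix=[],  # not agreed yet
        integration_mechanism="SFTP file drop",
        generation_frequency="daily",
    )
    print(draft.missing_fields())
```

A check like `missing_fields` can be run before implementation starts, so that an incomplete contract is noticed early rather than during an incident.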
2. Source System Data Management Insights:
Information about source system governance, such as who manages the data and how it is modelled on the source side, can help to onboard data efficiently. The data team should also understand other important points, such as:
* Is bulk data load available, for example to load historical data?
* Does the source system use a master data management (MDM) system to control dimensions?
* How would changes to MDM data impact the data pipelines?
3. Agreed SLA on data delivery:
A source system data contract is an agreement between the source team owner and the team ingesting data on points such as:
* What kind of data is shared (attribute level)?
* What kind of integration mechanism is used?
* When is the data available?
* Is it available on a fixed schedule or can it be accessed at any time?
* What is the maximum delay in data availability that the pipeline can expect?
If the above details are documented and available, they help the development and operations teams to design and manage the data pipeline efficiently.
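Once a schedule and a maximum delay are agreed, the pipeline can check each delivery against them. The sketch below assumes a fixed expected delivery time and a single tolerated delay. The function name and parameters are hypothetical and not tied to any particular scheduler.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_delivery_sla_breached(
    expected_at: datetime,
    max_delay: timedelta,
    arrived_at: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True if the data arrived, or is still missing, after the deadline.

    The deadline is the expected delivery time plus the agreed maximum delay.
    """
    deadline = expected_at + max_delay
    if arrived_at is None:
        # Data has not arrived yet: breached only once the deadline has passed.
        return now > deadline
    return arrived_at > deadline


if __name__ == "__main__":
    expected = datetime(2024, 1, 1, 6, 0)
    tolerance = timedelta(hours=2)
    late_arrival = datetime(2024, 1, 1, 8, 30)
    print(is_delivery_sla_breached(expected, tolerance, late_arrival, late_arrival))
```

Passing `now` explicitly instead of reading the clock inside the function keeps the check easy to test.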
4. Agreed SLA on data quality:
A data product built on poor-quality data cannot be trusted. A data quality SLA is crucial to improving the value of a data product.
* If the source and target systems agree on a certain level of data quality, the data engineering team can apply automated checks and constraints to enforce it.
* Example: if a key column contains 5% nulls in a certain batch, that batch should be rejected and treated as an SLA breach.
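The null-rate example above can be automated with a small check that runs before a batch is loaded. This is a minimal sketch with two assumptions: a batch is a list of dictionaries, and a batch is rejected when the null rate reaches the agreed threshold (5% or more). The function names are hypothetical.

```python
from typing import Any, Dict, List


def null_rate(rows: List[Dict[str, Any]], key: str) -> float:
    """Fraction of rows where the key column is missing or None."""
    if not rows:
        # Assumption: an empty batch has no nulls; handle emptiness separately if needed.
        return 0.0
    nulls = sum(1 for row in rows if row.get(key) is None)
    return nulls / len(rows)


def accept_batch(rows: List[Dict[str, Any]], key: str, max_null_rate: float = 0.05) -> bool:
    """Accept the batch only if the null rate stays below the agreed threshold."""
    return null_rate(rows, key) < max_null_rate


if __name__ == "__main__":
    # 100 rows, 5 of them with a null customer_id: exactly at the 5% threshold.
    batch = [{"customer_id": None}] * 5 + [{"customer_id": i} for i in range(95)]
    if not accept_batch(batch, "customer_id"):
        print("Batch rejected: data quality SLA breached")
```

Whether the threshold itself counts as a breach is something the data contract should state explicitly. Here it does.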
5. Security Aspects:
Data security should be treated as the top priority in any type of integration with the source system. The following attributes can be used to evaluate data security:
* What security mechanisms are used to authenticate source/target systems?
* Is the data encrypted in transit and at rest?
* How are certificates and passwords managed?
* What type of network access to source system is supported (public vs VPN)?
6. Change Notification Process:
Without a change notification process, any change on the source side can impact your data pipelines. The source system and the data ingestion team should agree on a process for communicating changes on the source side, such as:
* How are schema changes notified?
* What is the agreed period of prior notice?
* If there is an ongoing incident on the source side, how will it be notified?
Keeping track of the above key points can help to mitigate various risks and improve data product quality. Data engineers should be aware of the source system and of how the data is produced and consumed. Defining SLAs can reduce friction between the source and data ingestion teams.
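A notification process can be backed by a simple safety net on the ingestion side that compares the schema agreed in the contract with the schema actually observed in a delivery. The sketch below assumes schemas are represented as column-name to type-name mappings. The `diff_schema` function and the example columns are hypothetical.

```python
from typing import Dict, List


def diff_schema(agreed: Dict[str, str], observed: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two column-name -> type-name mappings and report the differences."""
    agreed_cols, observed_cols = set(agreed), set(observed)
    return {
        "added": sorted(observed_cols - agreed_cols),
        "removed": sorted(agreed_cols - observed_cols),
        "retyped": sorted(
            col for col in agreed_cols & observed_cols if agreed[col] != observed[col]
        ),
    }


if __name__ == "__main__":
    agreed = {"id": "int", "name": "str", "amount": "float"}
    observed = {"id": "int", "name": "str", "amount": "str", "currency": "str"}
    changes = diff_schema(agreed, observed)
    if any(changes.values()):
        print("Unannounced schema change detected:", changes)
```

A non-empty result without a prior notification indicates that the agreed process was not followed and can be raised with the source team.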