Medallion architecture on Azure is a design pattern for organizing data in a lakehouse, not a standalone Azure product. It is typically implemented on services such as Azure Databricks or Microsoft Fabric, and it suits big data workloads because the storage underneath is highly available and durable.
Azure provides a range of storage options, including Azure Blob Storage, Azure Data Lake Storage, and Azure Files. These services store and manage data in different formats and sizes.
Because the pattern runs on Azure's scalable infrastructure, compute and storage can scale up or down with the workload. This means that you can start small and scale up as your data needs grow.
The layers also inherit Azure's built-in features, such as data encryption, access control, and data replication. These features help keep your data secure and available.
Design Patterns and Architecture
A medallion architecture is a data design pattern used to organize data logically, incrementally improving the structure and quality of data as it flows through each layer of the architecture.
This architecture is composed of three main layers: Bronze, Silver, and Gold. The Bronze layer is where raw data ingestion occurs, while the Silver layer performs data cleaning and validation. The Gold layer is where dimensional modeling and aggregation take place.
The Bronze layer is intended for data engineers, data operations, and compliance and audit teams, who use it for raw data ingestion. In contrast, the Silver layer is used by data engineers, data analysts, and data scientists for more refined datasets that still retain detailed information.
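Here's a breakdown of the layers and their intended users:
- Bronze: raw data ingestion. Used by data engineers, data operations, and compliance and audit teams.
- Silver: data cleaning and validation. Used by data engineers, data analysts, and data scientists.
- Gold: dimensional modeling and aggregation. Used by analysts and business users for reporting and dashboards.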
The medallion architecture is a recommended best practice, but not a requirement. It's designed to make data more suitable for business intelligence and machine learning applications by incrementally improving data quality and reliability.
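The flow through the three layers can be illustrated with a minimal sketch in plain Python. It assumes hypothetical order events and invented function names (`to_bronze`, `to_silver`, `to_gold`); a real implementation on Azure would more likely use Spark tables, so treat this only as an illustration of the pattern.

```python
from collections import defaultdict

# Hypothetical raw order events as they might arrive from a source system.
raw_events = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "2", "region": "US", "amount": "7.25"},
    {"order_id": "2", "region": "US", "amount": "7.25"},  # duplicate delivery
    {"order_id": "3", "region": "EU", "amount": "oops"},  # invalid amount
]

def to_bronze(events):
    """Bronze: keep every record exactly as received."""
    return list(events)

def to_silver(bronze_rows):
    """Silver: validate, cast types, and drop duplicate order IDs."""
    seen, silver_rows = set(), []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # fails validation, so it stays in bronze only
        if row["order_id"] in seen:
            continue  # duplicate of a row already accepted
        seen.add(row["order_id"])
        silver_rows.append(
            {"order_id": int(row["order_id"]), "region": row["region"], "amount": amount}
        )
    return silver_rows

def to_gold(silver_rows):
    """Gold: aggregate revenue per region for reporting."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

bronze = to_bronze(raw_events)
silver = to_silver(bronze)
gold = to_gold(silver)
```

Each layer is derived only from the one before it, so the raw records remain untouched in bronze even when silver rejects them.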
Data Ingestion and Processing
Data ingestion is the process of bringing raw data into your Azure environment. Ingested data lands in the bronze layer, where it is stored in its original format and consumed by the workloads that enrich data for silver tables.
The bronze layer contains raw, unvalidated data, which is appended incrementally and grows over time. It serves as the single source of truth, preserving the data's fidelity, and enables reprocessing and auditing by retaining all historical data.
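To show why retaining all history enables reprocessing, here is a small sketch that assumes bronze is a simple append-only list of raw lines. The names `append_batch`, `rebuild_silver`, and `parse_v2` are hypothetical and exist only for this example.

```python
bronze_table = []  # stand-in for an append-only bronze table

def append_batch(table, raw_lines):
    """Append raw records without modifying or deleting earlier rows."""
    table.extend(raw_lines)
    return len(table)

def rebuild_silver(table, parse):
    """Reprocess the full bronze history with a (possibly updated) parser."""
    return [parse(line) for line in table]

append_batch(bronze_table, ["42,EU", "43,US"])
append_batch(bronze_table, ["44,EU"])

def parse_v2(line):
    """A later revision of the parsing logic, replayed over all history."""
    order_id, region = line.split(",")
    return {"order_id": int(order_id), "region": region.lower()}

silver_rows = rebuild_silver(bronze_table, parse_v2)
```

Because bronze is never rewritten, a change in parsing rules only requires replaying the stored lines rather than re-extracting data from the source.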
To control costs, you can adjust the frequency of data ingestion. There are three main options: continuous incremental ingestion, triggered incremental ingestion, and batch ingestion with manual incremental ingestion.
Here's a breakdown of the relative cost and latency of each option:
- Continuous incremental ingestion: the most expensive option, with the lowest latency.
- Triggered incremental ingestion: a compromise between cost and latency.
- Batch ingestion with manual incremental ingestion: the most cost-effective option, with the highest latency.
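What makes ingestion "incremental" is that each run picks up only data it has not seen before. The sketch below assumes a hypothetical landing zone modeled as a dictionary of file names to records, and a checkpoint modeled as a set. The three options then differ mainly in how often a function like this runs.

```python
def incremental_ingest(available_files, checkpoint, bronze):
    """Ingest only files that are not yet recorded in the checkpoint."""
    new_files = sorted(name for name in available_files if name not in checkpoint)
    for name in new_files:
        bronze.extend(available_files[name])  # append the raw records
        checkpoint.add(name)                  # remember the file as processed
    return new_files

# Hypothetical landing zone: file name -> raw records in that file.
landing_zone = {"a.json": ["r1", "r2"], "b.json": ["r3"]}
checkpoint, bronze = set(), []

first_run = incremental_ingest(landing_zone, checkpoint, bronze)

landing_zone["c.json"] = ["r4"]  # a new file arrives between runs
second_run = incremental_ingest(landing_zone, checkpoint, bronze)
```

Running this in a tight loop approximates continuous ingestion, while calling it on a schedule or on demand trades latency for lower compute cost.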
As you move up the data processing stages, you'll refine and enrich your data through cleansing, deduplication, joining, and other transformations. This is often done in the silver layer, where data is refined and enhanced to make it ready for downstream usage.
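A minimal sketch of these silver-layer transformations follows. It assumes hypothetical order and customer records, and a "last record wins" deduplication rule, which is only one of several reasonable choices.

```python
def build_silver(orders, customers):
    """Cleanse, deduplicate, and enrich raw order rows."""
    customer_by_id = {c["customer_id"]: c for c in customers}
    latest = {}
    for row in orders:
        order_id = row["order_id"].strip()  # cleansing: trim whitespace
        if not order_id:
            continue                        # cleansing: drop rows without a key
        latest[order_id] = row              # deduplication: last record wins
    silver = []
    for order_id, row in latest.items():
        customer = customer_by_id.get(row["customer_id"], {})
        silver.append({
            "order_id": order_id,
            "amount": round(float(row["amount"]), 2),
            "customer_name": customer.get("name"),  # join: None when no match
        })
    return silver

orders = [
    {"order_id": " A1 ", "customer_id": "c1", "amount": "19.999"},
    {"order_id": "A1", "customer_id": "c1", "amount": "20.00"},
    {"order_id": "", "customer_id": "c2", "amount": "5"},
    {"order_id": "A2", "customer_id": "c9", "amount": "3.5"},
]
customers = [{"customer_id": "c1", "name": "Contoso"}]

silver_orders = build_silver(orders, customers)
```

The unmatched customer for order A2 is kept with an empty name rather than dropped, which keeps the gap visible for later quality checks.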
Data Analytics and Business Logic
Data analytics and business logic meet in the gold layer of the medallion architecture on Azure. This is where you model your data for reporting and analytics, in line with business logic and requirements.
This dimensional model establishes relationships and defines measures, allowing analysts to find domain-specific data and answer questions. The gold layer is optimized for performance in queries and dashboards.
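As a rough illustration of relationships and measures, the sketch below assumes a tiny star schema with one hypothetical dimension (`dim_region`) and one fact table (`fact_sales`). All names and values are invented for the example.

```python
# Dimension table: one row per region, keyed by a surrogate key.
dim_region = {
    1: {"region": "EU", "manager": "Ana"},
    2: {"region": "US", "manager": "Raj"},
}

# Fact table: one row per sale, related to the dimension through region_key.
fact_sales = [
    {"region_key": 1, "quantity": 2, "unit_price": 5.0},
    {"region_key": 1, "quantity": 1, "unit_price": 4.0},
    {"region_key": 2, "quantity": 3, "unit_price": 2.0},
]

def revenue(fact_rows):
    """Measure: total revenue over a set of fact rows."""
    return sum(row["quantity"] * row["unit_price"] for row in fact_rows)

def revenue_by(attribute):
    """Slice the revenue measure by any attribute of the region dimension."""
    groups = {}
    for row in fact_sales:
        key = dim_region[row["region_key"]][attribute]
        groups.setdefault(key, []).append(row)
    return {key: revenue(rows) for key, rows in groups.items()}

by_region = revenue_by("region")
by_manager = revenue_by("manager")
```

Defining the measure once and slicing it by dimension attributes is what lets analysts answer different questions from the same model.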
The gold layer consists of aggregated data tailored for analytics and reporting, which is often highly aggregated and filtered for specific time periods or geographic regions. It contains semantically meaningful datasets that map to business functions and needs.
Some customers create multiple gold layers to meet different business needs, such as HR, finance, and IT, as the gold layer models a business domain.
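A gold table that is aggregated and filtered for a time period and region might be derived as in the sketch below. It assumes silver rows carry an ISO-formatted `order_date` string, and the function name `monthly_revenue_by_region` is hypothetical.

```python
from collections import defaultdict

def monthly_revenue_by_region(silver_rows, year):
    """Aggregate revenue per month and region, limited to one year."""
    totals = defaultdict(float)
    for row in silver_rows:
        if row["order_date"].startswith(str(year)):  # filter to the period
            month = row["order_date"][:7]            # "YYYY-MM"
            totals[(month, row["region"])] += row["amount"]
    return [
        {"month": month, "region": region, "revenue": value}
        for (month, region), value in sorted(totals.items())
    ]

silver_rows = [
    {"order_date": "2024-01-05", "region": "EU", "amount": 10.0},
    {"order_date": "2024-01-20", "region": "EU", "amount": 5.0},
    {"order_date": "2024-02-01", "region": "US", "amount": 7.5},
    {"order_date": "2023-12-31", "region": "EU", "amount": 99.0},
]

gold_sales = monthly_revenue_by_region(silver_rows, 2024)
```

A separate gold layer for another domain, such as HR or finance, would apply a different function of this kind to the same silver data.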
Align with Business Logic
Aligning with business logic is crucial for effective data analytics. Because the gold layer models a business domain with a dimensional model, relationships and measures are defined in one place, and analysts with access to the gold layer should be able to find domain-specific data and answer questions.
When business needs differ, such as for HR, finance, and IT, some customers create a separate gold layer for each. This allows them to tailor aggregation and query performance to specific areas of their business.
Silver Layer: Processing
The Silver layer is the processing stage that refines and enriches data. This stage involves cleansing, deduplication, joining, and other transformations that raise data quality and make data ready for downstream usage.
Data enrichment is a complex task, and one of the biggest challenges is dealing with discrepancies in join keys or lookup values. This highlights the importance of monitoring completeness and uniqueness at the attribute level.
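Attribute-level completeness and uniqueness can be measured with a few lines of code. The sketch below assumes hypothetical order rows and a country lookup set; `profile_attribute` and `unmatched_keys` are illustrative helpers, not part of any library.

```python
def profile_attribute(rows, attribute):
    """Return completeness and uniqueness ratios for one attribute."""
    values = [row.get(attribute) for row in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "completeness": len(present) / len(values) if values else 0.0,
        "uniqueness": len(set(present)) / len(present) if present else 0.0,
    }

def unmatched_keys(rows, key, lookup_keys):
    """List non-empty join-key values that have no match in the lookup."""
    return sorted({row[key] for row in rows if row.get(key) and row[key] not in lookup_keys})

orders = [
    {"order_id": "A1", "country": "DE"},
    {"order_id": "A2", "country": "de"},  # case mismatch against the lookup
    {"order_id": "A2", "country": ""},    # duplicate ID, missing country
    {"order_id": "A3", "country": "FR"},
]
country_lookup = {"DE", "FR"}

order_id_profile = profile_attribute(orders, "order_id")
country_profile = profile_attribute(orders, "country")
missing = unmatched_keys(orders, "country", country_lookup)
```

Here the duplicate order ID lowers uniqueness, the empty country lowers completeness, and the lowercase "de" surfaces as a join-key discrepancy before the join silently drops it.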
Profiling data and binning (quarantining) bad-quality records can reduce cloud compute costs, because downstream queries never scan the erroneous data. This approach also ensures that only reliable data progresses to the Gold layer.
Data quality binning, profiling, and circuit breaker methods mitigate the risks associated with poor data quality and improve operational efficiency. By isolating bad data and stopping its influx, you can keep your cloud costs in check.
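One way to combine binning with a circuit breaker is sketched below. The validation rule, the 25% threshold, and the names `is_valid`, `split_batch`, and `CircuitOpen` are all assumptions chosen for the example.

```python
class CircuitOpen(Exception):
    """Raised when too much of a batch fails validation."""

def is_valid(row):
    """Hypothetical rule: a non-empty order ID and a non-negative numeric amount."""
    amount = row.get("amount")
    return bool(row.get("order_id")) and isinstance(amount, (int, float)) and amount >= 0

def split_batch(rows, max_bad_ratio=0.25):
    """Separate good rows from quarantined rows, halting on a bad batch."""
    good = [row for row in rows if is_valid(row)]
    bad = [row for row in rows if not is_valid(row)]
    if rows and len(bad) / len(rows) > max_bad_ratio:
        # Circuit breaker: stop the pipeline instead of loading a bad batch.
        raise CircuitOpen(f"{len(bad)} of {len(rows)} rows failed validation")
    return good, bad

batch = [
    {"order_id": "A1", "amount": 10.0},
    {"order_id": "A2", "amount": 3},
    {"order_id": "", "amount": 4.0},  # missing key, goes to quarantine
    {"order_id": "A4", "amount": 1.0},
    {"order_id": "A5", "amount": 2.5},
]

good_rows, quarantined = split_batch(batch)
```

Only `good_rows` would continue toward gold, while `quarantined` rows can be inspected separately. A batch with too many failures raises `CircuitOpen` so that nothing downstream runs on it.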
Data Quality and Management
Data quality and management are crucial aspects of the Medallion architecture in Azure Databricks. The Medallion architecture is designed to organize data across three distinct layers: Bronze, Silver, and Gold.
Minimal data validation is performed in the bronze layer, and it's recommended to store most fields as string, VARIANT, or binary to protect against unexpected schema changes. Metadata columns might be added, such as the provenance or source of the data.
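The benefit of keeping bronze fields as strings can be seen in a short sketch using Python's standard `csv` module. The metadata column name `_source` and the sample files are assumptions made for illustration.

```python
import csv
import io

def ingest_csv_as_strings(text, source_name):
    """Read every field as a string and tag each row with its source."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = dict(record)            # csv yields string values, no casting
        row["_source"] = source_name  # hypothetical provenance column
        rows.append(row)
    return rows

export_v1 = "id,amount\n1,10.5\n2,n/a\n"
export_v2 = "id,amount,currency\n3,7,EUR\n"  # upstream added a column

bronze_rows = (
    ingest_csv_as_strings(export_v1, "export_v1.csv")
    + ingest_csv_as_strings(export_v2, "export_v2.csv")
)
```

Neither the unparseable amount "n/a" nor the new `currency` column breaks ingestion. Casting and validation are deferred to the silver layer.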
In the Medallion architecture, data moves from raw form to analytics-ready with transparent governance and quality checks. This tiered approach facilitates efficient data processing, management, and analysis.
Here's a brief overview of the Medallion architecture layers:
- Bronze: ingests raw data from sources such as cloud storage, Kafka, and Salesforce, with little or no data cleanup or validation performed.
- Silver: performs data cleanup and validation.
- Gold: contains fewer datasets and is designed for business users.
Terms Explained
Medallion architecture is a design pattern, not a service, and it is not built on Azure Kubernetes Service (AKS). On Azure it is usually implemented on a lakehouse platform such as Azure Databricks or Microsoft Fabric.
AKS is Azure's managed Kubernetes service. It provides a managed control plane for the cluster, which means you don't operate that part of the infrastructure yourself.
Azure handles the provisioning and scaling of the control plane components, allowing you to focus on deploying and managing your applications.
The control plane manages the worker nodes, which are the compute resources that run your application containers.
Applications running on AKS may produce or consume lakehouse data, but the medallion layers themselves do not depend on AKS.
Frequently Asked Questions
What is the difference between ETL and medallion architecture?
ETL (Extract, Transform, Load) is a process for moving and transforming data, whereas medallion architecture is a pattern for organizing data into three layers (Bronze, Silver, Gold). The two are complementary: ETL or ELT pipelines are what move data from one medallion layer to the next.
What are the advantages of medallion architecture?
Medallion architecture offers governance benefits and performance advantages, including maintaining original data copies and reusing processed tables. This approach enables efficient data management and reuse.
What is the difference between medallion architecture and data mesh?
Medallion architecture focuses on organizing data in a lake, while data mesh is an organizational pattern that changes how organizations extract value from data by giving domain teams ownership of their data. The two are not mutually exclusive: a domain team in a data mesh can organize its own data with medallion layers.