Azure Synapse Analytics is a platform for integrating and managing varied data sources, which makes it a strong choice for building a data analytics ecosystem.
By leveraging Azure Synapse, you can create a centralized data repository that stores and processes data from multiple sources, including Azure Blob Storage, Azure Data Lake Storage, and on-premises data sources.
This unified approach allows for seamless data integration and analysis, reducing data silos and improving decision-making capabilities.
Components and Features
Azure Synapse architecture is built on a robust set of components that work together seamlessly. The key components include Azure Synapse, Azure Files, Event Hubs, Blob Storage, Azure Data Lake Storage, Azure DevOps, Power BI, Data Factory, Azure Bastion, Azure Monitor, Microsoft Defender for Cloud, Azure Key Vault, and more.
Azure Synapse uses a scale-out architecture to distribute computational processing of data across multiple nodes, with compute separate from storage. This enables you to scale compute independently of the data in your system.
Here are some of the key services used in Azure Synapse architecture:
- Azure Synapse Analytics
- Azure Data Lake Gen2
- Azure Cosmos DB
- Azure Cognitive Services
- Azure Machine Learning
- Azure Event Hubs
- Azure IoT Hub
- Azure Stream Analytics
- Microsoft Purview
- Azure Data Share
- Microsoft Power BI
- Microsoft Entra ID
- Microsoft Cost Management
- Azure Key Vault
- Azure Monitor
- Microsoft Defender for Cloud
- Azure DevOps
- Azure Policy
- GitHub
A dedicated SQL pool in Azure Synapse architecture is a fully managed, cloud-based, and optimized data warehouse that provides an enterprise-grade solution for managing and querying large datasets.
Dataflow
Dataflow is the backbone of any data-driven solution, and it's essential to understand how data flows through the system. Data is uploaded from various sources, including Azure Blob storage or Azure Files, and stored in the data landing zone.
Data can come from multiple sources, such as different factories, and is uploaded using a batch uploader program or system. Streaming data is captured and stored in Blob Storage using the Capture feature of Azure Event Hubs.
The arrival of data in the data landing zone triggers Azure Data Factory to process the data and store it in the data lake. The data lake is protected by firewall rules and virtual networks, blocking all connection attempts from the public internet.
Azure Data Lake stores raw data from different sources and is the home for data throughout the various stages of the data lifecycle. It's organized into different layers and containers, including the Raw layer, Enriched layer, and Curated layer.
Data moves through these layers in stages. The data lake triggers the Azure Synapse pipeline, which converts data from the Bronze zone to the Silver zone and then to the Gold zone. In the usual medallion naming, these zones correspond to the Raw, Enriched, and Curated layers. Structured data in the Gold zone is stored in Delta Lake format.
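To make the zone-to-zone promotion concrete, here is a minimal sketch in plain Python. It is only a toy model: the `Lake` class, the `factory` and `units` fields, and the validation rule are all assumptions made for illustration, and a real pipeline would run Spark activities against files in the lake rather than in-memory lists.

```python
from dataclasses import dataclass, field


@dataclass
class Lake:
    """Toy in-memory stand-in for a data lake with three zones (hypothetical)."""
    bronze: list = field(default_factory=list)  # raw records, as ingested
    silver: list = field(default_factory=list)  # validated and typed records
    gold: list = field(default_factory=list)    # aggregated, ready to serve


def to_silver(lake: Lake) -> None:
    """Validate and type raw records; skip rows that fail validation."""
    lake.silver = []
    for row in lake.bronze:
        try:
            lake.silver.append(
                {"factory": row["factory"].strip(), "units": int(row["units"])}
            )
        except (KeyError, ValueError, AttributeError):
            continue  # a real pipeline would likely quarantine bad rows instead


def to_gold(lake: Lake) -> None:
    """Aggregate validated rows into one total per factory."""
    totals = {}
    for row in lake.silver:
        totals[row["factory"]] = totals.get(row["factory"], 0) + row["units"]
    lake.gold = [
        {"factory": name, "total_units": total}
        for name, total in sorted(totals.items())
    ]


lake = Lake(bronze=[
    {"factory": " A ", "units": "10"},
    {"factory": "B", "units": "7"},
    {"factory": "A", "units": "5"},
    {"factory": "B", "units": "n/a"},  # fails validation, skipped in silver
])
to_silver(lake)
to_gold(lake)
print(lake.gold)
```

Keeping each promotion step as its own function mirrors the idea that every zone is a separate, re-runnable stage.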
Components
Components play a crucial role in any data lakehouse solution, and understanding the key components is essential to get the most out of your solution. The following components are used in the data lakehouse solution:
- Azure Synapse
- Azure Files
- Event Hubs
- Blob Storage
- Azure Data Lake Storage
- Azure DevOps
- Power BI
- Data Factory
- Azure Bastion
- Azure Monitor
- Microsoft Defender for Cloud
- Azure Key Vault
Some of these components, like Azure Synapse and Power BI, are used to process and visualize data, while others, like Azure Files and Blob Storage, are used to store data. Azure DevOps is used for continuous integration and deployment, and Azure Monitor is used for monitoring and troubleshooting.
Cloud Native HTAP with Cosmos DB and Dataverse
Azure Synapse Link for Azure Cosmos DB and Azure Synapse Link for Dataverse enable you to run near real-time analytics over operational and business application data.
You can access the Azure Cosmos DB analytical store and then combine datasets from your near real-time operational data with data from your data lake or from your data warehouse using a SQL Serverless query or a Spark Pool notebook.
This allows you to integrate operational and business data for a more comprehensive view of your business.
Here are the key benefits of using Azure Synapse Link for Azure Cosmos DB and Azure Synapse Link for Dataverse:
- Run near real-time analytics over operational and business application data
- Access the Azure Cosmos DB analytical store and combine datasets with data from your data lake or data warehouse
- Use a SQL Serverless query or a Spark Pool notebook
You can also access the selected Dataverse tables and combine datasets from your near real-time business applications data with data from your data lake or from your data warehouse using a SQL Serverless query or a Spark Pool notebook.
Data Distribution and Management
In Azure Synapse, a distribution is the basic unit of storage and processing for parallel queries that run on distributed data in a dedicated SQL pool. When a dedicated SQL pool runs a query, the work is divided into 60 smaller queries that run in parallel. Each of the 60 smaller queries runs on one of the data distributions.
Each Compute node manages one or more of the 60 distributions. A dedicated SQL pool with maximum compute resources has one distribution per Compute node.
A dedicated SQL pool with minimum compute resources has all the distributions on one compute node. Adding compute resources spreads the same 60 distributions over more nodes, so each node handles fewer of them.
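The relationship between node count and distributions per node is simple arithmetic, sketched below. The round-robin assignment is an assumption for illustration only (the service decides the actual placement), and the sketch only accepts node counts that divide 60 evenly.

```python
TOTAL_DISTRIBUTIONS = 60  # fixed for a dedicated SQL pool


def distributions_per_node(compute_nodes: int) -> dict:
    """Spread the 60 distributions evenly over the given number of nodes.

    Round-robin placement is an illustrative assumption, not the
    service's documented behavior.
    """
    if compute_nodes < 1 or TOTAL_DISTRIBUTIONS % compute_nodes != 0:
        raise ValueError("this sketch assumes a node count that divides 60 evenly")
    return {
        node: [d for d in range(TOTAL_DISTRIBUTIONS) if d % compute_nodes == node]
        for node in range(compute_nodes)
    }


for nodes in (1, 6, 60):
    layout = distributions_per_node(nodes)
    print(nodes, "node(s):", len(layout[0]), "distribution(s) per node")
```

With one node, that node holds all 60 distributions; with 60 nodes, each holds exactly one.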
To shard data into a hash-distributed table, dedicated SQL pool uses a hash function to deterministically assign each row to one distribution. The hash function uses the values in the distribution column to assign each row to a distribution.
A hash-distributed table can deliver the highest query performance for joins and aggregations on large tables. Because rows are assigned by hash value, the number of rows per distribution can vary from one distribution to the next.
Performance considerations for the selection of a distribution column include distinctness, data skew, and the types of queries that run on the system.
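The effect of the distribution column on data skew can be sketched in a few lines. The hash used here (SHA-256 modulo 60) is a stand-in chosen for this example; the function the dedicated SQL pool actually uses is internal. The column names in the comments are hypothetical.

```python
import hashlib
from collections import Counter

DISTRIBUTIONS = 60


def assign_distribution(value) -> int:
    """Deterministically map a distribution-column value to one of 60 buckets.

    SHA-256 is a stand-in for the engine's internal hash function.
    """
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTIONS


def rows_per_distribution(values) -> Counter:
    """Count how many rows land in each distribution."""
    return Counter(assign_distribution(v) for v in values)


def skew_ratio(counts: Counter) -> float:
    """Largest distribution divided by the mean over all 60 (1.0 is perfectly even)."""
    mean = sum(counts.values()) / DISTRIBUTIONS
    return max(counts.values()) / mean


# High-cardinality column (say, a hypothetical order_id): rows spread out.
distinct = rows_per_distribution(range(60_000))
# Low-cardinality column (say, a hypothetical region code): heavy skew.
few_values = rows_per_distribution(i % 3 for i in range(60_000))

print(f"distinct keys: skew {skew_ratio(distinct):.2f}")
print(f"3 keys:        skew {skew_ratio(few_values):.2f}")
```

A column with only three distinct values can fill at most three of the 60 distributions, which is why distinctness is one of the first things to check when picking a distribution column.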
Data Sources and Integration
You can use Azure Synapse pipelines to pull data from a wide variety of semi-structured data sources, both on-premises and in the cloud.
When organizing your data lake, follow best practices for which layers to create, which folder structures to use in each layer, and which file format to use for each analytics scenario.
To stage data from semi-structured data sources, use a Copy data activity to save data in the original format, as acquired from the data sources.
Here are the main elements involved in staging semi-structured data:
- Semi-structured source files such as CSV, Parquet, or JSON
- The Raw layer of the data lake, which holds data in its original format
- Azure Data Lake Storage Gen2, which hosts the data lake
You can also use Data Explorer pools to easily ingest, consolidate, and correlate logs and IoT events data across multiple data sources, ideal for near real-time telemetry and time-series analytics scenarios.
Semi-Structured Data Sources
Semi-structured data sources can be a challenge to work with, but Azure Synapse pipelines make it easier to pull data from a wide variety of sources, both on-premises and in the cloud.
You can use a Copy data activity to stage the data copied from semi-structured data sources into the raw layer of your Azure Data Lake Storage Gen2 data lake, saving it in its original format.
Organizing your data lake following best practices is key, so create the right folder structures and use the right file formats for each analytics scenario.
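As one possible illustration of a folder structure, the sketch below builds date-partitioned paths per layer. The `layer/source/dataset/year=/month=/day=` convention and the example names are assumptions for this example, not a layout the architecture prescribes.

```python
from datetime import date
from pathlib import PurePosixPath

LAYERS = ("raw", "enriched", "curated")


def lake_path(layer: str, source: str, dataset: str, day: date) -> PurePosixPath:
    """Build a date-partitioned folder path inside one lake layer.

    The folder convention is a hypothetical example.
    """
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return PurePosixPath(
        layer, source, dataset,
        f"year={day.year:04d}", f"month={day.month:02d}", f"day={day.day:02d}",
    )


print(lake_path("raw", "factory-a", "telemetry", date(2024, 3, 9)))
# prints: raw/factory-a/telemetry/year=2024/month=03/day=09
```

Generating paths from one function keeps every pipeline writing to the same structure, which makes later partition pruning and cleanup easier.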
Data flows, SQL serverless queries, or Spark notebooks can be used to validate, transform, and move your datasets into your Curated layer in your data lake.
For near real-time telemetry and time-series analytics scenarios, Data Explorer pools can ingest, consolidate, and correlate logs and IoT events data across multiple data sources.
With Data Explorer pools, you can use Kusto queries (KQL) to perform time-series analysis, geospatial clustering, and machine learning enrichment.
Here are the different tools you can use to work with semi-structured data sources:
- Data flows
- SQL serverless queries
- Spark notebooks
- Data Explorer pools
Your final dataset can be served directly from the data lake Curated layer or ingested into your SQL pool tables using the COPY command for fast ingestion.
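For the ingestion path, the following sketch renders a COPY statement as a string. The table name and storage URL are placeholders, and authenticating with the workspace managed identity is an assumption; adjust the `WITH` options to your own file format and credentials.

```python
def copy_into_sql(table: str, source_url: str, file_type: str = "PARQUET") -> str:
    """Render a COPY INTO statement that loads lake files into a SQL pool table.

    Values are interpolated directly, so only pass trusted, hard-coded inputs.
    """
    if file_type not in {"CSV", "PARQUET", "ORC"}:
        raise ValueError(f"unsupported file type: {file_type}")
    return (
        f"COPY INTO {table}\n"
        f"FROM '{source_url}'\n"
        f"WITH (\n"
        f"    FILE_TYPE = '{file_type}',\n"
        f"    CREDENTIAL = (IDENTITY = 'Managed Identity')\n"
        f");"
    )


# Hypothetical table and storage location.
sql = copy_into_sql(
    "dbo.FactSales",
    "https://<account>.dfs.core.windows.net/<container>/curated/sales/*.parquet",
)
print(sql)
```

The generated statement would then be run against the dedicated SQL pool, for example from a pipeline activity or a SQL script.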
Non-Structured Data Sources
Non-structured data sources can be a treasure trove of information, but they require some extra effort to organize and process.
You can use Azure Synapse pipelines to pull data from a wide variety of non-structured data sources, both on-premises and in the cloud. This includes raw data from sources like Office documents, PDFs, images, audio, forms, and web pages.
Organizing your data lake by following best practices will help you create a clear and logical structure. This includes creating specific layers, folder structures, and file formats for each analytics scenario.
To stage the data copied from non-structured data sources, use a Copy data activity in an Azure Synapse pipeline to copy the data into the raw layer of your Azure Data Lake Storage Gen2 data lake.
Spark notebooks can be used to validate, transform, and enrich your datasets, moving them from the Raw layer through the Enriched layer and into your Curated layer in the data lake.
The final dataset can be served directly from the data lake's Curated layer or ingested into your data warehouse tables using the COPY command for fast ingestion.
Identity & Access Control
Identity & Access Control is a crucial aspect of Azure Synapse Architecture. It involves managing access to different layers of the system, ensuring that users have the right level of access to critical resources.
To achieve this, the architecture relies on native controls and favors simplicity, as the security design principles of the Azure Well-Architected Framework recommend. In practice, this means using the end user's Microsoft Entra user account in both the application layer and the Azure Synapse database access layer.
Fine-grained access control is provided using native first-party IAM solutions. This approach is more secure and easier to manage than other methods.
Least-privileged access is also a guiding principle in Azure Synapse. This means providing just-in-time and just enough access to critical resources, as per the Zero Trust principle.
To enhance security in the future, Microsoft Entra Privileged Identity Management (PIM) can be used.
Linked services define the connection information needed for a service to connect to external resources. Securing linked services configurations is essential to prevent unauthorized access.
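One common way to keep secrets out of a linked service definition is to reference a Key Vault secret instead of embedding the value. The sketch below builds such a definition as a Python dictionary. The service names, secret name, and connection string are placeholders, and the exact property names can differ by connector type, so treat this as an assumption-laden illustration rather than a template.

```python
import json


def key_vault_secret(vault_linked_service: str, secret_name: str) -> dict:
    """Reference a secret held in Key Vault instead of embedding its value."""
    return {
        "type": "AzureKeyVaultSecret",
        "store": {
            "referenceName": vault_linked_service,
            "type": "LinkedServiceReference",
        },
        "secretName": secret_name,
    }


# All names below are hypothetical placeholders.
linked_service = {
    "name": "SourceSqlDb",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": (
                "Server=tcp:<server>.database.windows.net;"
                "Database=<db>;User ID=<user>"
            ),
            "password": key_vault_secret("KeyVaultLinkedService", "source-sql-password"),
        },
    },
}

print(json.dumps(linked_service, indent=2))
```

Because the definition holds only a reference, rotating the secret in Key Vault does not require editing or redeploying the linked service.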
Here's a summary of the identity and access control components in Azure Synapse:
- Microsoft Entra user account for end users in the application and Azure Synapse database access layers
- Azure Synapse managed identity for the layer where Azure Synapse accesses external resources and for the Data Lake access layer
Cost Optimization and Deployment
Cost optimization is a key benefit of the Azure Synapse architecture, allowing you to scale your compute and storage levels independently. This means you only pay for what you use, with compute resources charged based on usage and storage resources billed per terabyte.
To estimate costs, use the Azure Pricing Calculator. The ideal pricing tier and the total cost of each service in the architecture depend on the amount of data to be processed and stored, as well as the performance level you expect.
Azure Synapse Serverless SQL, Apache Spark in Azure Synapse, Azure Synapse Pipelines, and Azure Data Lake Storage all use consumption-based billing, so you only pay for what you use. This can help reduce unnecessary expenses and improve operational efficiency.
Cost Optimization
Cost optimization is a crucial aspect of any data-driven solution, and the data lakehouse solution is no exception. It's designed to be cost-efficient and scalable, with most components using consumption-based billing and autoscaling.
Data is stored in Data Lake Storage, and you only pay for what you use. This means you can scale up or down as needed, without incurring unnecessary costs. Pricing for the solution depends on the usage of key resources like Azure Synapse Serverless SQL, Apache Spark in Azure Synapse, and Azure Synapse Pipelines.
These resources use consumption-based billing, so you only pay for what you use. Private Link also uses consumption-based billing, but Power BI costs are based on the license you purchase. The Azure Pricing Calculator can help you estimate the cost of the solution.
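Consumption-based billing comes down to usage multiplied by a unit rate, which the sketch below adds up for three of the resources mentioned. The rates in the example are made-up placeholders, not Azure prices, and the billing units (per TB processed, per vCore-hour, per TB stored per month) are simplifying assumptions; take real figures from the Azure Pricing Calculator.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Placeholder unit prices; replace with figures from the Azure Pricing Calculator."""
    serverless_sql_per_tb_processed: float
    spark_per_vcore_hour: float
    storage_per_tb_month: float


def monthly_estimate(rates: Rates, tb_processed: float,
                     spark_vcore_hours: float, tb_stored: float) -> float:
    """Sum usage times rate for each consumption-billed resource."""
    return (
        tb_processed * rates.serverless_sql_per_tb_processed
        + spark_vcore_hours * rates.spark_per_vcore_hour
        + tb_stored * rates.storage_per_tb_month
    )


# Made-up rates and usage, for illustration only.
example_rates = Rates(
    serverless_sql_per_tb_processed=5.0,
    spark_per_vcore_hour=0.25,
    storage_per_tb_month=20.0,
)
estimate = monthly_estimate(example_rates, tb_processed=12,
                            spark_vcore_hours=400, tb_stored=30)
print(f"Estimated monthly cost: {estimate:.2f}")
```

Pipeline runs, Private Link traffic, and Power BI licenses are left out here to keep the arithmetic short; they would be added as further terms or fixed costs.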
The cost of security protection solutions varies, so choose a solution based on your business needs and budget. Understanding how each service is priced helps you make informed decisions about your data lakehouse solution and optimize costs.
Deploy This Scenario
To deploy this scenario, you can use the companion repository available in GitHub. This repository shows how to automate the deployment of the services covered in the architecture.
A deployment guide for Azure analytics end-to-end with Azure Synapse is also available. It provides detailed instructions and multiple deployment options for deploying the architecture to your subscription.
Frequently Asked Questions
How does Azure Synapse work?
Azure Synapse combines enterprise data warehousing and big data analytics, letting you query data at scale using either serverless or dedicated resources, whichever suits the workload.
Is Azure Synapse a PaaS or SaaS?
Azure Synapse is a Platform-as-a-service (PaaS) offering, providing a managed environment for analytics workloads. It's not a Software-as-a-service (SaaS) solution, but rather a cloud-based platform for data integration and analytics.
What are the three components of Azure Synapse analytics?
Azure Synapse Analytics combines three key components: data integration, enterprise data warehousing, and big data analytics. These components work together to provide a unified analytics service for businesses.
What type of system is Azure Synapse analytics?
Azure Synapse Analytics is a cloud-based analytics service that combines data warehouses, big data systems, and log analytics. It's an enterprise-grade platform that accelerates insights from various data sources.