Azure Data Hub Architecture and Key Features


Azure Data Hub is a comprehensive data management platform that enables organizations to integrate, process, and analyze large amounts of data from various sources. It's designed to handle complex data workflows and provide real-time insights.

At its core, Azure Data Hub is built on a scalable architecture that allows for seamless integration with various Azure services. This architecture is based on a hub-and-spoke model, where the hub is the central data repository and the spokes are the various data sources and sinks.

One of the key features of Azure Data Hub is its ability to process and analyze large amounts of data in real-time. This is made possible by its use of distributed computing and parallel processing techniques.

Azure Data Hub Architecture

The Azure Data Hub Service is hosted in the cloud on Virtual Machines inside a Virtual Network. MarkLogic manages this setup, freeing clients from day-to-day operational concerns.

A load balancer sits in front of the network instance to distribute incoming transactions, enabling smooth communication with the ever-changing number of MarkLogic servers. Auto-scaling configurations automatically increase and decrease the number of resources as usage spikes and drops.

The network instance can be configured as publicly accessible or private; a private instance requires VNet peering to establish a connection with another VNet.

Hub Service Architectural Overview

The Data Hub Service on Azure is hosted in the cloud via Virtual Machines inside a Virtual Network. This network instance employs auto-scaling configurations to automatically increase and decrease the number of resources as usage spikes and drops.

A load balancer sits in front to distribute incoming transactions, enabling smooth communication with the ever-changing number of MarkLogic servers.

The VNet can be configured to be publicly accessible; alternatively, a private VNet requires peering to establish a connection with another VNet. If you're interested in learning more about a private Data Hub Service instance, see Setting Up MarkLogic Data Hub Service with a Private VNet.

To sign up for Data Hub Service on Azure, you'll first need an Azure account; then send a request to your MarkLogic representative or contact MarkLogic directly.

Prerequisite

To set up an Azure Data Hub, you'll need to create a DataHub Application within the Azure AD Portal. This application requires specific permissions to read your organization's Users and Groups.

The required permissions are of the Application type: Group.Read.All, GroupMember.Read.All, and User.Read.All.

Config Details

To configure Azure Data Hub, you should first create a DataHub Application within the Azure AD Portal, which requires permissions to read your organization's Users and Groups.

The JSON schema for the configuration is inlined in the YAML recipe; you won't need to worry about it unless you're working with nested fields, which are denoted by a dot (.).

To add the necessary permissions, navigate to the permissions tab in your DataHub application on the Azure AD portal and select the Application permission type.

You'll need to grant the application permission to read Users and Groups. You can also click the Endpoints button in the Overview tab to view the endpoints you'll need when configuring the connector.
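
To make the shape of that recipe configuration concrete, here's a minimal sketch that drives the same azure-ad source from Python with DataHub's Pipeline API instead of a YAML file. The field names follow the DataHub Azure AD source documentation, and the client ID, tenant ID, secret, and DataHub server URL are placeholders you'd replace with your own values.

```python
# Hypothetical sketch: running the DataHub azure-ad source from Python rather
# than a YAML recipe. All IDs, the secret, and the server URL are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

tenant_id = "<your-tenant-id>"

pipeline = Pipeline.create(
    {
        "source": {
            "type": "azure-ad",
            "config": {
                "client_id": "<datahub-application-client-id>",
                "tenant_id": tenant_id,
                "client_secret": "<client-secret>",
                "redirect": "https://login.microsoftonline.com/common/oauth2/nativeclient",
                "authority": f"https://login.microsoftonline.com/{tenant_id}",
                "token_url": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
                "graph_url": "https://graph.microsoft.com/v1.0",
                "ingest_users": True,
                "ingest_groups": True,
                "ingest_group_membership": True,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```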

Data Hub Features

The Data Hub is available from the Synapse Studio left menu and is represented by the database cylinder icon. It provides quick access to your workspace and linked data stores through convenient Action (context) menus.

You can add a new SQL database, connect to external data, create integration datasets, or browse the Knowledge Center gallery from the + menu in the Data Hub header. Over 95 connectors to various data-centric storage technologies are available in the Linked tab of the Data Hub blade.

In the Linked tab, you can find linked external datasets and integration datasets used in data flows and pipelines, as well as sample data obtained from Azure Open Datasets.

Creating a Serverless SQL Pool

Creating a serverless SQL pool is a straightforward process within the Data Hub. Select SQL database from the + menu on the Data Hub blade.

You'll then see a Create SQL database blade appear to the right, giving you the choice between a serverless or dedicated SQL pool type. Choose the serverless SQL pool type.

Next, you'll need to name a database associated with that serverless pool. Select the Create button to deploy the serverless SQL Pool.

After a few minutes, refresh the Workspace tab in the Data Hub blade to view the newly created database.

Augmenting with Open Datasets

Augmenting with Open Datasets is a powerful feature in Azure Synapse. You can link your existing data to Azure Open Datasets to gain new insights.

To get started, navigate to the Data Hub and expand the + menu. Select Browse gallery to explore available datasets. The Datasets tab in the gallery is where you'll find the Bing COVID-19 Data card. Choose this card and select the Continue button to proceed.

An informational screen will display, along with a preview of the data you can expect from the dataset. Select the Add dataset button to include this data in Azure Synapse. The COVID-19 data will now be available in the Data Hub under the Linked tab beneath the Azure Blob Storage heading.

To preview the data, expand the Azure Blob Storage item and the actions menu next to the bing-covid-19-data folder. Select New SQL script and then Select TOP 100 rows. This will give you a snapshot of the data within the dataset.
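
If you'd rather run that same query outside Synapse Studio, here's a rough sketch using pyodbc against the workspace's serverless SQL endpoint. The server name, credentials, and dataset path below are placeholders; copy the real path from the SQL script Synapse Studio generates for you.

```python
# Illustrative sketch: querying the linked open dataset through the Synapse
# serverless SQL endpoint with pyodbc. Server, credentials, and the dataset
# URL are placeholders taken from the generated SQL script.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<workspace-name>-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Uid=<sql-admin-user>;Pwd=<password>;"
    "Encrypt=yes;TrustServerCertificate=no;"
)

query = """
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.blob.core.windows.net/<container>/bing-covid-19-data/**',
    FORMAT = 'PARQUET'
) AS covid
"""

for row in conn.cursor().execute(query):
    print(row)
```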

Extracting Groups

Extracting Groups is a powerful feature in Data Hub that allows you to extract group membership information from Azure AD.

This connector extracts the edges between Users and Groups that are stored in Azure AD, mapping them to the GroupMembership aspect associated with DataHub users (CorpUsers).

Be aware that this has the unfortunate side effect of overwriting any Group Membership information that was created outside of the connector.

If you've used the DataHub REST API to assign users to groups, this information will be overridden when the Azure AD Source is executed.

If you intend to always pull users, groups, and their relationships from your Identity Provider, then this should not matter.

Schema Registry

A centralized repository for managing schemas of event streaming applications comes free with every Event Hubs namespace.

Azure Schema Registry integrates with your Kafka applications or Event Hubs SDK-based applications, ensuring data compatibility and consistency across event producers and consumers.

Schema Registry enables schema evolution, validation, and governance, promoting efficient data exchange and interoperability.

It supports multiple schema formats, including Avro and JSON schemas, making it a versatile tool for managing event streaming data.

By using Schema Registry, you can easily manage and govern your event streaming data, ensuring it's accurate and consistent across your applications.
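
As a small illustration, the sketch below registers and then retrieves an Avro schema with the azure-schemaregistry Python package; the namespace and schema group names are placeholders, and a schema group must already exist in the Event Hubs namespace.

```python
# Minimal sketch: registering and fetching an Avro schema with Azure Schema
# Registry. Namespace and schema group names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.schemaregistry import SchemaRegistryClient

FQNS = "<your-namespace>.servicebus.windows.net"
GROUP = "<your-schema-group>"

avro_definition = """
{
  "type": "record",
  "name": "Order",
  "namespace": "example",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

client = SchemaRegistryClient(FQNS, DefaultAzureCredential())

# Register the schema and note its server-assigned ID.
props = client.register_schema(GROUP, "Order", avro_definition, "Avro")
print("schema id:", props.id)

# Consumers can later resolve the full definition from that ID.
schema = client.get_schema(props.id)
print(schema.definition)
```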

Core Functions and Kafka SDKs

Event Hubs supports the industry-standard AMQP 1.0 protocol, which has a broad ecosystem of available tools and technologies.

You can use client languages like .NET, Java, Python, and JavaScript to start processing your streams from Event Hubs, with low-level integration provided by all supported client languages.

Event Hubs integrates with Azure Functions for serverless architectures, allowing you to process your streams in real-time and get actionable insights.

A partitioned consumer model in Event Hubs enables multiple applications to process the stream concurrently, giving you control over the speed of processing.

SDKs are available for various languages, making it easy to integrate Event Hubs with your existing applications and workflows.
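
To give a concrete feel for that SDK integration, here's a minimal sketch that publishes a batch of events with the azure-eventhub Python package; the connection string and event hub name are placeholders.

```python
# Minimal sketch: publishing events with the azure-eventhub SDK.
# Connection string and event hub name are placeholders.
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",
    eventhub_name="<event-hub-name>",
)

with producer:
    batch = producer.create_batch()          # respects the hub's size limits
    batch.add(EventData('{"sensor": "a1", "reading": 21.5}'))
    batch.add(EventData('{"sensor": "a2", "reading": 19.8}'))
    producer.send_batch(batch)               # one round trip for the whole batch
```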

Core Capabilities

Azure Data Hub is a powerful tool that offers several core capabilities. It provides a unified data platform that enables you to manage and process large amounts of data from various sources.

With Azure Data Hub, you can easily integrate data from different systems and applications, and use it to gain insights and make informed decisions. This is made possible through its support for multiple data formats and protocols.

One of the key benefits of Azure Data Hub is its ability to scale with your growing data needs. Whether you're dealing with small or large datasets, it can handle the load with ease.

HTAP with Synapse Link for Cosmos DB is a game-changer for businesses.

You can connect to your analytical store hosted in Azure Cosmos DB directly from Azure Synapse Analytics through Azure Synapse Link. This connection enables data to flow from Azure Cosmos DB to Azure Synapse without the use of ETL mechanisms.

Azure Synapse Link provides a cloud-native HTAP capability that delivers near-real-time data into analytical queries, Power BI dashboards, and machine learning pipelines.

To enable Azure Synapse Link, open your Cosmos DB resource in the Azure Portal and select the Features item from beneath the Settings heading.

From the Features listing, select Azure Synapse Link and the Azure Synapse Link blade will appear on the right side of the screen. Select the Enable button to enable this feature.

Back in Synapse Studio, select Connect to external data from the + menu in the Data Hub. The Connect to external data blade will appear on the screen's right; select one of the Azure Cosmos DB API options, such as the SQL API.

Name the linked service and connect to your Azure Cosmos DB resource and analytical store container using your desired authentication method. Once complete, refresh the Data Hub screen to see your HTAP enabled container located in the Linked tab under the Azure Cosmos DB section.

Query this data quickly by selecting a collection, expanding its Actions menu, and choosing New SQL script, then Select TOP 100 rows.
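
If you prefer Spark over the generated SQL script, the analytical store can also be read from a Synapse notebook. The sketch below assumes placeholder linked service and container names, which you'd swap for the ones created above.

```python
# Sketch for a Synapse Spark notebook: reading the Cosmos DB analytical store
# over Synapse Link. The `spark` session is provided by the notebook; the
# linked service and container names are placeholders.
df = (
    spark.read.format("cosmos.olap")
    .option("spark.synapse.linkedService", "<cosmos-linked-service-name>")
    .option("spark.cosmos.container", "<container-name>")
    .load()
)

df.printSchema()
df.limit(100).show()   # rough equivalent of the Select TOP 100 rows script
```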

Key Capabilities

One key capability of the DataHub Azure AD connector is detecting deleted entities, which is optionally enabled via stateful ingestion.

This feature is particularly useful for identifying and handling deleted users, groups, and group memberships, which are extracted from Azure AD and mapped to the GroupMembership aspect associated with DataHub users (CorpUsers).

Here are some key entities that can be detected:

  • Users
  • Groups
  • Group Membership

Additionally, the connector allows you to have users ingested into DataHub before they log in for the first time, enabling actions like adding them to a group or assigning them a role.
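
As a rough sketch of what enabling stateful ingestion looks like, here's the source section of the recipe expressed as a Python dict, matching the earlier example; the pipeline_name and flags follow DataHub's stateful ingestion documentation and may vary by version.

```python
# Sketch: enabling stateful ingestion on the azure-ad source so deleted users,
# groups, and memberships are soft-deleted in DataHub. A stable pipeline_name
# is required so successive runs can be compared; flag names follow DataHub's docs.
recipe = {
    "pipeline_name": "azure_ad_ingestion",   # must stay constant across runs
    "source": {
        "type": "azure-ad",
        "config": {
            # ...connection settings as shown earlier...
            "stateful_ingestion": {
                "enabled": True,
                "remove_stale_metadata": True,
            },
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}
```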

Event Hubs

Event Hubs is a multi-protocol event streaming engine that natively supports Advanced Message Queuing Protocol (AMQP), Apache Kafka, and HTTPS protocols.

You can bring Kafka workloads to Event Hubs without making any code changes, eliminating the need to set up, configure, or manage your own Kafka clusters or use a Kafka-as-a-service offering that's not native to Azure.

Event Hubs is built as a cloud native broker engine, allowing you to run Kafka workloads with better performance, better cost efficiency, and no operational overhead.

You can experience flexible and cost-efficient event streaming through the Standard, Premium, or Dedicated tiers for Event Hubs, catering to data streaming needs that range from a few MB/sec to several GB/sec.

Event Hubs uses a partitioned consumer model, enabling multiple applications to process the stream concurrently and letting you control the speed of processing.

With Event Hubs, you can ingest, buffer, store, and process your stream in real time to get actionable insights, and it integrates with Azure Functions for serverless architectures.

A broad ecosystem is available for the industry-standard AMQP 1.0 protocol, and SDKs are available in languages like .NET, Java, Python, and JavaScript, allowing you to start processing your streams from Event Hubs.

You can use the samples provided to stream data from your Kafka applications to Event Hubs, making it easy to integrate your existing infrastructure.
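
The "no code changes" claim comes down to pointing an ordinary Kafka client at the namespace's Kafka endpoint. Here's a minimal sketch with the confluent-kafka Python client, using placeholder namespace and connection-string values.

```python
# Sketch: an unmodified Kafka producer talking to Event Hubs' Kafka endpoint.
# Only the connection settings change; namespace, hub, and connection string
# are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "<namespace>.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "$ConnectionString",
    "sasl.password": "<event-hubs-namespace-connection-string>",
})

producer.produce("<event-hub-name>", value=b'{"sensor": "a1", "reading": 21.5}')
producer.flush()   # the event hub behaves as the Kafka topic
```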

Streaming and Analytics

Streaming and analytics are closely tied together in Azure Data Hub. With Event Hubs, you can process streaming events in real-time using Stream Analytics. This allows you to develop a job using the no-code editor or SQL-based query language.

Event Hubs integrates with Azure Stream Analytics, enabling real-time stream processing. This integration is a game-changer for businesses that need to analyze data in real-time. You can develop a Stream Analytics job using drag-and-drop functionality without writing any code.

Azure Data Explorer is a fully managed platform for big data analytics that delivers high performance and allows for near real-time analysis of large volumes of data. By integrating Event Hubs with Azure Data Explorer, you can perform near real-time analytics and exploration of streaming data.
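
Once an Event Hubs data connection has landed events in a Data Explorer table, querying them from Python is short. Here's a sketch with the azure-kusto-data package; the cluster, database, and table names are placeholders.

```python
# Sketch: querying events that an Event Hubs data connection has landed in an
# Azure Data Explorer table. Cluster, database, and table names are placeholders.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://<cluster-name>.<region>.kusto.windows.net"
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)

client = KustoClient(kcsb)
response = client.execute("<database>", "<table> | take 100")

for row in response.primary_results[0]:
    print(row)
```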

Event Hubs also allows you to capture streaming data for long-term retention and batch analytics. You can achieve this behavior on the same stream that you use for deriving real-time analytics. Setting up capture of event data is fast.

If you're already using Apache Kafka, you can stream data from your Kafka applications to Event Hubs using the provided samples. This integration makes it easy to bring your existing data into Azure Data Hub.

Streaming and Storage

You can process streaming events in real-time with Stream Analytics, which integrates with Event Hubs. This allows you to develop a Stream Analytics job without writing any code using the built-in no-code editor.

Event Hubs integrates with Azure Data Explorer for near real-time analytics and exploration of streaming data. With this integration, you can perform complex analyses on large volumes of data.

For long-term retention or micro-batch processing, you can capture your data in near real-time in Azure Blob Storage or Azure Data Lake Storage. Setting up capture of event data is fast.
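
Because Capture writes Avro files, reading the data back for batch processing is straightforward. Here's a sketch using azure-storage-blob and fastavro, with placeholder storage and path values.

```python
# Sketch: reading Avro files that Event Hubs Capture has written to Blob
# Storage. Connection string, container, and blob prefix are placeholders;
# captured records expose the original payload in the "Body" field.
import io
from azure.storage.blob import ContainerClient
from fastavro import reader

container = ContainerClient.from_connection_string(
    "<storage-account-connection-string>", "<capture-container>"
)

for blob in container.list_blobs(name_starts_with="<namespace>/<event-hub>/"):
    data = container.download_blob(blob.name).readall()
    for record in reader(io.BytesIO(data)):
        print(record["Body"])
```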

Streaming data from Kafka applications to Event Hubs is also possible using the provided samples.

Streaming with Event Hubs

Streaming with Event Hubs is a powerful way to handle large volumes of data in near real-time. Event Hubs is a cloud-native broker engine that supports Apache Kafka, AMQP, and HTTPS protocols.

You can integrate Event Hubs with Azure Stream Analytics for real-time stream processing, without writing any code. This allows you to develop a Stream Analytics job using a no-code editor or the SQL-based query language.

Event Hubs is also designed to handle a wide range of message sizes, up to 20 MB, with self-serve scalable dedicated clusters at no extra charge. This makes it a flexible and cost-efficient option for event streaming.

Here are some of the key benefits of using Event Hubs for streaming:

  • Supports Apache Kafka and AMQP protocols
  • Integrates with Azure Stream Analytics for real-time processing
  • Handles message sizes up to 20 MB
  • Offers flexible and cost-efficient event streaming options

Stream Using Event Hubs SDK

You can use the Event Hubs SDK to stream data to Event Hubs using various programming languages such as .NET Core, Java, Spring, Python, JavaScript, Go, C, and Apache Storm.

The Event Hubs SDK is a powerful tool that allows you to stream data to Event Hubs from your applications. You can use the SDK to send and receive large messages up to 20 MB in size.

Event Hubs supports a wide range of message sizes, from lightweight messages less than 1 MB to larger messages up to 20 MB. This capability ensures uninterrupted business operations.

Here are some programming languages you can use with the Event Hubs SDK:

  • .NET Core
  • Java
  • Spring
  • Python
  • JavaScript
  • Go
  • C (send only)
  • Apache Storm (receive only)
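
On the receiving side, here's a minimal sketch with the SDK's EventHubConsumerClient; the connection string and event hub name are placeholders, and $Default is the built-in consumer group.

```python
# Minimal sketch: receiving events with the azure-eventhub SDK using the
# default consumer group. Connection string and hub name are placeholders.
from azure.eventhub import EventHubConsumerClient

def on_event(partition_context, event):
    print(partition_context.partition_id, event.body_as_str())
    partition_context.update_checkpoint(event)   # in-memory checkpoint only here

client = EventHubConsumerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",
    consumer_group="$Default",
    eventhub_name="<event-hub-name>",
)

with client:
    # Blocks and reads from the beginning of each partition.
    client.receive(on_event=on_event, starting_position="-1")
```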

Streaming with Apache Kafka

You can stream data by using Apache Kafka, which allows you to stream data from your Kafka applications to Event Hubs. This is a great option for those who already have Kafka applications set up.

Event Hubs offers flexible and cost-efficient event streaming through its Standard, Premium, or Dedicated tiers, making it easy to find a match that suits your data streaming needs.

To get started with streaming data from Kafka to Event Hubs, you can use the provided samples, which will help you set up the connection with ease.

Frequently Asked Questions

What is the difference between Azure DataHub and Data Lake?

Data Hub and Data Lake serve different purposes: Data Hub integrates and shares data for consistency and governance, while Data Lake stores raw data for advanced analytics and machine learning. Choose between them based on your data management needs.
