Streamlining Data Ingestion with Azure Dataverse and Databricks


Azure Dataverse is a powerful tool for managing and integrating data from various sources, and when combined with Databricks, it becomes a game-changer for data ingestion.

By using Azure Dataverse with Databricks, you can easily connect to multiple data sources, including databases, files, and cloud services, and unify them into a single, cloud-based repository.

This streamlined approach to data ingestion enables faster and more accurate data analysis, which is essential for making informed business decisions.

With Azure Dataverse and Databricks, you can also automate data processing and transformation, reducing manual effort and minimizing errors.


Getting Started

Azure Dataverse is a great platform to get started with, and here's a brief overview of what you need to know.

To get started with Azure Dataverse, you'll first need to create an account on the Azure portal. This will give you access to the Dataverse platform.

Azure Dataverse is built on Microsoft Dataverse, a cloud-based data service that stores structured and semi-structured data. It's a great place to store and manage your data.


The Dataverse platform allows you to create and manage databases, tables, and fields, as well as import and export data. You can also create custom apps and workflows using the Dataverse platform.

Dataverse provides a range of features to help you manage your data, including data validation, data encryption, and data backup and recovery. These features help ensure the integrity and security of your data.

Data Ingestion

Azure Synapse Link for Dataverse is the primary tool for accessing data in Microsoft Dynamics and exporting it in Common Data Model (CDM) format. The exported data is commonly processed in Azure Databricks to gain insights and prepare it for downstream consumption.

You can ingest CDM data in Databricks using several methods, including Azure Synapse Link with ADLS, Delta Lake format, and Azure Data Factory. However, these alternatives require setting up and maintaining additional services, making them complex and costly.

To implement incremental ingestion of CDM data, you can leverage an existing Link with ADLS. This can be achieved with a cost-effective solution that meets business requirements, such as the one provided in a GitHub repository.

Here are the main topics that come into play when reading CDM data:

  • Azure Synapse Link for Dataverse
  • Azure Synapse Link for Dataverse with ADLS Gen2
  • Metadata file (model.json)
  • Table Data Folders
  • Option Set Files
  • Advanced Configurations
  • Data Partitioning
  • In-place Updates vs. Append-only

Efficiently Ingesting CDM Tables with Databricks


Ingesting Common Data Model (CDM) tables with Databricks can be a complex task, especially when dealing with large data volumes and near real-time use cases. Azure Synapse Link for Dataverse is the primary tool for accessing data in Microsoft Dynamics and exporting it in CDM format.

Azure Databricks is a natural platform to process this data, but doing so cost-effectively and on time can be challenging. This is where a purpose-built, cost-effective solution helps meet business requirements.

There are several ways to leverage the Dataverse Link to export tables to ADLS and to further ingest them in Databricks. One option is to use Azure Synapse Link with ADLS in combination with Azure Data Factory, but this requires setting up and maintaining additional services.

A more efficient approach is to use Azure Synapse Link with ADLS and Append-only mode in Databricks. This option is simpler and less costly than the previous one.


One key consideration when ingesting CDM data in Databricks is the file format:

The data is typically ingested as UTF-8 multiline CSV files with no header, using the comma (,) as the field delimiter and double quotes (") to quote text fields.
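As an illustration, the multiline quoting behavior can be reproduced with plain Python's csv module; the sample rows below are hypothetical, and in Databricks you would typically read such files with spark.read.csv instead:

```python
import csv
import io

# Hypothetical CDM-style export: no header row, comma-delimited,
# double-quoted text fields that may contain embedded newlines.
raw = (
    '"1","Contoso Ltd","Line one\nLine two","2024-01-15T10:30:00Z"\n'
    '"2","Fabrikam","Single line","2024-01-16T08:00:00Z"\n'
)

# csv.reader keeps a quoted field together even when it spans lines
rows = list(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))
```

In Databricks, the equivalent read would pass multiLine=True, header=False, and quote='"' as options to spark.read.csv.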

Azure Synapse Link is a powerful tool that allows you to export data from Microsoft Dataverse to Azure Synapse Analytics and Azure Data Lake Storage (ADLS) Gen2 in near real-time. It seamlessly integrates with Synapse by creating ready-to-use tables in the target workspace.

The Azure Synapse Link for Dataverse, formerly known as Export to data lake, facilitates near real-time insights over the data in the Microsoft Dataverse. It supports initial and incremental writes for data and metadata of standard and custom tables, automatically pushing changes from Dataverse to the destination without manual intervention.

With Azure Synapse Link, you can export data to Azure Synapse Analytics and ADLS Gen2 in the Common Data Model (CDM) format. This allows for seamless analytics, business intelligence, and machine learning applications on Dataverse data.



To create a link from one environment to multiple Azure Synapse Analytics workspaces and Azure data lakes in your Azure subscription, you can follow these steps:

  • Sign in to Power Apps and select your preferred environment.
  • On the left navigation pane, select Azure Synapse Link.
  • On the command bar, select + New link.
  • Select the Connect to your Azure Synapse workspace option.
  • Select the Subscription, Resource group, Workspace name, and Storage account.
  • Add the tables you want to export, and then select Save.

The link creates the following elements in the target container:

  • The metadata file (model.json), providing a list of tables exported to the data lake.
  • A folder for each table, including near real-time data and read-only snapshot data.
  • A folder (Microsoft.Athena.TrickleFeedService) containing the option set files, one for each table.

Azure Synapse Link for Dataverse with ADLS Gen2 connects the Dataverse data to a storage account in the same tenant and region as the Power Apps environment that creates it. When you create a new link or add new tables to an existing link, the service performs an initial sync to the target storage account; a few hours later, the incremental updates start.

Data Management

Dataverse offers a secure and cloud-based storage option for your data, making it easy to manage and secure. Both the metadata and data are stored in the cloud, so you don't need to worry about the details of how they're stored.

You can easily integrate data from multiple sources into a single store using Dataverse, allowing you to take advantage of data from other applications in Power Apps. This can be done through scheduled integration, transforming and importing data using Power Query, or a one-time import of data from Excel and CSV files.


Dataverse provides a rich metadata experience, with data types and relationships used directly within Power Apps. This enables you to define calculated columns, business rules, workflows, and business process flows to ensure data quality and drive business processes.

Here are some key benefits of using Dataverse for data management:

  • Easy to manage and secure
  • Supports scheduled integration with other systems
  • Allows for transforming and importing data using Power Query
  • Supports one-time import of data from Excel and CSV files

Synapse Workspace Login

To log in to your Synapse workspace, start by signing in to Power Apps and selecting your preferred environment. This is the first step to establishing a connection between Dataverse and your Synapse workspace.

You'll need to navigate to the left pane and select Azure Synapse Link, which might be hidden under the …More option. If you don't see it, try selecting Discover all and look for it in the Data Management section.

Once you've found Azure Synapse Link, click on the + New link button on the command bar to initiate the connection process. You'll then need to select the Connect to your Azure Synapse workspace option to proceed.


To complete the connection, you'll need to provide your Subscription, Resource group, Workspace name, and Storage account information. Make sure your Synapse workspace and storage account meet the requirements outlined in the Prerequisites section.

Here's a quick rundown of the required information:

  • Subscription
  • Resource group
  • Workspace name
  • Storage account

By following these steps, you'll establish a secure connection between Dataverse and your Synapse workspace, allowing you to export data and meet your data management needs.

Ingesting CDM Data

Ingesting CDM Data is a crucial step in data management, and there are several ways to do it efficiently.

One way to ingest CDM Data is by using the Azure Synapse Link for Dataverse, which can export data to Azure Synapse Analytics and Azure Data Lake Storage (ADLS) Gen2 in the Common Data Model (CDM) format.

Azure Synapse Link for Dataverse is the primary tool for accessing data in Microsoft Dynamics and exporting it in Common Data Model (CDM) format.


There are several alternatives, including Azure Synapse Link with a Synapse workspace using the incremental folder update structure, Azure Synapse Link exporting in Delta Lake format, and Azure Synapse Link with ADLS in combination with Azure Data Factory.

However, these alternatives require setting up and maintaining additional services, which can be complex and costly.

A simpler alternative is to use Azure Synapse Link with ADLS in Append-only mode together with Databricks, which can ingest CDM data without the need for additional services.

To ingest CDM data in Databricks, you can use the Azure Synapse Link for Dataverse with ADLS Gen2, which creates a link between Dataverse and a storage account in the same tenant and region as the Power Apps environment.

The service writes the data files to a container named dataverse-environmentName-organizationUniqueName, where environmentName and organizationUniqueName are the names of the Power Apps environment and organization used to create the link, respectively.

Here are some key facts to keep in mind when ingesting CDM data:

  • Azure Synapse Link for Dataverse exports data to Azure Synapse Analytics and Azure Data Lake Storage (ADLS) Gen2 in the Common Data Model (CDM) format.
  • The Azure Synapse Link for Dataverse with ADLS Gen2 creates a link between Dataverse and a Storage Account in the same tenant and region of the Power Apps environment.
  • The service writes the data files to a container named dataverse-environmentName-organizationUniqueName.
  • The metadata file (model.json) provides a list of tables exported to the data lake.
  • A folder for each table includes near real-time data and read-only snapshot data.
  • A folder (Microsoft.Athena.TrickleFeedService) contains the option set files, one for each table.
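As a small sketch, the paths the link writes to can be assembled as follows; the environment, organization, and storage account names here are hypothetical placeholders:

```python
# Hypothetical names; substitute those of your own environment.
environment_name = "contosodev"
organization_unique_name = "unq123"
storage_account = "mystorageacct"

# Container the Dataverse link writes to
container = f"dataverse-{environment_name}-{organization_unique_name}"

# ABFSS paths to the metadata file and a per-table data folder
base = f"abfss://{container}@{storage_account}.dfs.core.windows.net"
model_json_path = f"{base}/model.json"
account_folder = f"{base}/account"  # one folder per exported table
```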

Table Data Folders


Table Data Folders are a crucial aspect of managing data, and understanding how they work is essential for effective data management.

The Dataverse Link service creates a data folder for each table, which contains two types of data files: near real-time data files and snapshot data files.

Near real-time data files are incrementally synchronized from the Dataverse by detecting what data has changed since the initial extraction or the last synchronization.

Snapshot data files are a read-only copy of the data updated every hour, with only the latest five snapshot files retained.

Dataverse data can continuously change through CRUD transactions, making snapshot files a reliable point in time version of the data.

However, near real-time files are continuously updated by the trickle feed engine of Dataverse, requiring an analytics consumer that supports the reading of files modified while the query is running.

Here's a key point to note: a snapshot file is refreshed each hour only if the corresponding near real-time file has been updated.


The Dataverse engine also updates the model.json file to point to these snapshots and removes stagnant snapshot files to retain only the latest five.

As the CSV file has no header, you must refer to the schema stored in the model.json file to infer the correct schema.
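A minimal sketch of that lookup in plain Python, assuming a simplified model.json structure (the real file carries considerably more metadata per entity and attribute):

```python
import json

# Simplified, hypothetical model.json content
model_json = """
{
  "name": "cdm",
  "entities": [
    {
      "name": "account",
      "attributes": [
        {"name": "accountid", "dataType": "guid"},
        {"name": "name", "dataType": "string"},
        {"name": "createdon", "dataType": "dateTime"}
      ]
    }
  ]
}
"""

model = json.loads(model_json)
entity = next(e for e in model["entities"] if e["name"] == "account")

# Column names to apply as the header when reading the table's CSV files
columns = [a["name"] for a in entity["attributes"]]
```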

Here's a summary of what happens when changes are made to the table:

  • A new column is added, and it's appended at the end of the row, with only the new rows showing the newly added column.
  • A column is deleted, and the resulting column isn’t dropped from the file but is preserved for the existing rows and marked as null (empty) for the new rows.
  • A data type change is a breaking change, requiring you to unlink and relink the table and then reprocess all the data.
  • Dataverse uses different date and time formats depending on the field, so be sure to check each field's format and handle it accordingly.
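The first two behaviors mean rows within one file can have different widths. A simple way to align them with the current schema, sketched in plain Python with hypothetical data, is to pad missing trailing columns with nulls:

```python
# Current schema from model.json; "industry" was added after some rows
# were already written, so older rows are one column short.
schema = ["accountid", "name", "industry"]

rows = [
    ["a1", "Contoso"],             # written before the column existed
    ["b2", "Fabrikam", "Retail"],  # written after
]

# Pad each row with None so every row matches the schema width
aligned = [r + [None] * (len(schema) - len(r)) for r in rows]
```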

Data Partitioning

Data Partitioning is a key aspect of managing data in Azure Synapse Link. By default, tables are partitioned by month, but you can also choose to partition them yearly based on volume and data distribution.

Partitioning configuration is a per-table setting. This means you have control over how your data is organized and stored.

The data is written in several files instead of a single one, based on the createdOn value on each row in the source. This helps distribute the data evenly and makes it easier to manage.

For tables without the createdOn attribute, each partition contains 5,000,000 records. This can help you plan and prepare for data storage needs.


A new file is created when the partition file approaches 95% of the maximum number of blocks in a blob (50,000), appending a three-digit numeric suffix (_001, _002, and so on).
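To illustrate the naming convention (with hypothetical file names), a reader can collect the base partition file together with its rollover files like this:

```python
import re

# Hypothetical listing of one table's data folder
files = [
    "2024-01.csv",
    "2024-01_001.csv",
    "2024-01_002.csv",
    "2024-02.csv",
]

# Match the base partition file plus its three-digit rollover suffixes
pattern = re.compile(r"^2024-01(_\d{3})?\.csv$")
january_files = [f for f in files if pattern.match(f)]
```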

To avoid duplicated rows due to network delays, you can use the SinkModifiedOn attribute to remove duplicates and keep the latest row. This is especially useful when dealing with missing acknowledgments from ADLS.
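A minimal sketch of that deduplication in plain Python; the key and column names are illustrative, and in Databricks you would typically use a window function partitioned by the key and ordered by SinkModifiedOn descending:

```python
# Rows as exported, possibly containing duplicates for the same Id
rows = [
    {"Id": "a1", "SinkModifiedOn": "2024-01-01T10:00:00Z", "name": "old"},
    {"Id": "a1", "SinkModifiedOn": "2024-01-02T09:00:00Z", "name": "new"},
    {"Id": "b2", "SinkModifiedOn": "2024-01-01T11:00:00Z", "name": "only"},
]

# Keep only the latest row per Id; ISO-8601 UTC timestamps
# compare correctly as plain strings.
latest = {}
for row in rows:
    kept = latest.get(row["Id"])
    if kept is None or row["SinkModifiedOn"] > kept["SinkModifiedOn"]:
        latest[row["Id"]] = row

deduped = list(latest.values())
```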

In-Place Updates vs. Append-Only

In-place updates and append-only are two configuration options for Azure Synapse Link tables.

In-place update is the default setting for tables with the createdOn attribute, and it performs an in-place upsert of the incremental data in the destination.

This means it scans the partition files to update or remove existing rows and append newly inserted ones.

Append-only, on the other hand, always appends the create, update, and delete changes at the end of the relative partition file.

It's the default setting for tables without the createdOn attribute, and it also defaults to a partition strategy of Year, which cannot be modified.



The documentation recommends using Append-only mode to perform incremental ingestion to another target, enabling AI and ML scenarios.

Here's a summary of the key differences between in-place updates and append-only:

  • In-place update: Default for tables with createdOn attribute, performs in-place upsert of incremental data
  • Append-only: Default for tables without createdOn attribute, appends create, update, and delete changes at end of partition file
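The append-only pattern can be sketched as replaying a change log into a target store; the op field and record shape below are hypothetical, and in Databricks this maps naturally onto a Delta Lake MERGE:

```python
# Hypothetical append-only change log for one table
changes = [
    {"op": "create", "Id": "a1", "name": "Contoso"},
    {"op": "update", "Id": "a1", "name": "Contoso Ltd"},
    {"op": "create", "Id": "b2", "name": "Fabrikam"},
    {"op": "delete", "Id": "b2"},
]

# Replay the log in order: creates/updates upsert, deletes remove
target = {}
for change in changes:
    if change["op"] == "delete":
        target.pop(change["Id"], None)
    else:
        target[change["Id"]] = {"Id": change["Id"], "name": change["name"]}
```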

Data Management

Data Management is a crucial aspect of any organization, and having the right tools and strategies in place can make all the difference. One of the key benefits of using Dataverse is that it allows you to easily manage both metadata and data in the cloud, so you don't have to worry about the technical details.

Dataverse provides a secure storage option for your data, with role-based security that allows you to control access to tables for different users within your organization. This means you can ensure that sensitive data is only accessible to those who need it.

Logic and validation are also key features of Dataverse, allowing you to define business rules, workflows, and business process flows to ensure data quality and drive business processes. For example, you can create business rules to validate data across multiple columns and tables, and provide warning and error messages to users.



To give you a better idea of the benefits of Dataverse, here are some of the key features:

  • Easy to manage – Both the metadata and data are stored in the cloud.
  • Easy to secure – Data is securely stored so that users can see it only if you grant them access.
  • Access your Dynamics 365 Data – Data from your Dynamics 365 applications is also stored within Dataverse.
  • Rich metadata – Data types and relationships are used directly within Power Apps.
  • Logic and validation – Define calculated columns, business rules, workflows, and business process flows to ensure data quality and drive business processes.
  • Productivity tools – Tables are available within the add-ins for Microsoft Excel to increase productivity and ensure data accessibility.

Terminology Updates

As part of our efforts to improve the usability of Dataverse, we've updated some terminology to make it more intuitive and productive.

These updates were made in response to customer feedback and user research, and they're being rolled out across Microsoft Power Platform.

Dataverse now uses the term "table" instead of "entity", and "column" instead of "field" or "attribute".

The term "row" is now used instead of "record", and "choice" replaces "option set" or "picklist".

Here's a summary of the terminology updates:

  • Entity → Table
  • Field/Attribute → Column
  • Record → Row
  • Option set/Picklist → Choice

It's worth noting that these terminology updates don't apply to APIs or messages in the Dataverse web services.

Frequently Asked Questions

What is a Dataverse in Azure?

A Dataverse in Azure is a secure storage system for business data, organized into tables with rows and columns. It's a centralized hub for managing and accessing data used by various business applications.

What is the difference between Azure Dataverse and data Lake?

Azure Data Lake Storage is faster for large data transactions because its storage process is simpler, whereas Dataverse applies more validation rules on each write. This results in a higher write speed when ADLS is the destination.

Is Dataverse an Azure SQL database?

No, Dataverse is not just an Azure SQL database, it's a more comprehensive platform that goes beyond traditional database capabilities. Learn more about what sets Dataverse apart from a standard database.

Is Dynamics 365 the same as Dataverse?

Dynamics 365 and Dataverse are related but distinct components, with Dynamics 365 apps relying on Dataverse for data management and Dataverse benefiting from Dynamics 365 functionality. While they are connected, they serve different purposes in the Microsoft ecosystem.

Calvin Connelly

Senior Writer
