Azure Data Catalog is a powerful tool for enterprise data management, allowing you to catalog, manage, and govern your data assets.
It helps you discover, understand, and use your data more effectively, reducing data waste and improving data-driven decision making. Azure Data Catalog supports over 100 data sources, including relational databases, big data stores, and cloud-based services.
With Azure Data Catalog, you can create a centralized repository of metadata, making it easier to find and use the right data for your business needs. This helps reduce data duplication and improves data quality.
By using Azure Data Catalog, you can also automate data discovery and classification, which can save you a significant amount of time and effort.
What Is Azure Data Catalog?
Azure Data Catalog is a cloud-based service that helps you discover, understand, and manage your data assets.
It allows you to catalog and classify your data sources, giving you a single source of truth for your organization's data.
Azure Data Catalog relies on Azure Active Directory for authentication, which means you can use your existing Azure AD credentials to access the service.
This integration makes it easy to manage permissions and access to your data catalog.
By using Azure Data Catalog, you can improve data governance, reduce data silos, and increase collaboration across your organization.
Azure Data Catalog provides a centralized location for data assets, making it easier to find and understand the data you need.
It also enables you to create data dictionaries, which are essentially metadata repositories that store information about your data.
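To make the idea of a data dictionary more concrete, here is a minimal Python sketch of what one entry might hold. The class and field names (`TableEntry`, `ColumnEntry`, `undocumented_columns`) are hypothetical and are not part of Azure Data Catalog; they only illustrate the kind of metadata such a repository stores.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ColumnEntry:
    """Metadata for one column (hypothetical structure)."""
    name: str
    data_type: str
    description: str = ""


@dataclass
class TableEntry:
    """Metadata for one table, identified by its source and name."""
    source: str
    name: str
    description: str = ""
    columns: list[ColumnEntry] = field(default_factory=list)

    def undocumented_columns(self) -> list[str]:
        # Columns that still lack a description.
        return [c.name for c in self.columns if not c.description]


customer = TableEntry(
    source="sales-db",  # made-up source name
    name="D_CUSTOMER",
    description="One row per customer.",
    columns=[
        ColumnEntry("customer_id", "int", "Surrogate key."),
        ColumnEntry("email", "nvarchar(255)"),
    ],
)
```

Calling `customer.undocumented_columns()` lists the columns that still need a description, which is the kind of gap a shared catalog is meant to surface.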
Use Cases
Azure Data Catalog can serve many users for a variety of purposes, but the two most common uses are centralizing information about data and supporting business intelligence (BI).
As organizations grow, the amount of data collected can quickly become difficult to manage. A sprawling data inventory is harder to organize and less useful, since many people may not even know that a given dataset exists.
By registering data in Data Catalog, organizations can ensure that data is available to all relevant business units. This can help ensure that organizations can benefit from the shared knowledge and efforts of all their users and analysts.
Developing business intelligence requires the combination of many sources of data, including those not created for BI or analysis. Data sources are often distributed, making it harder to gather, standardize, or apply data to BI purposes.
Aggregating data with Azure Data Catalog can enable analysts to skip some or most of the manual work BI typically requires. Analysts can collaborate with both internal and external teams to identify sources and ensure that data is accurate and relevant.
Azure Data Catalog can also help ensure that multiple analyses aren’t required and that business units are all working from the same insights. This enables end users to contribute to and improve upon data which can then be used to refine BI.
How to Get Started
To get started with Azure Data Catalog, you can create it like any other Azure resource through the Azure portal. Go to the portal, search for Data Catalog, and enter a name for your data catalog.
You'll need to specify the subscription name, the location for the catalog, and the pricing tier, which is either free or standard edition. Then select Create.
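The settings collected during creation can be pictured as a small record that is checked before you select Create. The sketch below is only an illustration in Python: `CatalogSettings` and its `validate` method are hypothetical names, and this is not how the Azure portal or any Azure SDK actually represents a catalog.

```python
from dataclasses import dataclass

# The two tiers mentioned above.
PRICING_TIERS = {"free", "standard"}


@dataclass(frozen=True)
class CatalogSettings:
    """Hypothetical record of the values entered when creating a catalog."""
    name: str
    subscription: str
    location: str
    pricing_tier: str = "free"

    def validate(self):
        """Return a list of problems; an empty list means the settings look complete."""
        problems = []
        for field_name in ("name", "subscription", "location"):
            if not getattr(self, field_name).strip():
                problems.append(f"{field_name} is required")
        if self.pricing_tier.lower() not in PRICING_TIERS:
            problems.append(f"unknown pricing tier: {self.pricing_tier}")
        return problems


# Example values are made up.
ok = CatalogSettings("contoso-catalog", "Pay-As-You-Go", "westeurope", "Standard")
bad = CatalogSettings("", "Pay-As-You-Go", "westeurope", "premium")
```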
The Data Catalog is a fully managed cloud service that acts as a central shared place in an organization for developers, analysts, data scientists, and users to contribute their knowledge and help to locate, understand, and consume data.
To publish data, go to the Azure Data Catalog home page and select Publish Data. Alternatively, you can go to the Azure Data Catalog provision page and enter the Data Catalog Name, the subscription you want to use, and the location for the catalog.
Select one of the two pricing editions and keep the defaults for the remaining settings. Your account is automatically added as a catalog user and an administrator, and you can add further catalog users and catalog administrators to the catalog.
Registration and Source Management
To register a data source in Azure Data Catalog, click "New" and connect to the designated database, such as Azure SQL Database. You can select multiple sources to scan.
You can also launch the desktop application to register your data sources; it is a ClickOnce application that makes the process easier. The "Launch Application" option is recommended over "Create Manual Entry" for larger data sources.
Once you've registered your data source, you can select the tables to scan and schedule the scan to occur periodically or run it once. After the scan is finished, Azure Data Catalog will have identified the metadata and classification of the data assets.
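The choice between a one-off scan and a periodic scan can be sketched in a few lines of Python. `ScanSchedule` and `next_run` are hypothetical names used for illustration; the real scheduling is configured in the catalog UI, and this sketch only shows the underlying idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ScanSchedule:
    """Hypothetical scan schedule: run once, or repeat at a fixed interval."""
    start: datetime
    interval: Optional[timedelta] = None  # None means "run once"

    def next_run(self, now: datetime) -> Optional[datetime]:
        """Return the first run at or after `now`, or None if nothing is left to run."""
        if now <= self.start:
            return self.start
        if self.interval is None:
            return None  # a one-off scan whose start time has passed
        # Whole intervals elapsed since the start, rounded up.
        elapsed = now - self.start
        periods = -(-elapsed // self.interval)
        return self.start + periods * self.interval


weekly = ScanSchedule(start=datetime(2024, 1, 1), interval=timedelta(days=7))
once = ScanSchedule(start=datetime(2024, 1, 1))
```

A periodic schedule keeps producing a next run date, while a one-off schedule returns `None` once its start time has passed.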
Source Registration
You can register and scan data sources in Azure Data Catalog by clicking on New and connecting to your designated database.
Multiple sources can be selected to scan; Azure SQL Database is one example.
To create a new scan, select the tables you want to scan, then schedule the scan to run periodically or let it run once.
A schedule is preferable when the data source changes often. When a scan finishes, Azure Data Catalog will have identified the metadata and the classification of the data assets.
Classifications are applied to formal data, usually data that is government-regulated or follows a fixed format, such as an email address or phone number.
You can also add custom classifications for your own data. After adding a new classification, run the scan again so that it is applied to the data.
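Format-based classification can be approximated with regular expressions. The following Python sketch is an assumption about how such a classifier could work, not the catalog's actual implementation: the patterns are deliberately simple, and the names `CLASSIFIERS`, `classify_column`, and the "Customer Code" label are invented for the example.

```python
import re

# Illustrative patterns only; production classifiers are far more thorough.
CLASSIFIERS = {
    "Email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "Phone Number": re.compile(r"^\+?[\d\s().-]{7,20}$"),
}


def classify_column(samples, classifiers=CLASSIFIERS, threshold=0.8):
    """Return labels whose pattern matches at least `threshold` of the non-empty samples."""
    values = [s.strip() for s in samples if s and s.strip()]
    if not values:
        return []
    labels = []
    for label, pattern in classifiers.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            labels.append(label)
    return labels


# A custom classification is just one more pattern; classifying again picks it up.
custom = dict(CLASSIFIERS)
custom["Customer Code"] = re.compile(r"^CUST-\d{6}$")
```

With the default patterns, a column of values like `CUST-000123` gets no label; passing `custom` as the second argument and classifying again yields the new label, which mirrors the rescan step described above.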
To register your data sources, you can use either the Launch Application option or the Create Manual Entry option.
However, I personally prefer the Launch Application option, as it is a ClickOnce application and easier to use.
Once installed, you'll be brought to the Sign in page, where you can sign in using the same credentials you use to access the catalog in the portal.
To register a data source, expand your database and select the objects you want to register in your data catalog.
You can select multiple objects at once by using the double right arrow (>>), and you can enable the preview option so that sample data can be previewed later.
After registering your objects, you can click on VIEW PORTAL to discover your data and register more objects if needed.
Content Management
Content management is a breeze once a source such as Azure SQL Database has been scanned. You can consult, search, and edit content with ease.
To start, each asset from the SQL database can be annotated with additional information such as a description, owner, expert, and more. Click on any asset to open a new screen where you can review and add information.
Let's take a closer look at the D_CUSTOMER table. By clicking edit, you can add a description and classifications. The hierarchy and origin of the table are also shown.
The Schema tab gives an overview of all the fields, and you can edit and add a description or custom classification. Fields recognized by the default classification are already set with the corresponding classification.
The Related tab provides an overview of all relationships to the selected object. The selected object sits in the middle, with links to all related objects, such as schemas, tables, columns, and more.
On the homepage, you'll find a search bar where you can fill in some key words and quickly get a list of all objects containing that keyword. This is a great way to find an object if you're not sure which source to consult.
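Conceptually, keyword search is a match over the metadata stored for each asset. The Python sketch below shows that idea on a small in-memory list; `search_catalog` and the sample assets are hypothetical, and the real search runs inside the service rather than in your own code.

```python
def search_catalog(assets, query):
    """Return names of assets whose name, description, or tags contain every keyword."""
    keywords = query.lower().split()
    results = []
    for asset in assets:
        haystack = " ".join(
            [asset["name"], asset.get("description", ""), *asset.get("tags", [])]
        ).lower()
        if all(keyword in haystack for keyword in keywords):
            results.append(asset["name"])
    return results


# Made-up sample assets for illustration.
assets = [
    {"name": "D_CUSTOMER", "description": "Customer dimension", "tags": ["sales"]},
    {"name": "F_ORDERS", "description": "Order facts per customer", "tags": ["sales"]},
    {"name": "D_PRODUCT", "description": "Product dimension", "tags": ["inventory"]},
]
```

Because the match covers descriptions and tags as well as names, a search for "customer" also finds the orders table, which is why good annotations make assets easier to discover.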
Security and Features
Azure Data Catalog takes security seriously, and it shows in its robust feature set. One of the key features is Role Based Access Control, which consists of five distinct roles.
These roles are designed to manage permission to register and scan data sources, as well as define who can contribute to the Data Catalog. The five roles are:
- Catalog Administrator: Call all APIs on the catalog; is not an owner
- Data Source Administrator: Responsible for setting up scans
- Curator: Responsible for editing content
- Contributor: Read-only access
- Automated data source process: currently primarily used for ADF to push lineage into the catalog.
Each role has its own set of permissions, ensuring that only authorized users can perform specific tasks. This level of control is crucial for maintaining data integrity and security.
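One way to picture role-based access control is as a mapping from roles to permissions. In the Python sketch below, the role names mirror the list above, but the permission names (`read`, `edit_content`, `set_up_scans`, `push_lineage`) and the exact grants per role are assumptions made for illustration, not the service's real permission model.

```python
# Hypothetical permission names; only the role names come from the list above.
ALL_PERMISSIONS = {"read", "edit_content", "set_up_scans", "push_lineage"}

ROLE_PERMISSIONS = {
    "Catalog Administrator": set(ALL_PERMISSIONS),
    "Data Source Administrator": {"read", "set_up_scans"},
    "Curator": {"read", "edit_content"},
    "Contributor": {"read"},
    "Automated data source process": {"push_lineage"},
}


def is_allowed(roles, permission):
    """Return True if any of the user's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)
```

A user can hold several roles, so the check passes as soon as one of them grants the requested permission; unknown roles grant nothing.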
Key Concepts and How-To
To get the most out of Azure Data Catalog, you need to understand its key concepts. Data discovery is a crucial aspect, making data searchable and available to users. This ensures that all registered data is discoverable.
Data understanding is also vital, as it makes data interpretable by including metadata and descriptions of the dataset content or format. Data consumption is the use of data by users, which can include different modes for data access and ingestion.
Data users fall into two main groups: producers and consumers. Producers are responsible for creating, registering, and maintaining data, while consumers use data for reporting, analysis, or distribution purposes.
To utilize Azure Data Catalog, you first need to create a Data Catalog account as a Resource in the Azure Subscription. Then, you can launch the Data Catalog account and manage your data sources.
Here are the key steps to register and scan a data source in Azure Data Catalog:
- Create a Data Catalog account as a Resource in the Azure Subscription.
- Launch the Data Catalog account and go to the external portal.
- Select "Manage your data" to go to the management of data sources.
To discover and annotate data sources in Azure Data Catalog, you can add a friendly name, description, and expert information in the Properties tab. You can also add meaningful descriptions and tags to all the columns present in the table in the Columns tab. Additionally, you can add documentation related to the data asset in the Documentation tab to provide a complete and detailed explanation.
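The annotations described above can be modeled as a small record per asset. This Python sketch is a hypothetical structure (`AssetAnnotations`, `tag_column`, and `missing` are invented names), meant only to show how properties, column tags, and documentation fit together and how you might spot what is still unfilled.

```python
from dataclasses import dataclass, field


@dataclass
class AssetAnnotations:
    """Hypothetical container for the annotations on one data asset."""
    friendly_name: str = ""
    description: str = ""
    experts: list = field(default_factory=list)
    column_tags: dict = field(default_factory=dict)  # column name -> set of tags
    documentation: str = ""

    def tag_column(self, column, *tags):
        # Tags accumulate, so several people can contribute to the same column.
        self.column_tags.setdefault(column, set()).update(tags)

    def missing(self):
        """Return the names of top-level annotations that are still empty."""
        values = {
            "friendly_name": self.friendly_name,
            "description": self.description,
            "experts": self.experts,
            "documentation": self.documentation,
        }
        return [name for name, value in values.items() if not value]


notes = AssetAnnotations(friendly_name="Customers")
notes.experts.append("data-team@example.com")  # made-up contact
notes.tag_column("email", "pii")
notes.tag_column("email", "contact")
```

Here `notes.missing()` reports that the description and documentation are still empty, a simple check that could guide which assets to annotate next.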
Challenges and Best Practices
The Azure Data Catalog has its share of challenges. The UI is quite simple, but complex tasks require a more streamlined approach to prevent clutter.
A better data cataloging approach is necessary to make things more accessible. This could involve reorganizing the UI to prioritize essential features.
The Azure cloud service can be slow when uploading large datasets, and the uploading time is not consistent. To improve this, the cloud server's capacity and the connection between the client and server need to be enhanced.
Regular updates are crucial to ensure the client receives bug fixes and new features. However, Azure Data Catalog has not been updated in a while, which is a concern.
Better integration of Power BI with Azure Data Catalog is also necessary to improve the overall user experience.
What Are the Challenges?
The Azure Data Catalog has some challenges that need to be addressed. The UI is too simple, which can make the complex task of cataloging data feel overwhelming.
One of the main issues is the slow uploading speed of large datasets to the Azure cloud service. This can be frustrating, especially when the time it takes to upload is inconsistent.
Getting everything cataloged can be a daunting task, which is why a better data cataloging approach is needed. It's like trying to organize a huge library - it's a lot of work!
The Azure Data Catalog hasn't been updated in a while, which can make users feel like they're not getting the support they need. Regular updates are essential to fix bugs and add new features.
There's also a need for better integration of Power BI with the Azure Data Catalog. This would make it easier to use both tools together.
Here are the main challenges in a nutshell:
- The UI is too simple.
- The Azure cloud service is slow.
- Getting everything cataloged is a challenge.
- The Azure Data Catalog needs regular updates.
- Better integration with Power BI is needed.
Secondary Challenges
Data lineage, the record of where data originates and how it flows to its destination, is crucial, but the underlying metadata can change frequently, which makes manual updates in Azure Data Catalog tedious.
Metadata updates should be scheduled to keep the data catalog up-to-date.
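A scheduled update mostly comes down to comparing the metadata captured last time with what the source looks like now. The Python sketch below shows one possible comparison over column-to-type snapshots; `diff_schema` and the sample columns are hypothetical, and this is not a feature the catalog exposes.

```python
def diff_schema(old, new):
    """Compare two {column: type} snapshots and report what changed."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    retyped = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {"added": added, "removed": removed, "retyped": retyped}


# Made-up snapshots of the same table at two points in time.
last_scan = {"id": "int", "email": "varchar(100)", "fax": "varchar(20)"}
current = {"id": "int", "email": "varchar(255)", "phone": "varchar(20)"}

changes = diff_schema(last_scan, current)
```

Running a comparison like this on a timer would flag only the columns that need attention, instead of someone re-checking every asset by hand.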
Azure Data Catalog only supports a single catalog per organization, which can be limiting for growing organizations with multiple data sources.
The number of data catalogs per organization should be increased to accommodate different data sources.
Azure Data Catalog currently supports Azure Data Lake Storage Gen1, which is deprecated, instead of supporting Azure Data Lake Storage Gen2.
Snowflake is a widely used database, and many clients are using it, so it would be beneficial for Azure Data Catalog to support metadata for Snowflake.
A backup feature would be handy in case of data overwriting or removal, allowing the data catalog to be restored to a previous point.
The Data Catalog REST API has limits on the size of the asset root, the number of annotations, and the asset's overall size. In addition, deleting an asset deletes all of its associated annotations.
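The effect of these limits, and of the cascading delete, can be illustrated with a small in-memory model. Everything in this Python sketch is an assumption: `CatalogStore`, `LimitExceeded`, and the limit values are invented, and the code does not call the real REST API or reflect its actual thresholds.

```python
import json


class LimitExceeded(ValueError):
    """Raised when an asset would exceed a configured limit."""


class CatalogStore:
    """Hypothetical in-memory store that enforces per-asset limits."""

    def __init__(self, max_annotations, max_asset_bytes):
        self.max_annotations = max_annotations
        self.max_asset_bytes = max_asset_bytes
        self.assets = {}  # asset_id -> {"root": dict, "annotations": list}

    @staticmethod
    def _size(obj):
        # Approximate the payload size as UTF-8 encoded JSON.
        return len(json.dumps(obj).encode("utf-8"))

    def register(self, asset_id, root):
        asset = {"root": root, "annotations": []}
        if self._size(asset) > self.max_asset_bytes:
            raise LimitExceeded("asset too large")
        self.assets[asset_id] = asset

    def annotate(self, asset_id, annotation):
        asset = self.assets[asset_id]
        if len(asset["annotations"]) >= self.max_annotations:
            raise LimitExceeded("too many annotations")
        candidate = {**asset, "annotations": asset["annotations"] + [annotation]}
        if self._size(candidate) > self.max_asset_bytes:
            raise LimitExceeded("asset too large")
        asset["annotations"].append(annotation)

    def delete(self, asset_id):
        # Deleting the asset drops its annotations with it.
        removed = self.assets.pop(asset_id)
        return len(removed["annotations"])


# Limit values are arbitrary and chosen only for the example.
store = CatalogStore(max_annotations=2, max_asset_bytes=1000)
store.register("t1", {"name": "D_CUSTOMER"})
store.annotate("t1", {"tag": "sales"})
store.annotate("t1", {"description": "Customer dimension"})
```

A third annotation on `t1` would raise `LimitExceeded`, and `store.delete("t1")` removes both annotations along with the asset, so anything worth keeping should be exported before an asset is deleted.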
Frequently Asked Questions
What is Azure Data Catalog?
Azure Data Catalog is a centralized metadata repository that helps discover and understand data sources across an organization. It's a cloud-based service that simplifies data asset discovery and management for analysts, scientists, and developers.
What is included in a data catalog?
A data catalog includes metadata, data management tools, and search functionality to help users find and evaluate data. It serves as a centralized inventory of available data, making it easier to discover and utilize relevant information.
What is a Datastore in Azure?
In Azure Machine Learning, a datastore holds the connection information for an Azure storage service, allowing you to access and manage your storage resources easily. This simplifies connecting to Azure storage services because you do not need to remember complex connection details.
Sources
- https://www.xenonstack.com/blog/azure-data-catalog
- https://bluexp.netapp.com/blog/azure-anf-blg-azure-data-catalog-understanding-concepts-and-use-cases
- https://www.element61.be/en/resource/get-more-value-your-enterprise-data-assets-azure-data-catalog-v2
- https://dataedo.com/sources/microsoft-azure
- https://www.sqlshack.com/getting-started-with-azure-data-catalog/