Data modelling in Azure for scalable analytics is a crucial step in using the cloud for data-driven decision making. This approach lets organizations handle large volumes of data and scale their analytics capabilities as needed.
To achieve this, Azure provides a robust set of tools and services that support data modelling, including Azure Synapse Analytics and Azure Cosmos DB. These services offer flexible and scalable data storage and processing options, allowing organizations to choose the best approach for their specific needs.
With Azure Synapse Analytics, organizations can integrate data from various sources and perform complex analytics, while Azure Cosmos DB provides a globally distributed, multi-model database that supports high-throughput and low-latency data access. By leveraging these services, organizations can unlock new insights and drive business growth.
The result is a scalable analytics platform that can handle large volumes of data and support real-time analytics, enabling organizations to make data-driven decisions and stay ahead of the competition.
Data Modeling Basics
Highly normalized data models are generally not recommended for Azure Databricks: queries need more joins, and there are more keys to keep in sync, which can slow performance.
In a nonrelational database, data is stored denormalized and optimized for queries and writes. This means that data is organized in a way that makes it easy to access and retrieve the information you need.
To determine the best data model for your needs, consider how your data will be accessed. Ask yourself: are columns frequently queried together, and are they updated together? If so, it may make sense to denormalize your data to improve performance.
Here are some factors to consider when deciding whether to normalize or denormalize your data:
- How is your data accessed?
- Are columns called together frequently in queries?
- Are columns updated together?
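To make that trade-off concrete, here is a minimal sketch of a denormalized document. The schema and field names are hypothetical, not taken from the referenced docs; the point is that values read together are stored together.

```python
# Hypothetical order document: properties that are queried and updated
# together live in the same document, so a single point read returns
# everything the order page needs -- no joins required.
customer_order = {
    "id": "order-1001",
    "customerId": "cust-42",
    # Denormalized copy of the customer's display name: it is read with
    # every order but changes rarely, so duplicating it here is cheap.
    "customerName": "Jane Doe",
    "items": [
        {"sku": "dog-bed-l", "quantity": 1, "unitPrice": 49.99},
        {"sku": "chew-toy", "quantity": 3, "unitPrice": 4.50},
    ],
    "orderTotal": 63.49,
}

# Everything needed to render the order is already in one document.
print(customer_order["customerName"], customer_order["orderTotal"])
```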
When to Reference
Knowing when to reference data in your model is crucial for efficient performance. You should use normalized data models when representing one-to-many relationships.
In the context of a document database, this means referencing a stock item on a portfolio instead of embedding it, as shown in Example 1. This approach improves the efficiency of write operations, which happen frequently throughout the day.
Use normalized data models when related data changes frequently. This is because referencing data allows you to update only the single document that needs to be updated, rather than updating multiple documents.
Representing many-to-many relationships is another scenario where normalized data models shine. This is because referencing data enables you to connect multiple entities without having to embed all the data in one document.
Referenced data could be unbounded, meaning it could grow indefinitely. In such cases, referencing data is a better approach to avoid performance issues.
Here are some scenarios where you should use normalized data models:
- Representing one-to-many relationships.
- Representing many-to-many relationships.
- Related data changes frequently.
- Referenced data could be unbounded.
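The stock-portfolio scenario mentioned above can be sketched as two document shapes that are linked by ID rather than embedded. This is an illustrative sketch with assumed field names: when a price changes, only the single stock document needs updating, not every portfolio that holds it.

```python
# Stock documents are updated frequently throughout the trading day.
stock_zbzz = {
    "id": "stock-zbzz",
    "symbol": "zbzz",
    "open": 1.0,
    "high": 2.0,
    "low": 0.5,
    "current": 1.8,
}

# The portfolio only references stocks by id, so a price update touches
# one stock document instead of every portfolio holding that stock.
portfolio = {
    "id": "portfolio-1",
    "owner": "investor-7",
    "holdings": [
        {"stockId": "stock-zbzz", "quantity": 100},
        {"stockId": "stock-jjjj", "quantity": 25},
    ],
}
```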
Foreign Keys
Foreign keys are an essential part of data modeling, and it's crucial to understand how they work in the context of your database.
In Azure Cosmos DB, foreign keys are essentially "weak links" between documents: the database itself won't verify them. You need to ensure that the data a document refers to actually exists, either in your application logic or through server-side triggers or stored procedures.
To represent entity relationships, foreign keys are used as "relationship keys" with the suffix "SK". These keys can be either integer or GUID data types, providing a way for external tools to join entities.
Date properties have corresponding integer date key properties in the format YYYYMMDD, making it easy to query and manage related data.
One-to-few relationships between entities are a good opportunity to use embedded data models instead of foreign keys. For one-to-many or many-to-many relationships, however, normalizing your data model and referencing by key is the better approach.
In short, relationship keys carry the "SK" suffix and are integer or GUID values, while date keys are integers in the YYYYMMDD format.
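As an illustration of how these keys might look on a work item record, here is a hedged sketch. The property names (AreaSK, IterationSK, CreatedDateSK, and so on) are approximations to verify against the Analytics data model reference, and the values are placeholders.

```python
# Hedged sketch of relationship keys and date keys on a work item record.
work_item = {
    "WorkItemId": 12345,
    "Title": "Fix login timeout",
    # "SK" relationship keys: GUIDs (or integers) that external tools can
    # use to join this record to the Area, Iteration, and Project entities.
    "AreaSK": "d867c4b2-1f11-4e6e-9f2a-0a1b2c3d4e5f",
    "IterationSK": "7e2f0a64-5c3d-4b8a-9e1f-2a3b4c5d6e7f",
    "ProjectSK": "0f1e2d3c-4b5a-6978-8a9b-0c1d2e3f4a5b",
    # Date properties have corresponding integer date keys in YYYYMMDD form.
    "CreatedDate": "2024-01-15T09:30:00Z",
    "CreatedDateSK": 20240115,
}
```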
Flattening
Flattening is a crucial aspect of data modeling, especially when working with nested data structures. In the context of Azure Cosmos DB, all properties in the root level of your data will be represented as a column in the analytical store.
Nested structures can be a challenge, as they demand extra processing from Azure Synapse runtimes to flatten the data into a structured format. This can be a significant issue in big data scenarios.
For example, a document whose root holds only an id and a nested contactDetails object will have just two columns in the analytical store, id and contactDetails. Nested values such as email and phone then require extra processing through SQL functions to be read individually, which can be time-consuming and may impact performance.
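Here is a minimal sketch of that situation, with illustrative values rather than the exact documents from the source. The flattened variant promotes the nested fields to the root so each one becomes its own column.

```python
# Nested shape: only the root-level properties become analytical store
# columns (id and contactDetails); email and phone stay buried inside
# contactDetails and need JSON functions to be read individually.
nested_doc = {
    "id": "customer-1",
    "contactDetails": {
        "email": "thomas@andersen.com",
        "phone": "+1 555 555-5555",
    },
}

# Flattened shape: every property sits at the root, so each one becomes
# its own column and no extra processing is needed at query time.
flattened_doc = {
    "id": "customer-1",
    "email": "thomas@andersen.com",
    "phone": "+1 555 555-5555",
}
```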
Composite Entities
Composite entities are a powerful tool in data modeling, but they require careful consideration to ensure optimal performance. They're composed from simpler entities, often requiring more computing resources to generate, and may return larger result sets.
When deciding whether to use a composite entity, think about the specific scenarios it's designed to support. For example, WorkItemSnapshot combines WorkItemRevisions and Dates to support trend reporting on a filtered set of work items.
To achieve the best performance, query the correct entity for your scenario. You shouldn't use a composite entity to query the current state of work items, for instance; use the WorkItems entity set instead, which produces a quicker-running query.
Some entities contain all historic values, while others contain only current values. For example, WorkItemRevisions contains all work item history, so you shouldn't use it in scenarios where only the current values are of interest.
Here are some key considerations for composite entities:
- They're composed from simpler entities.
- They often require more computing resources to generate.
- They may return larger result sets.
- They're designed to support specific scenarios.
- Query the correct entity for your scenario to achieve optimal performance.
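To illustrate the difference, here are two hedged sketches of Analytics OData queries: one against the simple WorkItems set for current state, and one against the composite WorkItemSnapshot set for trend data. The organization and project segments are placeholders, and the property names should be verified against the Analytics OData documentation.

```python
# Base URL for the Analytics OData endpoint ({org} and {project} are
# placeholders to substitute with your own values).
base = "https://analytics.dev.azure.com/{org}/{project}/_odata/v3.0-preview"

# Current state: query the simpler WorkItems entity set directly.
current_state_query = (
    base + "/WorkItems"
    "?$select=WorkItemId,Title,State"
    "&$filter=State ne 'Closed'"
)

# Trend data: query the composite WorkItemSnapshot set, which combines
# revisions with dates and therefore costs more to generate.
trend_query = (
    base + "/WorkItemSnapshot"
    "?$apply=filter(WorkItemType eq 'Bug')"
    "/groupby((DateValue,State), aggregate($count as Count))"
)
```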
Entity Relationships
Entity relationships are a crucial aspect of data modelling in Azure.
You can model many-to-many relationships in a relational database with join tables, but this approach can be inefficient.
Join tables are used to glue together records from other tables, but they can add extra queries to your database.
Loading an author with their books or loading a book with its author would require at least two extra queries against the database.
Consider simplifying your data model by embedding the related data directly into the document.
This approach can reduce the number of server round trips your application has to make, making it a more efficient choice.
By embedding the related data, you can load an author and immediately know which books they've written.
Conversely, if you have a book document loaded, you would know the IDs of the authors.
Modeling Many-to-Many Relationships
Modeling many-to-many relationships in Azure Cosmos DB can be tricky, but it's not as complicated as it seems. You might be tempted to replicate the join table approach from relational databases, but this can lead to extra queries against the database.
Loading an author with their books, or a book with its author, would require at least two extra queries against the database: one against the joining document and another to fetch the actual document being joined.
However, this is where things get interesting. Consider a data model where the author document directly contains the IDs of the books they've written, and vice versa. This saves that intermediary query against the join table, reducing the number of server round trips your application has to make.
In fact, if you have an author document, you immediately know which books they've written, and conversely if you have a book document loaded, you would know the IDs of the authors. This approach is more efficient and easier to work with.
Here are some key considerations to keep in mind:
- Denormalize the data to avoid extra queries.
- Embed entities when it makes more sense to do so.
- Consider the trade-offs of using JOINs in relational queries versus denormalizing the data in Azure Cosmos DB.
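A sketch of that author/book shape follows; the field names are illustrative, based on the pattern described above, and the IDs are placeholders.

```python
# Each author document carries the ids of the books they've written...
author = {
    "id": "author-a1",
    "name": "Thomas Andersen",
    "books": ["book-b1", "book-b2"],
}

# ...and each book document carries the ids of its authors, so neither
# direction needs an intermediary join document or a second lookup hop.
book = {
    "id": "book-b1",
    "title": "Modeling data in document databases",
    "authors": ["author-a1", "author-a2"],
}
```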
Data Modeling Techniques
In Azure Cosmos DB, data is stored denormalized and optimized for queries and writes, unlike relational databases where data is normalized and optimized for data integrity.
Normalization becomes much less important with Azure Synapse Link, because you can join between containers at query time using T-SQL or Spark SQL.
To determine the best data modeling technique for your Azure Cosmos DB, consider how your data is accessed and updated. Are columns called together frequently in queries, or are they updated together? This will help you decide whether to embed entities or use bridge tables.
Consider the Contoso Pet Supplies model, where data is organized by entity and bridge tables establish the relationships between entities. That approach may not be the best fit for Azure Cosmos DB, where denormalization is an important data modeling technique.
Here are some data modeling techniques to consider:
- Denormalization: Duplicate selected data across documents so queries don't need joins to assemble it.
- Embedding entities: Store related data together in a single document to reduce the number of reads and improve query performance.
- Bridge tables: Use bridge tables to establish relationships between entities, but be aware that this approach may not be the best for Azure Cosmos DB.
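As a rough illustration of the query-time join that Synapse Link enables, here is a hedged PySpark sketch. It assumes a Synapse Spark notebook where the spark session is predefined; the linked-service, container, and column names are placeholders for your own environment.

```python
# Read two Cosmos DB containers from the analytical store via Synapse Link.
orders = (
    spark.read.format("cosmos.olap")
    .option("spark.synapse.linkedService", "CosmosDbLinkedService")
    .option("spark.cosmos.container", "orders")
    .load()
)
customers = (
    spark.read.format("cosmos.olap")
    .option("spark.synapse.linkedService", "CosmosDbLinkedService")
    .option("spark.cosmos.container", "customers")
    .load()
)

orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

# The join happens at query time in Spark SQL, so the transactional data
# model does not have to be normalized to support it.
joined = spark.sql("""
    SELECT c.name, o.id AS orderId, o.orderTotal
    FROM orders o
    JOIN customers c ON o.customerId = c.id
""")
joined.show()
```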
Hybrid Models
Hybrid models are a great way to balance the pros and cons of embedded and referenced data models. They allow you to mix and match different approaches to suit your application's specific needs. This can lead to simpler application logic and fewer server round trips.
By combining embedded and referenced data, you can create a hybrid model that optimizes performance. For example, in Azure Cosmos DB, you can embed some data and reference other data, as shown in Example 2. This approach allows you to save round trips to the server and improve performance.
Hybrid models do raise the classic NoSQL challenge of keeping duplicated data consistent when a store lacks multi-document transactions. Azure Cosmos DB helps here: you can use server-side triggers or stored procedures to insert and update related documents within a single ACID transaction, which lets you build more complex data models without sacrificing consistency.
Here are some key characteristics of hybrid models:
- Embed data that changes infrequently, such as author names and thumbnails.
- Reference data that changes frequently or is large in size, such as book documents.
- Use pre-calculated aggregates to save expensive processing on read operations.
- Take advantage of Azure Cosmos DB's support for multi-document transactions.
By applying these principles, you can create a hybrid data model that balances the benefits of embedded and referenced data models. This approach can help you optimize performance, simplify application logic, and improve overall data management.
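Here is a hedged sketch of such a hybrid model, following the book/author pattern discussed earlier. Field names and values are illustrative: small, rarely-changing author details are embedded in each book, full author documents are referenced by ID, and the author carries a pre-calculated book count.

```python
# Book document: embeds the author details it always displays, but only
# references the full author documents by id.
book = {
    "id": "book-b1",
    "name": "Azure Cosmos DB 101",
    "authors": [
        # Embedded: name and thumbnail change rarely and are always shown
        # with the book, so duplicating them avoids a second lookup.
        {
            "id": "author-a1",
            "name": "Thomas Andersen",
            "thumbnailUrl": "https://example.com/andersen.jpg",
        },
    ],
}

# Author document: references its (large, frequently changing) books by id
# and keeps a pre-calculated aggregate so reads don't have to count them.
author = {
    "id": "author-a1",
    "name": "Thomas Andersen",
    # Updated (for example, by a stored procedure in the same transaction)
    # whenever a book is added or removed.
    "countOfBooks": 3,
    "books": ["book-b1", "book-b2", "book-b3"],
}
```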
Normalization
Normalization is a crucial aspect of data modeling, especially in a schema-free world. It helps reduce the size of your data footprint in both transactional and analytical stores.
By normalizing your data, you get smaller transactions and fewer properties per document, which in turn helps the performance of your analytical queries. Because there are limits to the number of nesting levels and properties represented in the analytical store, flatter documents with fewer properties also decrease the chances that parts of your data won't be represented there.
SQL serverless pools in Azure Synapse support result sets with up to 1,000 columns. Exposing nested columns also counts towards this limit, so be mindful of that when designing your data model.
Here are some key takeaways from normalization:
- Smaller data footprint in both transactional and analytical store.
- Smaller transactions.
- Fewer properties per document.
- Data structures with fewer nested levels.
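A small sketch of what this normalization looks like in a schema-free store, with assumed field names: a large nested document is split into smaller, flatter documents that reference each other by ID.

```python
# Before: one large document with several nested levels, many properties,
# and an orders array that grows without bound over time.
customer_with_orders = {
    "id": "cust-42",
    "name": "Jane Doe",
    "orders": [
        {
            "orderId": "order-1001",
            "items": [{"sku": "dog-bed-l", "quantity": 1}],
        },
        # ...one entry per order
    ],
}

# After normalizing: each document is smaller and flatter, so transactions
# are smaller and fewer properties and nesting levels count against the
# analytical store's representation limits.
customer = {"id": "cust-42", "name": "Jane Doe"}
order = {
    "id": "order-1001",
    "customerId": "cust-42",
    "items": [{"sku": "dog-bed-l", "quantity": 1}],
}
```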
Work Tracking Sets
Work tracking sets are a crucial aspect of data modeling in Azure Boards Analytics. They allow you to group and filter work items based on various criteria, such as area hierarchy or iteration paths.
The supported work tracking sets include Area, Iteration, BoardLocation, CalendarDate, Project, Process, Tag, Team, User, WorkItemBoardSnapshot, WorkItemLink, WorkItemRevision, WorkItemSnapshot, and WorkItem. These sets provide a comprehensive view of your work tracking data.
Each work tracking set has its own set of properties, which can be used for grouping and filtering. For example, the Area set includes properties for grouping and filtering by area hierarchy.
Each of these sets became available in a specific Analytics API version; the Analytics data model reference linked in the sources lists the exact version in which each set was introduced.
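As a quick illustration of querying one of these sets, here is a hedged sketch of an OData query that groups current work items by area path. The organization and project segments are placeholders, and the navigation property names are approximations to check against the Analytics OData reference.

```python
# Roll up the count of work items per area path (placeholders: {org},
# {project}; verify Area/AreaPath against the Analytics OData reference).
area_rollup_query = (
    "https://analytics.dev.azure.com/{org}/{project}/_odata/v3.0-preview"
    "/WorkItems"
    "?$apply=groupby((Area/AreaPath), aggregate($count as Count))"
)
```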
Test Entity Types and Sets
Test entity types and entity sets are an essential part of data modeling, and understanding how they work can make a big difference in your data analysis.
The test-related entity types and entity sets are supported with the v3.0-preview and v4.0-preview Analytics versions, which is a good starting point for your data modeling journey.
The TestConfiguration entity type (entity set TestConfigurations) provides test plan configuration information. It is supported in both the v3.0-preview and v4.0-preview Analytics versions.
The TestResult entity type (entity set TestResults) is also important: it provides the individual execution results for a specific Test associated with a TestRun. It too is supported in both the v3.0-preview and v4.0-preview Analytics versions.
Together, these entity types and their entity sets provide a solid foundation for test-reporting data models, and understanding how they work helps you make the most of your data analysis.
Sources
- https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data
- https://senturus.com/resources/data-modeling-in-azure-architecture/
- https://learn.microsoft.com/en-us/azure/devops/report/extend-analytics/data-model-analytics-service?view=azure-devops
- https://learn.microsoft.com/en-us/azure/databricks/transform/data-modeling
- https://azure.github.io/cloud-scale-data-for-devs-guide/schema-considerations.html