A federated data lake is a scalable approach to managing data across multiple sources and locations. It integrates varied data sets behind a single query layer, enabling organizations to make more informed decisions.
By using a federated data lake, businesses can reduce data silos and improve data accessibility. Rather than physically centralizing everything, a unified platform lets different departments and teams share data and collaborate while the data stays where it lives.
Data governance is a critical aspect of a federated data lake, ensuring that data is accurate, consistent, and secure. This involves establishing clear policies and procedures for data management, as well as implementing robust security measures to protect sensitive information.
A well-designed federated data lake can help organizations unlock the full potential of their data, driving business growth and innovation.
Prerequisites
To start building a federated data lake, you'll need to prepare a few things. First, sign up for a Starburst Galaxy account, which will serve as the query engine for managing and analyzing the data. Next, you'll need an AWS account with credentials, as S3 will be used as both a source and a target in the data lakehouse architecture. You'll also need MongoDB Atlas credentials to connect to the MongoDB database and retrieve its data. Lastly, if you want to build a liveboard at the end of this project, you'll need ThoughtSpot credentials, although that step is optional.
Here are the prerequisites you'll need to complete this tutorial:
- Starburst Galaxy account
- AWS account credentials (S3)
- MongoDB credentials (MongoDB Atlas)
- ThoughtSpot credentials (optional)
Set Up MongoDB and AWS Sources
To set up your MongoDB and AWS sources, you'll need to start by reviewing the source files and uploading each file to its respective location. Make sure both sources are set up in the same region, which will help streamline the process.
Create an S3 bucket for your Amazon source with a unique identifier like "pokemon-demo-mm". Within that bucket, create a subfolder to hold the Pokémon spawns CSV file, such as "pokemon-spawns-csv/pokemon-spawns.csv". This will keep your files organized and easily accessible.
Next, you'll need to create an AWS access key that can authenticate the connection between S3 and Starburst Galaxy. This will allow you to connect your AWS source to your data lake, enabling seamless access to your data.
Here's a quick rundown of the steps:
- Review the source files and confirm which file belongs to which source
- Create an S3 bucket with a unique name, add a subfolder, and upload the Pokémon spawns CSV file to it
- Load the Pokédex data into MongoDB Atlas in the same region as your S3 bucket
- Create an AWS access key to authenticate the connection between S3 and Starburst Galaxy
By following these steps, you'll be well on your way to setting up your MongoDB and AWS sources and integrating them into your federated data lake.
Configuring and Creating
In a federated data lake, setting up the foundation is crucial. To start, you'll need to create catalogs and a cluster in Starburst Galaxy.
You'll create two catalogs, one for Amazon S3 Iceberg and one for MongoDB, naming them aws_pokemon and mongo_pokedex, respectively. Be sure to authenticate to S3 and MongoDB using your access key and connection URL.
After creating the catalogs, you'll create a cluster named pokemon-cluster and attach both catalogs to it. Select your cluster size, type, and cloud provider region to create the cluster.
Once your cluster is set up, you'll configure role-based access control for your object storage location by adding a new location privilege to the account admin role.
BigQuery to Google Cloud Storage
When working with BigQuery, you have two options for storing data from an external source: permanent tables and temporary tables. A permanent table is created in a BigQuery dataset linked to your external data source, allowing you to share the table with others who have access to the underlying external data source.
You can create a permanent table using the BigQuery command line or console, which is the right choice when others need to query the table alongside the rest of your dataset.
For one-time, ad-hoc queries or one-time extract, transform, and load (ETL) workflows, consider a temporary table instead. You submit a command that includes a query and creates a non-permanent table linked to the external data source, which is useful when you don't need to share the query or table with others.
To choose between permanent and temporary tables, consider whether the table needs to be shared and queried repeatedly: permanent tables suit shared, recurring access, while temporary tables suit one-off queries and ETL runs. Picking the right option keeps your BigQuery setup efficient and effective for your specific needs.
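As a rough illustration, a permanent table over external data can be defined in BigQuery standard SQL. In this sketch the dataset, table, column, and bucket names are all placeholders, not values from this tutorial:

```sql
-- Hedged sketch: a permanent BigQuery table backed by an external CSV
-- file in Cloud Storage. Dataset, table, column, and bucket names are
-- placeholders.
CREATE EXTERNAL TABLE my_dataset.pokemon_spawns_ext (
  name STRING,
  lat  FLOAT64,
  lng  FLOAT64
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/pokemon-spawns.csv'],
  skip_leading_rows = 1  -- skip the CSV header row
);
```

Because the table is permanent, anyone with access to the dataset (and the underlying bucket) can query it like any other BigQuery table.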
Configure Starburst Galaxy
To configure Starburst Galaxy, start by creating an Amazon S3 Iceberg catalog named aws_pokemon. Authenticate to S3 through your previously created access key.
Naming is important for your future querying, so keep in mind your selected names for each catalog and cluster. For your default directory name, enter the name of your favorite Pokémon. Enable both creating and writing to external tables, and connect the catalog. Save the account admin access controls after connecting.
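For instance, once the catalog is connected, tables are addressed as catalog.schema.table in every query, so the names you pick here follow you everywhere. A minimal sketch, where the schema and table names are assumptions:

```sql
-- Hedged sketch: fully qualified Trino-style naming in Starburst Galaxy.
-- The schema and table names are placeholders chosen for this tutorial.
SELECT *
FROM aws_pokemon.pokemon.pokemon_spawns
LIMIT 10;
```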
Next, create a MongoDB catalog named mongo_pokedex. Authenticate to MongoDB using either a direct connection or an SSH tunnel; I connected with the connection URL, which was easy to find in my MongoDB Compass setup. Save the account admin access controls after connecting.
After creating both catalogs, create your cluster named pokemon-cluster. Attach both previously created catalogs, aws_pokemon and mongo_pokedex, to this cluster. Select your cluster size, cluster type, and cloud provider region. Then, create your cluster.
Finally, configure role-based access control for your object storage location. Navigate to the Access control tab and select the Roles and privileges dropdown. Click into the highlighted account admin role, and add a new location privilege. Enter the S3 URI followed by a /* for your newly created S3 bucket. For example: s3://pokemon-demo-mm/*.
Create Your Structure
To create your structure layer, you'll first create a schema to hold all the Iceberg tables. This schema will serve as the foundation for your structure tables.
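A minimal sketch of that step, assuming the aws_pokemon catalog from earlier, a hypothetical schema name, and the example bucket:

```sql
-- Hedged sketch: a schema to hold the Iceberg tables.
-- The schema name and the S3 location are assumptions based on the
-- bucket created earlier in this tutorial.
CREATE SCHEMA aws_pokemon.pokemon
WITH (location = 's3://pokemon-demo-mm/');
```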
We'll create the structure table for the Pokémon Go spawn data stored in S3. Replace the AWS bucket name with your own to ensure the SQL statement works correctly.
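Here's a hedged sketch of how that could look: first a raw land table over the CSV, then a typed structure table built from it. The column names are assumptions based on the public Pokémon spawns dataset, and the exact table properties may differ depending on how your catalog is configured:

```sql
-- Hedged sketch: a raw "land" table reading the CSV directly from S3.
-- CSV-backed tables keep every column as VARCHAR; the column names and
-- table properties are assumptions.
CREATE TABLE aws_pokemon.pokemon.pokemon_spawns_land (
  num  VARCHAR,
  name VARCHAR,
  lat  VARCHAR,
  lng  VARCHAR
)
WITH (
  type = 'hive',
  format = 'csv',
  external_location = 's3://pokemon-demo-mm/pokemon-spawns-csv/'
);

-- Structure table: cast the raw text columns into proper types.
CREATE TABLE aws_pokemon.pokemon.pokemon_spawns AS
SELECT
  CAST(num AS INTEGER) AS number,
  name,
  CAST(lat AS DOUBLE) AS lat,
  CAST(lng AS DOUBLE) AS lng
FROM aws_pokemon.pokemon.pokemon_spawns_land;
```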
The consume table will combine the Type 1, Type 2, and Mega Evolution status from the Pokédex with each Pokémon found in the Pokémon spawn data, forming the consume layer that sits on top of your structure tables.
To create the consume table, you'll need to restrict the latitude and longitude to focus on the San Francisco Bay Area. This narrows the data down to only the Pokémon found within that specific region, as sketched below.
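A hedged sketch of that join, where the MongoDB schema, collection, and field names are assumptions and the latitude/longitude bounds are an approximate Bay Area bounding box:

```sql
-- Hedged sketch: the consume table joins the S3 spawn data with the
-- MongoDB Pokédex in a single statement. The mongo_pokedex schema,
-- collection, and field names are assumptions; the lat/lng bounds are
-- an approximate San Francisco Bay Area bounding box.
CREATE TABLE aws_pokemon.pokemon.bay_area_pokemon AS
SELECT
  s.name,
  p.type1,
  p.type2,
  p.mega_evolves,
  s.lat,
  s.lng
FROM aws_pokemon.pokemon.pokemon_spawns AS s
JOIN mongo_pokedex.pokedex.pokedex AS p
  ON s.number = p.number
WHERE s.lat BETWEEN 36.8 AND 38.4
  AND s.lng BETWEEN -123.1 AND -121.5;
```

Note that this single statement reads from both catalogs at once, which is exactly the cross-source querying the federated setup makes possible.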
Data Governance and Security
Data governance and security are crucial aspects of a federated data lake. Because a federated platform queries data where it lives, governance centers on authenticating users, authorizing access at a fine-grained level, encrypting data in transit, and monitoring activity. The sections below look at each of these in turn.
What Is Governance?
Data governance is the practice of managing and regulating data to ensure it's accurate, secure, and compliant with regulations. It's a crucial aspect of data management that helps organizations make informed decisions.
Data governance involves establishing policies, procedures, and guidelines for data collection, storage, and usage. This helps to prevent data breaches and ensures data quality.
Federated data governance is a type of data governance that involves multiple organizations or departments working together to manage and regulate data. This approach can be beneficial for large organizations with complex data systems.
The pros of federated data governance include improved data consistency and reduced data duplication. However, it can also be challenging to implement and maintain due to the need for coordination among multiple stakeholders.
Ultimately, effective data governance requires a combination of technical expertise, business acumen, and organizational buy-in. By establishing clear policies and procedures, organizations can ensure their data is secure, accurate, and compliant with regulations.
Security and Governance
Data governance is all about ensuring that your organization's data is secure and accessible only to those who need it. Multiple authentication options combined with role-based and attribute-based authorization policies limit users to data their jobs justify.
Starburst's federated platform leaves data at the source, avoiding data duplication security risks. This means you don't have to worry about sensitive data being copied or stored in multiple places.
Fine-grained controls let you manage access at the table, row, and column levels, giving you precise control over who sees what. This is especially useful for organizations with sensitive data that needs to be protected.
Data logging and real-time monitoring improve compliance and enforcement of your data governance policies. This helps you stay on top of who's accessing what data and when.
Trusted Execution Environments (TEEs) add a hardware-based layer of protection for executing sensitive code; they're covered in more detail in their own section below.
End-to-end encryption protects all data in transit, giving you an extra layer of security. This is especially important for organizations that handle sensitive data, such as financial information.
By implementing these security measures, you can ensure that your organization's data is secure and compliant with regulations.
Reducing Integrity Risks
Data integrity is crucial for any organization, and it's essential to minimize the risks associated with it. Companies with complex data systems, such as those in the finance and healthcare industries, are more at risk of having low data integrity.
Data integrity risks can be minimized by identifying and addressing data quality issues, implementing data validation and verification processes, and ensuring data stays accurate and up-to-date.
Data quality issues can arise from human error, system glitches, or intentional data manipulation. Companies should have processes in place to detect and correct data errors quickly.
Data integrity is a critical aspect of data governance and security, and minimizing risks is essential to maintaining the trust of customers and stakeholders. By taking proactive steps to address data integrity risks, companies can reduce the likelihood of data breaches and other security threats.
Trusted Execution Environments (TEEs)
Trusted Execution Environments (TEEs) are hardware-based security technologies that create secure environments for executing sensitive code. They ensure that the correct application is executing and that the data in the application is not revealed elsewhere.
TEEs can provide a solution for producing a comprehensive view of financial regulations and due diligence insights across geographies, ensuring security, verifiability, and compliance with legal constraints. This is demonstrated in the TEADAL shared financial data governance pilot.
In this secure environment, sensitive data remains confidential, which is particularly important in strictly regulated domains such as medical records. Relatedly, Secure Multi-Party Computation (MPC) can enable multiple hospitals to collaboratively produce holistic insights from patient datasets without compromising the confidentiality of individual patients' records.
The use of TEEs can be a game-changer in industries where data security is paramount, allowing for the execution of sensitive code while maintaining confidentiality. This is especially relevant in the medical field, where patient data is highly sensitive.
Frequently Asked Questions
What is the difference between data federation and data lake?
Data federation virtualizes multiple data sources, providing a unified view without moving or copying raw data, whereas a data lake ingests large volumes of raw data for analysis and exploration. This difference affects how data is stored, accessed, and utilized in each approach.
What does federated mean in data?
In data, federated describes a system that presents multiple independent data sources through a single, unified interface without physically consolidating them. This allows easier access to and querying of data from various sources.
What is a federated data platform?
A federated data platform is a software system that connects and integrates data from multiple sources, providing a secure and centralized environment for accessing information. It enables organizations to bring together disparate data, making it easier to access and use for informed decision-making.
Sources
- https://www.starburst.io/blog/build-a-federated-data-lakehouse-with-starburst-galaxy/
- https://www.unleash.so/a/community/knowledge-management/what-is-a-federated-data-lake
- https://www.starburst.io/data-glossary/data-federation/
- https://teadal.eu/2023/12/07/unlocking-the-future-of-privacy-aware-federated-data-lakes-insights-from-teadal/
- https://cloud.google.com/blog/products/gcp/accessing-external-federated-data-sources-with-bigquerys-data-access-layer