To create a mount point in Azure Databricks, you need a storage account and credentials with the necessary permissions.
Azure Databricks supports mounting several storage systems, including Azure Blob Storage, Azure Data Lake Storage Gen2, and Amazon S3.
Mounting a storage system in Databricks lets you access its data through a DBFS path without copying it into the workspace first.
This is especially useful for large datasets that would be impractical to copy around.
To create a mount point, you run the dbutils.fs.mount() utility from a Databricks notebook.
The call specifies the storage location, the mount point name, and the storage account credentials.
What Is Mounting in Databricks?
Mounting in Databricks is a way to access object storage as if it were part of the local file system. Once mounted, the external cloud storage appears under a DBFS path such as /mnt/<name>.
You can use the dbutils.fs.mount() command to mount a location in Databricks, such as an Azure Blob Storage container or an Azure Data Lake Storage file system.
To mount a location, you'll need to provide the storage account name and either an access key or a SAS token. You can get the access key from the Azure portal, or generate a container-level SAS with at least read and list permissions.
Here are the pre-requisites for mounting a storage container in Databricks:
- An Azure Databricks Service.
- A Databricks Cluster (compute).
- A Databricks Notebook.
- An Azure Data Lake Storage or Blob Storage account.
Mounting a storage container involves several steps, including creating the container and blobs, mounting the container, verifying the mount point, listing the contents, and unmounting the container.
Mounting a Data Lake
To mount a data lake in Databricks, you'll need to use the dbutils.fs.mount() command. This command allows you to access object storage as if it were on the local file system.
You can use either Access Key or SAS token for authentication, and it's essential to note that the mount example provided does not re-mount an existing mount point. To re-mount, you must unmount and then mount again.
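For example, here is a minimal sketch of guarding against an already-mounted path before re-mounting; it assumes it runs in a Databricks notebook where dbutils is available, and /mnt/data is just the example name used throughout this article:

# Unmount first if the path is already mounted, since dbutils.fs.mount() fails on an existing mount point.
mount_point = "/mnt/data"
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)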
To get started, create a storage container and blobs, then mount the container using the dbutils.fs.mount() command. You can use the following code snippet to mount an Azure Blob Storage container:
# Mount an Azure Blob Storage container using the storage account access key.
dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/",
  mount_point = "/mnt/data",
  extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<access-key>"}
)
Make sure to replace the container name, storage account name, and access key with your own values. You can also authenticate with a SAS token instead of the access key; in that case the config key becomes fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net and the SAS token is its value.
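As a rough sketch, the SAS-token variant of the same mount might look like this (all placeholder names are illustrative):

# Mount the same container with a container-level SAS token that has read and list permissions.
dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/",
  mount_point = "/mnt/data",
  extra_configs = {"fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net": "<sas-token>"}
)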
Once you've mounted the data lake, you can verify the mount point using the dbutils.fs.mounts() command. This will list all the mounted locations, including the data lake you just mounted.
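For example, from a notebook cell:

# List every mount point in the workspace; /mnt/data should appear in the output.
display(dbutils.fs.mounts())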
Here's an example of how to list the contents of the mounted data lake using the dbutils.fs.ls() command:
dbutils.fs.ls("/mnt/data")
This will list the file info, including the path, name, and size. You can also use the spark.read command to read data from the mount point.
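For instance, assuming a CSV file named sample.csv exists under the mount (the file name is hypothetical):

# Read a CSV file from the mounted container into a Spark DataFrame.
df = spark.read.csv("/mnt/data/sample.csv", header=True, inferSchema=True)
df.show(5)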
To unmount the data lake, use the dbutils.fs.unmount() command:
dbutils.fs.unmount("/mnt/data")
Remember to replace the mount point with your own mount point.
Here's a summary of the steps to mount a data lake:
1. Create a storage container and blobs
2. Mount the container using dbutils.fs.mount()
3. Verify the mount point using dbutils.fs.mounts()
4. List the contents using dbutils.fs.ls()
5. Unmount the data lake using dbutils.fs.unmount()
By following these steps, you can easily mount a data lake in Databricks and access its contents as if they were on the local file system.
Securing Connectivity
Securing Connectivity is crucial when creating a mount point in Azure Databricks. To access Azure Storage / ADLS gen2 securely, you can use either Service Endpoints or Azure Private Link. Both options require the ADB workspace to be VNET injected.
There are two types of PaaS services in Azure: dedicated and shared. Azure Storage / ADLS gen2 is a shared service, which means it uses a shared architecture and cannot be deployed within a single customer network.
To connect securely to ADLS from ADB, you'll need to follow these steps: deploy ADB into a VNet, create a private storage account with a private endpoint, integrate the private endpoint with a private DNS zone, link the VNets with the private DNS zone, peer the VNets, and enable the storage firewall.
Service Principal & OAuth
Service Principal & OAuth is a crucial part of securing connectivity.
A service principal is a unique identity for an application or service, used to authenticate and authorize access to resources.
Using a service principal allows you to manage permissions and access control for your applications, making it easier to secure your connectivity.
This approach is especially useful when working with multiple applications or services that need to access shared resources.
Service principals can be used with OAuth, a widely adopted authorization framework that enables secure, delegated access to protected resources.
OAuth 2.0 is the most commonly used version, and it's supported by most major cloud providers.
By using a service principal and OAuth, you can ensure that only authorized applications have access to your resources, reducing the risk of unauthorized access or data breaches.
This is especially important when working with sensitive data or critical systems.
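A minimal sketch of mounting ADLS Gen2 with a service principal and OAuth 2.0 is shown below; the tenant ID, application (client) ID, secret scope, container, and storage account names are placeholders you would replace with your own:

# OAuth configuration for an Azure AD service principal (client credentials flow).
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-client-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<secret-scope>", key="<client-secret-key>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

# Mount an ADLS Gen2 container over the abfss driver using the service principal identity.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/adls",
  extra_configs = configs
)

Storing the client secret in a secret scope rather than in the notebook keeps the credential out of source control and cluster logs.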
Securing Connectivity
Securing connectivity to Azure services is crucial for keeping your data storage private. Azure Storage / ADLS Gen2 is a shared PaaS service built on shared infrastructure, so it cannot be deployed inside your own VNet; instead, you restrict it with the storage firewall plus Service Endpoints or Private Link so that it is reachable only from clients in your Azure VNets and not from the public internet.
To access ADLS gen2 securely from Azure Databricks, there are two options available: Service Endpoints and Azure Private Link. Both options require the ADB workspace to be VNET injected.
Azure Databricks can connect privately and securely with Azure Storage via private endpoint using a hub and spoke configuration. This involves deploying Azure Databricks into a VNet, creating a private storage account with a private endpoint, and integrating the private endpoint with a private DNS zone.
Here are the steps to follow for a secure connection:
- Deploy Azure Databricks into a VNet using the Portal or ARM template.
- Create a private storage account with a private endpoint and deploy it into a different VNet.
- Integrate the private endpoint with a private DNS zone.
- Link the VNets with the private DNS zone and peer the VNets.
- Enable the storage firewall and add the ADB VNet to communicate with the storage account (optional).
By following these steps, you can ensure a secure and private connection between Azure Databricks and Azure Storage.
AAD Credential Passthrough
AAD credential passthrough is a feature that automatically passes a user's Azure Active Directory (AAD) identity through to a connected service, eliminating the need to re-enter or store separate credentials. This is convenient for users and administrators alike.
In Azure Databricks, credential passthrough works with Azure Data Lake Storage Gen1 and Gen2, so the permissions of the notebook user's own AAD identity determine what data they can read and write.
This feature uses a secure token service to authenticate users, ensuring that credentials are not stored or transmitted in plain text.
AAD Credential Passthrough is particularly useful for applications that require frequent authentication, such as data scientists working with Azure Databricks.
To enable AAD credential passthrough, you turn it on in the cluster configuration in the Azure Databricks workspace, under the cluster's advanced options.
By leveraging Aad Credential Passthrough, organizations can streamline their authentication processes and reduce the risk of credential-related errors.
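As a sketch of how this looks for mounting, assuming a cluster with credential passthrough enabled and placeholder container and account names, a passthrough mount delegates authentication to a token provider instead of stored credentials:

# Configuration that delegates authentication to the notebook user's Azure AD identity.
configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class":
      spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

# Mount an ADLS Gen2 container; access is evaluated against the calling user's permissions.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/passthrough",
  extra_configs = configs
)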
Frequently Asked Questions
How to create mount point in Azure Databricks using SAS token?
To create a mount point in Azure Databricks using a SAS token, generate a SAS token for the container, store it in Azure Key Vault, expose it to Databricks through a Key Vault-backed secret scope, and then call dbutils.fs.mount() with the SAS configuration shown above.
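A sketch of retrieving the token, assuming a Key Vault-backed secret scope named kv-scope with the SAS stored under the key sas-token (both names are hypothetical):

# Fetch the SAS token from a Key Vault-backed secret scope instead of hard-coding it,
# then pass it to dbutils.fs.mount() via the fs.azure.sas.<container>.<account>.blob.core.windows.net config key.
sas_token = dbutils.secrets.get(scope="kv-scope", key="sas-token")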