
To set up a secure Azure Databricks subnet, you'll need to create a virtual network (VNet) that meets Azure's security requirements. This involves specifying a unique name and IP address range for the VNet.
A VNet can use either public or private IP address space. Azure recommends private (RFC 1918) ranges such as 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16 for Databricks subnets, because those addresses are not routable on the public internet.
By creating a separate subnet for Databricks, you can isolate your data processing workloads from other Azure resources and reduce the attack surface of your network.
Azure Databricks subnets can be configured to use a variety of security features, including network security groups (NSGs) and Azure Firewall.
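As a rough illustration of the address planning described above, the following Python sketch uses the standard `ipaddress` module to confirm that a VNet range is private and to carve two equally sized subnets out of it. The range `10.10.0.0/24`, the `/26` subnet size, and the function name are assumptions chosen for the example, not values from Azure.

```python
import ipaddress


def plan_databricks_subnets(vnet_cidr, subnet_prefix=26):
    """Return two non-overlapping subnets carved from a private VNet range.

    Hypothetical helper for planning only; it does not call any Azure API.
    """
    vnet = ipaddress.ip_network(vnet_cidr)
    if not vnet.is_private:
        raise ValueError(f"{vnet} is not a private address range")
    # subnets() yields consecutive blocks of the requested size.
    blocks = vnet.subnets(new_prefix=subnet_prefix)
    first = next(blocks)
    second = next(blocks)
    return first, second


# Example range (an assumption, pick one that fits your own network plan).
first_subnet, second_subnet = plan_databricks_subnets("10.10.0.0/24")
print(first_subnet, second_subnet)
```

Running the check before creating anything in Azure catches a public or undersized range early, when it is still cheap to change.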
Private Endpoints
Private endpoints let Azure services communicate over private IP addresses inside your VNet instead of over the public internet.
You can use a private endpoint to let Azure Databricks talk to Azure SQL Database. This involves adding a new subnet to your databricks-vnet, creating the private endpoint, and configuring the DNS settings.
To create a private endpoint for Azure SQL Database, follow these steps:
- Add a new subnet to your databricks-vnet
- Create a private endpoint, choose the virtual network you're using for Databricks, and select the new subnet
- Configure the DNS settings by opening 'Virtual network links' on the private DNS zone and adding a new virtual network link to your Databricks VNet
This will allow Azure Databricks to communicate with Azure SQL Database securely. You can also create private endpoints so that Azure Machine Learning and Azure Databricks can communicate with each other.
For example, to allow Azure Machine Learning to communicate with Azure Databricks, you can create a private endpoint for Azure Machine Learning using the following steps:
- Select your Azure Machine Learning workspace and go to Networking, Private endpoint connections
- Create a new private endpoint and select the Virtual network that is used by Azure Databricks
- Configure the DNS settings by opening 'Virtual network links' on the private DNS zone and adding a new virtual network link to your Databricks VNet
Similarly, to allow Azure Databricks to communicate with Azure Machine Learning, you can create a private endpoint for Azure Databricks using the following steps:
- Select your Azure Databricks instance and go to Networking, Private endpoint connections
- Create a new private endpoint and select the Virtual network that is used by Azure Machine Learning
By following these steps, you can ensure secure communication between your Azure services using Private Endpoints.
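The "add a new subnet" step can be planned offline before touching the portal. The sketch below is a minimal example, assuming a VNet of `10.10.0.0/24` whose first two `/26` blocks are already taken by Databricks. It finds the first free `/27` for the private endpoint and checks whether an address falls inside it. All ranges and function names are hypothetical.

```python
import ipaddress


def first_free_subnet(vnet_cidr, used_cidrs, new_prefix=27):
    """Return the first block of size new_prefix that overlaps no used range."""
    vnet = ipaddress.ip_network(vnet_cidr)
    used = [ipaddress.ip_network(cidr) for cidr in used_cidrs]
    for candidate in vnet.subnets(new_prefix=new_prefix):
        if not any(candidate.overlaps(taken) for taken in used):
            return candidate
    raise ValueError(f"no free /{new_prefix} left in {vnet}")


def is_in_subnet(ip, subnet):
    """True if ip lies inside subnet, e.g. to confirm a name resolves privately."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(subnet)


# Assumed layout: two existing Databricks subnets inside the VNet.
endpoint_subnet = first_free_subnet(
    "10.10.0.0/24", ["10.10.0.0/26", "10.10.0.64/26"]
)
print(endpoint_subnet)
```

Once the private endpoint and DNS link exist, resolving the service hostname from a notebook and passing the result to `is_in_subnet` is one way to confirm that traffic stays on the private address.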
Secure Cluster Connectivity
Secure Cluster Connectivity is an Azure Databricks feature that removes public IP addresses from cluster nodes and leaves no open inbound ports on the workspace's virtual network. You can enable it when creating a new workspace using the Azure portal or an ARM template.
To enable secure cluster connectivity on a new workspace, go to the Networking tab and set Deploy Azure Databricks workspace with Secure Cluster Connectivity (No Public IP) to Yes in the Azure portal. Alternatively, in an ARM template, set the enableNoPublicIp Boolean parameter to true in the Microsoft.Databricks/workspaces resource.
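To make the ARM option more concrete, here is a hedged sketch that builds the workspace resource as a Python dictionary and prints it as JSON. Only `enableNoPublicIp` and the `Microsoft.Databricks/workspaces` type come from the text above. The API version, workspace name, SKU, managed resource group, and the VNet injection parameters are assumptions for illustration and should be checked against the current resource schema.

```python
import json


def workspace_resource(name, vnet_id, host_subnet, container_subnet):
    """Build a Microsoft.Databricks/workspaces ARM resource with no public IPs.

    Hypothetical helper; all argument values are placeholders.
    """
    return {
        "type": "Microsoft.Databricks/workspaces",
        "apiVersion": "2018-04-01",  # assumed version, verify before use
        "name": name,
        "location": "[resourceGroup().location]",
        "sku": {"name": "premium"},  # assumed SKU
        "properties": {
            # Placeholder managed resource group name.
            "managedResourceGroupId": "[subscriptionResourceId('Microsoft.Resources/resourceGroups', 'databricks-managed-rg')]",
            "parameters": {
                # The setting described in the text above.
                "enableNoPublicIp": {"value": True},
                # VNet injection settings (assumed parameter names).
                "customVirtualNetworkId": {"value": vnet_id},
                "customPublicSubnetName": {"value": host_subnet},
                "customPrivateSubnetName": {"value": container_subnet},
            },
        },
    }


resource = workspace_resource(
    "example-workspace",
    "/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/databricks-vnet",
    "host-subnet",
    "container-subnet",
)
print(json.dumps(resource, indent=2))
```

Generating the fragment from code keeps the Boolean a real JSON `true` rather than a string, which is an easy mistake to make when editing templates by hand.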
Adding secure cluster connectivity to an existing workspace requires that the workspace uses VNet injection, and you might need to update your firewall or network security group rules to control ingress or egress from the classic compute plane.
With secure cluster connectivity, your workspace subnets are private subnets, and cluster nodes do not have public IP addresses. This means that egress from your workspace subnets is handled differently, depending on whether you use the default (managed) VNet or VNet injection.
If you use the default VNet, Azure Databricks automatically creates a NAT gateway for outbound traffic from your workspace’s subnets to the Azure backbone and public network. This NAT gateway incurs additional cost.
If you use VNet injection, Databricks recommends that your workspace has a stable egress public IP. You get one by adding an explicit outbound method, which also ensures that your workspace can connect to external services.
There are two options for adding an explicit outbound method:
- Azure NAT gateway: Configure the gateway on both of the workspace’s subnets to ensure that all outbound traffic to the Azure backbone and public network transits through it.
- UDRs: Add direct routes or allowed firewall rules for the Azure Databricks secure cluster connectivity relay and other required endpoints.
Do not use an egress load balancer with a workspace that has secure cluster connectivity enabled, because it risks exhausting ports.
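The rules above can be expressed as a small pre-deployment check. This is only a sketch: the dictionary layout, key names, and subnet names are invented for the example and do not correspond to any Azure or Databricks API.

```python
def check_egress(workspace):
    """Return a list of problems found in a hypothetical workspace description."""
    problems = []
    egress = workspace.get("egress")

    # An egress load balancer combined with secure cluster connectivity
    # risks port exhaustion, so flag it.
    if workspace.get("secure_cluster_connectivity") and egress == "load_balancer":
        problems.append("egress load balancer used with secure cluster connectivity")

    # A NAT gateway must be attached to both workspace subnets.
    if egress == "nat_gateway":
        attached = set(workspace.get("nat_gateway_subnets", []))
        missing = sorted(set(workspace["subnets"]) - attached)
        if missing:
            problems.append("NAT gateway not attached to: " + ", ".join(missing))

    return problems


# Example description with the gateway on only one of the two subnets.
example = {
    "secure_cluster_connectivity": True,
    "egress": "nat_gateway",
    "subnets": ["host-subnet", "container-subnet"],
    "nat_gateway_subnets": ["host-subnet"],
}
print(check_egress(example))
```

A check like this fits naturally into a CI step that reviews infrastructure definitions before they are applied.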
Egress Options
Microsoft and Databricks strongly recommend enabling secure cluster connectivity for the most secure deployment.
With secure cluster connectivity enabled, your workspace subnets become private subnets, since cluster nodes don't have public IP addresses.
You'll need to consider the implementation details of network egress, which vary based on whether you use the default (managed) VNet or VNet injection.
Additional costs may be incurred due to increased egress traffic when using secure cluster connectivity.
If you use VNet injection, Databricks recommends that your workspace has a stable egress public IP.
This is useful for adding IP addresses to external allow lists, such as when connecting to Salesforce.
In addition, Microsoft announced that default outbound access connectivity for virtual machines in Azure will be retired on September 30, 2025.
To avoid disruptions, add explicit outbound methods for your workspaces before that date.
You can use an Azure NAT gateway or user-defined routes (UDRs) to achieve this.
Here are your egress options:
- Azure NAT gateway: Use if your deployments only need some customization. Configure the gateway on both workspace subnets to ensure all outbound traffic transits through it.
- UDRs: Use if your deployments have complex routing requirements or your workspaces use VNet injection with an egress firewall.
Do not use an egress load balancer with a workspace that has secure cluster connectivity enabled, as it can lead to port exhaustion.
Setup and Prerequisites
To connect Azure Databricks and Azure Machine Learning over private endpoints, you'll need an Azure Machine Learning workspace configured for network isolation.
You'll also need an Azure Databricks deployment configured in a virtual network (VNet injection). Azure Databricks requires two subnets of its own, and the Azure Machine Learning workspace cannot use either of them when creating a private endpoint.
For that reason, add a third subnet to the VNet used by Azure Databricks and place the private endpoint there.
The VNets used by Azure Machine Learning and Azure Databricks must also use different, non-overlapping IP address ranges.
Here are the specific prerequisites:
- An Azure Machine Learning workspace configured for network isolation.
- An Azure Databricks deployment configured in a virtual network (VNet injection).
- A third subnet added to the VNet used by Azure Databricks for the private endpoint.
- VNets used by Azure Machine Learning and Azure Databricks use different IP address ranges.
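The last two prerequisites are easy to verify offline. The following sketch, using assumed example ranges and a hypothetical function name, checks that the Databricks VNet has at least three subnets and that the two VNets do not overlap.

```python
import ipaddress


def check_prerequisites(databricks_vnet, databricks_subnets, aml_vnet):
    """Return a list of unmet prerequisites (empty if everything looks fine)."""
    issues = []

    # Two subnets for Databricks itself plus one for the private endpoint.
    if len(databricks_subnets) < 3:
        issues.append("Databricks VNet needs a third subnet for the private endpoint")

    # The two VNets must use different address ranges.
    dbx = ipaddress.ip_network(databricks_vnet)
    aml = ipaddress.ip_network(aml_vnet)
    if dbx.overlaps(aml):
        issues.append(f"VNet ranges overlap: {dbx} and {aml}")

    return issues


# Assumed example values.
print(check_prerequisites(
    "10.10.0.0/24",
    ["10.10.0.0/26", "10.10.0.64/26", "10.10.0.128/27"],
    "10.20.0.0/24",
))
```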
Frequently Asked Questions
Why does Databricks need two subnets?
Databricks requires two subnets to provide separate IP addresses for the host (Azure VM) and the container (Databricks runtime) that runs inside it. This setup ensures secure and isolated environments for both the host and container.
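Because every cluster node takes one address in each subnet, subnet size caps cluster size. The sketch below estimates that cap, assuming one IP per node in each subnet and the five addresses Azure reserves in every subnet. The function name and example ranges are hypothetical.

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # addresses Azure reserves in every subnet


def max_cluster_nodes(host_cidr, container_cidr):
    """Estimate how many nodes fit, limited by the smaller of the two subnets."""
    host = ipaddress.ip_network(host_cidr)
    container = ipaddress.ip_network(container_cidr)
    usable = min(host.num_addresses, container.num_addresses) - AZURE_RESERVED_IPS
    return max(usable, 0)


# Two /26 subnets: 64 addresses each, minus 5 reserved.
print(max_cluster_nodes("10.10.0.0/26", "10.10.0.64/26"))  # 59
```

Subnet sizes are hard to change later, so it helps to run this estimate against the largest cluster count you expect across the workspace.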
What is the difference between Databricks container and host subnet?
The host subnet (sometimes called the public subnet) holds the IP address of each cluster node's virtual machine and enables communication between the workspace and Azure services. The container subnet (sometimes called the private subnet) holds the IP address of the Databricks runtime container on that node and carries communication between Spark executors and the driver.
What is an Azure subnet?
An Azure subnet is a range of IP addresses within a virtual network, used for organization and security. Each virtual machine's network interface card (NIC) is attached to a subnet, which places the machine in that virtual network.
Sources
- https://gregorsuttie.com/2023/05/15/azure-databricks-talking-to-azure-sql-database-using-private-endpoints/
- https://learn.microsoft.com/en-us/azure/databricks/security/network/classic/secure-cluster-connectivity
- https://sqlinsix.medium.com/vnet-injection-with-azure-databricks-fb4eb20dd3af
- https://medium.com/@antaliagacortes/restricting-traffic-from-azure-databricks-055e062dc2f5
- https://learn.microsoft.com/en-us/azure/machine-learning/how-to-securely-attach-databricks