Azure Data Factory Linked Services are a game-changer for scalable data pipelines. A linked service defines the connection information the service needs to reach an external data store or compute service, so you can integrate and transform data from multiple systems in a single pipeline.
With Linked Services, you can connect to popular data sources such as Azure SQL Database, Amazon S3, and Salesforce.
By using Linked Services, you can also leverage Azure's scalability and reliability to build data pipelines that can handle large volumes of data.
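As a rough illustration, here is a minimal sketch of registering an Azure SQL Database linked service with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory name, and connection string are placeholders, and exact model names can differ slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureSqlDatabaseLinkedService,
    LinkedServiceResource,
    SecureString,
)

# Placeholder names -- substitute your own subscription, resource group, and factory.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# A linked service wraps the connection details for one external data store.
sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=SecureString(
            value="Server=tcp:<server>.database.windows.net;Database=<db>;User ID=<user>;Password=<password>;"
        )
    )
)

adf_client.linked_services.create_or_update(
    "<resource-group>", "<factory-name>", "AzureSqlDatabaseLinkedService", sql_ls
)
```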
Pipelines
Pipelines are a crucial part of Azure Data Factory, allowing you to automate the movement and transformation of data. With ADF, you can create data-driven workflows that can be triggered by various events, scheduled to run at specific times, or initiated manually.
You can build a data pipeline by defining linked services, creating datasets, designing the pipeline, and publishing and monitoring it. This process involves specifying connection details for data sources and destinations, defining data structures, and adding activities to orchestrate the data processing.
ADF provides a rich set of activities that can be combined to build complex workflows, including data movement, data transformation, control flow, and custom activities. You can use these activities to create data pipelines that move and integrate data from various sources, such as Azure Blob Storage, Azure SQL Database, and Azure Data Lake Storage.
The data movement activities in ADF ensure that data can be efficiently transferred between different systems for subsequent processing and analysis. You can also use ADF to copy data from on-premises and cloud source data stores to a centralized location in the cloud, such as Azure Data Lake Storage or Azure SQL Database.
Here are the key steps to build a data pipeline:
- Define Linked Services: Create linked services for your data sources and destinations, specifying connection details for each data store.
- Create Datasets: Define datasets that represent the data structures you want to work with in your pipelines.
- Design the Pipeline: Use the ADF pipeline designer to add activities that define the workflow for your pipeline.
- Publish and Monitor: Once your pipeline is designed, publish it to ADF and use the monitoring tools to track its execution and troubleshoot any issues.
By following these steps and using the advanced monitoring and alerting capabilities of ADF, you can ensure that your data pipelines are running smoothly and efficiently, and that any issues are promptly addressed.
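As a rough end-to-end sketch of these steps, the following uses the azure-mgmt-datafactory Python SDK to create two Blob Storage datasets, design a pipeline with a single copy activity, then trigger and check a run. The resource names ("AzureStorageLinkedService", "InputDataset", "CopyPipeline", the folder paths, and so on) are illustrative assumptions, and API details may vary between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, BlobSink, BlobSource, CopyActivity, DatasetReference,
    DatasetResource, LinkedServiceReference, PipelineResource,
)

rg, df = "<resource-group>", "<factory-name>"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Step 1: a Blob Storage linked service named "AzureStorageLinkedService"
# is assumed to exist in the factory already.
blob_ls = LinkedServiceReference(type="LinkedServiceReference",
                                 reference_name="AzureStorageLinkedService")

# Step 2: datasets describe the data structures the pipeline reads and writes.
for name, path in [("InputDataset", "adf-container/input"),
                   ("OutputDataset", "adf-container/output")]:
    adf_client.datasets.create_or_update(
        rg, df, name,
        DatasetResource(properties=AzureBlobDataset(linked_service_name=blob_ls,
                                                    folder_path=path)))

# Step 3: design the pipeline -- here a single copy activity from input to output.
copy = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputDataset")],
    source=BlobSource(),
    sink=BlobSink())
adf_client.pipelines.create_or_update(rg, df, "CopyPipeline",
                                      PipelineResource(activities=[copy]))

# Step 4: trigger a run and check its status with the monitoring API.
run = adf_client.pipelines.create_run(rg, df, "CopyPipeline", parameters={})
print(adf_client.pipeline_runs.get(rg, df, run.run_id).status)
```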
HDInsight
Azure Data Factory offers two types of linked services for HDInsight: on-demand and bring your own (BYOC). The on-demand configuration allows the service to automatically create an HDInsight cluster to process data, which is created in the same region as the storage account associated with the cluster.
You can use an on-demand HDInsight cluster to run Hive, Pig, Spark, MapReduce, and Hadoop Streaming activities; the service handles creating and deleting the cluster for you. With the bring-your-own (BYOC) configuration, you instead register an existing HDInsight cluster that you manage yourself as a linked service.
HDInsight On-Demand
HDInsight On-Demand is a fully managed computing environment that's automatically created by the service before a job is submitted to process data and removed when the job is completed. This configuration is currently supported only for Azure HDInsight clusters.
It typically takes 20 minutes or more to provision an Azure HDInsight cluster on demand. The on-demand HDInsight cluster is created under your Azure subscription, and you can see the cluster in your Azure portal when it's up and running.
The logs for jobs that are run on an on-demand HDInsight cluster are copied to the storage account associated with the HDInsight cluster. You're charged only for the time when the HDInsight cluster is up and running jobs.
Here are some key points to consider about on-demand HDInsight linked service:
- The on-demand HDInsight cluster is created in the same region as the storage account associated with the cluster.
- You can use a Script Action with the Azure HDInsight on-demand linked service.
- The clusterUserName, clusterPassword, clusterSshUserName, and clusterSshPassword defined in your linked service definition are used to log in to the cluster for in-depth troubleshooting during the lifecycle of the cluster.
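As a hedged sketch, an on-demand HDInsight linked service might be registered with the Python SDK roughly as follows. The cluster size, time-to-live, HDInsight version, and the "AzureStorageLinkedService" reference are illustrative placeholders rather than prescribed values.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    HDInsightOnDemandLinkedService, LinkedServiceReference,
    LinkedServiceResource, SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The storage linked service below is assumed to exist already; its account's
# region determines where the transient cluster is created.
storage_ls = LinkedServiceReference(type="LinkedServiceReference",
                                    reference_name="AzureStorageLinkedService")

on_demand = LinkedServiceResource(properties=HDInsightOnDemandLinkedService(
    cluster_size=4,                      # number of worker nodes
    time_to_live="00:15:00",             # idle time before the cluster is deleted
    version="4.0",                       # HDInsight version
    linked_service_name=storage_ls,      # storage account used for data and job logs
    host_subscription_id="<subscription-id>",
    cluster_resource_group="<resource-group>",
    # Service principal that lets the service create the cluster on your behalf;
    # see the Principal Authentication section below.
    service_principal_id="<application-id>",
    service_principal_key=SecureString(value="<application-key>"),
    tenant="<tenant-id>",
))

adf_client.linked_services.create_or_update(
    "<resource-group>", "<factory-name>", "HDInsightOnDemandLinkedService", on_demand)
```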
Lake Analytics
Azure Data Lake Analytics is an on-demand analytics service that allows you to process and analyze large datasets. You can create an Azure Data Lake Analytics linked service to link this compute service to a data factory or Synapse workspace.
The type property in the linked service should be set to AzureDataLakeAnalytics. This is a required field and must be specified.
Azure Data Lake Analytics Account Name is another required field, which should be set to your Azure Data Lake Analytics account name.
You can also specify the Azure Data Lake Analytics URI, but it's not required. This is useful if you need to provide a custom URI for your analytics service.
To authenticate with Azure Data Lake Analytics using a service principal, you also supply tenant information. The Azure subscription ID and resource group name, by contrast, are optional and default to the data factory's own subscription and resource group if left blank.
Here's a summary of the fields for the Azure Data Lake Analytics linked service:
- type: must be set to AzureDataLakeAnalytics (required)
- accountName: your Azure Data Lake Analytics account name (required)
- dataLakeAnalyticsUri: a custom URI for the analytics service (optional)
- subscriptionId and resourceGroupName: default to the data factory's values when omitted (optional)
- tenant: the tenant information used with service principal authentication
By specifying these fields, you'll be able to create a linked service that connects to your Azure Data Lake Analytics compute service.
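A minimal sketch of the corresponding model in the Python SDK, assuming service principal authentication; all values are placeholders, and the object would be registered with linked_services.create_or_update as in the earlier examples.

```python
from azure.mgmt.datafactory.models import (
    AzureDataLakeAnalyticsLinkedService, LinkedServiceResource, SecureString,
)

adla_ls = LinkedServiceResource(properties=AzureDataLakeAnalyticsLinkedService(
    account_name="<adla-account-name>",          # required
    # Optional: default to the data factory's subscription and resource group.
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    # Service principal used to authenticate against the analytics account.
    service_principal_id="<application-id>",
    service_principal_key=SecureString(value="<application-key>"),
    tenant="<tenant-id>",
))
```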
Principal Authentication
Principal authentication is a crucial aspect of Azure Data Factory linked services. With service principal authentication, the service can, for example, create on-demand HDInsight clusters on your behalf.
To use service principal authentication, register an application entity in Microsoft Entra ID and grant it the Contributor role of the subscription or the resource group in which the HDInsight cluster is created.
The required values for service principal authentication are: Application ID, Application key, and Tenant ID.
Here's a breakdown of the properties you need to specify for service principal authentication:
- servicePrincipalId: the application ID of the registered application
- servicePrincipalKey: the application key (client secret)
- tenant: the tenant ID or domain name under which the application resides
Service principal authentication takes a little more setup, but it's a reliable option for connecting to your compute services and data stores.
Data Sources
Azure Data Factory Linked Services rely on various data sources to process and transform data. These data sources can be on-premises or in the cloud, such as Azure Blob Storage, Azure SQL Database, and Amazon S3.
Azure Data Factory supports a wide range of data sources, including relational databases like Azure SQL Database and Oracle, as well as NoSQL databases like Azure Cosmos DB and MongoDB. It can also connect to cloud-based storage services like Azure Blob Storage and Amazon S3.
In Azure Data Factory, linked services are used to connect to these data sources, allowing you to access and process data from various locations.
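For example, a linked service for Amazon S3 might be constructed like the sketch below; the credentials are placeholders, and the object would be registered with linked_services.create_or_update as shown earlier.

```python
from azure.mgmt.datafactory.models import (
    AmazonS3LinkedService, LinkedServiceResource, SecureString,
)

# Cloud sources outside Azure follow the same pattern -- only the connector
# type and its connection properties change.
s3_ls = LinkedServiceResource(properties=AmazonS3LinkedService(
    access_key_id="<aws-access-key-id>",
    secret_access_key=SecureString(value="<aws-secret-access-key>"),
))
```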
BYOC
BYOC allows users to register an existing computing environment as a linked service, which is managed by the user and used to execute activities.
This configuration is supported for the following compute environments:
- Azure HDInsight
- Azure Batch
- Azure Machine Learning
- Azure Data Lake Analytics
- Azure SQL DB, Azure Synapse Analytics, SQL Server
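For the HDInsight case, a BYOC linked service pointing at a cluster you already run might look like the following sketch; the cluster URI, credentials, and storage linked service name are assumptions, not values from the documentation.

```python
from azure.mgmt.datafactory.models import (
    HDInsightLinkedService, LinkedServiceReference, LinkedServiceResource, SecureString,
)

# Point the linked service at an existing cluster that you manage yourself.
byoc_hdi = LinkedServiceResource(properties=HDInsightLinkedService(
    cluster_uri="https://<your-cluster>.azurehdinsight.net",
    user_name="<cluster-user>",
    password=SecureString(value="<cluster-password>"),
    # Storage linked service used by the cluster (assumed to exist already).
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference",
        reference_name="AzureStorageLinkedService"),
))
```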
SQL Server
To connect to a SQL Server database, you create a SQL Server linked service. This linked service is used with the Stored Procedure Activity to invoke a stored procedure from a pipeline. See the SQL Server connector article for details.
You can create a SQL Server linked service using SQL authentication, which is the default option. In this case, you specify a username and password to connect to the database.
You can store the password in Azure Key Vault, but you can't reference just the username from Azure Key Vault. You'll either need to reference the entire connection string or just the password.
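As a sketch of that pattern, the following builds a SQL Server linked service whose password comes from a key vault secret; the "AzureKeyVaultLinkedService" reference, secret name, and connection string are placeholders.

```python
from azure.mgmt.datafactory.models import (
    AzureKeyVaultSecretReference, LinkedServiceReference,
    LinkedServiceResource, SqlServerLinkedService,
)

# SQL authentication with the password kept in Key Vault; the username stays
# in the connection string, since only the password can be referenced separately.
sql_server_ls = LinkedServiceResource(properties=SqlServerLinkedService(
    connection_string="Server=<server>;Database=<db>;User ID=<user>;",
    password=AzureKeyVaultSecretReference(
        store=LinkedServiceReference(type="LinkedServiceReference",
                                     reference_name="AzureKeyVaultLinkedService"),
        secret_name="sql-server-password",
    ),
))
```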
Frequently Asked Questions
What is the difference between integration runtime and linked services?
An integration runtime provides the compute environment for activities, while a linked service defines the target data store or compute service. Think of the integration runtime as the engine that runs activities, and linked services as the connection definitions that tell it which data stores and compute services to reach.
What is an Azure linked service?
An Azure linked service is a connection definition that links your dataset to an external data source, specifying how to access its data. It defines the connection details, allowing your data to be linked and integrated with external resources.
What type of authentication is Azure linked service?
Azure linked services support SQL Authentication, Managed Identity, and Service Principal authentication for secure connection establishment. Additionally, a key vault can be used to store SQL Authentication details, eliminating the need to enter credentials manually.