Setting up an Azure environment is a crucial step in deploying data pipelines with Azure Data Factory (ADF). Azure provides a scalable and secure platform for data integration, and understanding its core components is essential for ADF configuration.
Azure environments can be set up in various regions, with each region having its own set of services and pricing. The Azure portal is the primary interface for managing Azure resources.
To start setting up an Azure environment, you need to create an Azure account, which can be done for free. You can also use an existing Azure subscription if you already have one.
Azure Data Factory is a cloud-based data integration service that allows you to create, schedule, and manage your data pipelines. ADF is built on top of Azure's scalable and secure infrastructure.
Azure Environment Setup
To set up an Azure environment, start by creating a subscription, which can be done through the Azure portal or by contacting a Microsoft representative.
You can choose from various Azure services, including Azure Storage, Azure Databricks, and Azure Active Directory.
To create a resource group, sign in to the Azure portal, search for and select Resource groups, select Create, then choose a subscription, enter a name, and pick a region.
Azure provides a free tier for many services, allowing you to get started with minimal cost.
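If you prefer scripting to clicking through the portal, the same setup can be done with the Az PowerShell module. The sketch below assumes the Az module is installed; the resource group name, data factory name, and region are placeholders.

```powershell
# Sketch: create a resource group and an empty Data Factory with the Az module.
# All names and the region are placeholders - replace them with your own.
Connect-AzAccount

New-AzResourceGroup -Name 'rg-adf-demo' -Location 'westeurope'

# Set-AzDataFactoryV2 creates the factory when it doesn't exist yet
Set-AzDataFactoryV2 -ResourceGroupName 'rg-adf-demo' `
                    -Name 'adf-demo-dev' `
                    -Location 'westeurope'
```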
The Basics
Azure Data Factory is a powerful tool for data integration, and understanding its basics is essential for setting up a robust environment.
A pipeline in Azure Data Factory is a grouping of different activities, which can include data copying, transformation using Data Flows, and various computations on Azure.
To specify when a data pipeline runs, you can use triggers in Azure Data Factory. There are different types of triggers supported, including schedule triggers, which let you specify a time of day and time zone, and tumbling window triggers, which fire based on a series of fixed-size, non-overlapping, contiguous time intervals.
A key concept in Azure Data Factory is the Integration Runtime (IR), which is the compute infrastructure used for performing various data integration tasks, such as data movement, data flow, and running SSIS packages.
Here are the different types of triggers supported in Azure Data Factory:
- Schedule trigger: runs a pipeline on a wall-clock schedule.
- Tumbling window trigger: fires on fixed-size, non-overlapping, contiguous time intervals.
- Storage event trigger: fires in response to blob creation or deletion events.
- Custom event trigger: fires in response to custom events published to Event Grid.
How Incremental Deployment Works
The incremental deployment feature comes from the azure.datafactory.tools deployment module (linked in the Sources below). It keeps a Deployment State in a JSON file and reads and writes that file to and from Azure Blob Storage.
The JSON file contains the previous Deployment State, which is read from storage when the mode is ON. This allows the process to identify which objects are unchanged and exclude them from deployment.
The process calculates MD5 hashes of the deployed objects and merges them into the previous Deployment State, so that only changed objects are redeployed the next time.
To use this feature, you'll need to set two option parameters: IncrementalDeployment = True and IncrementalDeploymentStorageUri = https://sqlplayer2020.file.core.windows.net/adftools (example).
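Assuming you deploy with the azure.datafactory.tools PowerShell module, enabling the feature might look like the sketch below; the resource group, factory name, and folder are placeholders, and the storage URI is the example from above.

```powershell
# Sketch: enable incremental deployment via the publish options (azure.datafactory.tools).
# Resource group, factory name and root folder are placeholders.
$opt = New-AdfPublishOption
$opt.IncrementalDeployment = $true
$opt.IncrementalDeploymentStorageUri = 'https://sqlplayer2020.file.core.windows.net/adftools'

Publish-AdfV2FromJson -RootFolder 'C:\adf\MyAdfCode' `
                      -ResourceGroupName 'rg-adf-demo' `
                      -DataFactoryName 'adf-demo-dev' `
                      -Location 'westeurope' `
                      -Option $opt
```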
ADF Setup
To set up Azure Data Factory (ADF) for a simple copy pipeline, you first configure your source: create the source data in Azure Blob Storage, pick an authentication method for the storage account, and define a source dataset backed by a linked service. The Configure Source section below walks through each of these steps.
Configure Source
To configure your source in ADF, you'll need to create a source blob. Launch Notepad, copy the following three lines, and save them as an emp.txt file on your disk:

FirstName,LastName
John,Doe
Jane,Doe
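If you'd rather script the upload than use the portal, here's a minimal sketch with the Az.Storage cmdlets; the resource group and storage account names are placeholders, while the adftutorial container and input folder match the walkthrough.

```powershell
# Sketch: create the adftutorial container and upload emp.txt to the input folder.
# Resource group and storage account names are placeholders.
$ctx = (Get-AzStorageAccount -ResourceGroupName 'rg-adf-demo' -Name 'stadfdemo01').Context

New-AzStorageContainer -Name 'adftutorial' -Context $ctx -ErrorAction SilentlyContinue
Set-AzStorageBlobContent -File '.\emp.txt' -Container 'adftutorial' `
                         -Blob 'input/emp.txt' -Context $ctx
```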
You can choose from several authentication methods for your source data store, including Account key, SAS URI, Service Principal, and Managed Identity. It's recommended to use an Azure Key Vault to store secrets securely.
To create a source dataset, go to the Source tab and select + New. Then, select Azure Blob Storage as the source data type.
You'll need to select the format type of your data and enter a name for your dataset, such as SourceBlobDataset. Make sure to select the checkbox for First row as header.
Here are the steps to create a linked service:
- Under the Linked service text box, select + New.
- Enter a name for your linked service, such as AzureStorageLinkedService.
- Select your storage account from the Storage account name list.
- Test the connection and select Create to deploy the linked service.
After creating the linked service, navigate back to the Set properties page and select the emp.txt file from the adftutorial/input folder.
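For reference, the linked service and dataset created through the UI end up as JSON definitions in your factory. The sketch below is roughly what they look like; the connection string is a placeholder, and in practice you would keep the account key in Key Vault as suggested above.

```json
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}
```

And the delimited-text dataset pointing at the uploaded file:

```json
{
  "name": "SourceBlobDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "AzureStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "adftutorial",
        "folderPath": "input",
        "fileName": "emp.txt"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```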
Exist Check
Before deploying, the tool checks whether the target ADF instance already exists. This is a crucial step in the ADF setup process.
If it does exist and IncrementalDeployment is ON, the tool loads the latest deployment state from storage.
If it doesn't exist, a new instance is created; you must have the appropriate permission to create it, and the Location parameter is required for this action.
The tool then reads the ADF objects from the JSON files in your repository, as indicated by a line in the deployment log.
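Expressed with the Az.DataFactory cmdlets, the existence check looks roughly like this sketch; the resource group and factory names are placeholders.

```powershell
# Sketch: check whether the target Data Factory instance already exists.
# Names are placeholders.
$adf = Get-AzDataFactoryV2 -ResourceGroupName 'rg-adf-demo' `
                           -Name 'adf-demo-dev' -ErrorAction SilentlyContinue

if ($null -eq $adf) {
    Write-Host 'ADF not found - a new instance will be created (Location is required for this).'
} else {
    Write-Host "Found existing factory: $($adf.DataFactoryName)"
}
```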
Replacing All Properties
Replacing environment-related properties is a crucial step in the CI & CD process, and it is executed only when the [Stage] parameter has been provided.
This step ensures that each environment runs the same code (the ADF JSON files kept in the code repository), with only selected properties differing between environments.
The process reads a flat configuration file with all required values per environment, containing key-value pairs for properties such as Data Factory name, Azure Key Vault URL, and Linked Service properties.
Here's an illustrative example of what such a config file might look like (the object names and values are placeholders):
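```csv
type,name,path,value
linkedService,LS_AzureKeyVault,typeProperties.baseUrl,"https://contoso-kv-uat.vault.azure.net/"
linkedService,LS_BlobStorage,typeProperties.serviceEndpoint,"https://contosostorageuat.blob.core.windows.net/"
```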
The config file includes values for the type, name, path, and value of each property, making it easy to manage and update environment-specific settings.
Trigger Manually
To manually trigger a pipeline, select Trigger on the toolbar and then choose Trigger Now.
You'll see a Pipeline Run page, where you can select OK to proceed.
On the Monitor tab, you'll find a pipeline run triggered by a manual trigger. You can view activity details and rerun the pipeline using links under the PIPELINE NAME column.
To see activity runs associated with the pipeline run, select the CopyPipeline link under the PIPELINE NAME column.
For details about the copy operation, select the Details link (eyeglasses icon) under the ACTIVITY NAME column.
To refresh the view, select Refresh.
Verify that two more rows are added to the emp table in the database.
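You can also trigger the run from PowerShell rather than the toolbar. This sketch uses the Az.DataFactory cmdlets with placeholder resource names and the CopyPipeline name from the walkthrough.

```powershell
# Sketch: trigger the pipeline manually and check the status of the run.
# Resource group and factory names are placeholders.
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName 'rg-adf-demo' `
                                        -DataFactoryName 'adf-demo-dev' `
                                        -PipelineName 'CopyPipeline'

Get-AzDataFactoryV2PipelineRun -ResourceGroupName 'rg-adf-demo' `
                               -DataFactoryName 'adf-demo-dev' `
                               -PipelineRunId $runId
```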
Scheduled Trigger
To set up a scheduled trigger in Azure Data Factory, navigate to the Author tab and click on the pipeline you want to trigger. Then, select New/Edit from the Trigger dropdown menu.
You can create a schedule trigger by selecting + New in the Choose trigger area. In the New Trigger window, enter a name for your trigger, such as "RunEveryMinute", and update the Start date to a time in the past. This will allow the trigger to start taking effect once the changes are published.
Under Time zone, select the drop-down list to choose your time zone. Set the Recurrence to Every 1 Minute(s) to run the pipeline every minute. You can also specify an end date by selecting the checkbox and updating the End On part to a few minutes past the current datetime.
A cost is associated with each pipeline run, so be sure to set the end date appropriately. On the Edit trigger page, review the warning and select Save to save your changes.
To publish the changes, click Publish all on the pipeline. Then, go to the Monitor tab to see the triggered pipeline runs. You can switch from the Pipeline Runs view to the Trigger Runs view by selecting Trigger Runs on the left side of the window.
Here's a step-by-step summary of the process:
- Navigate to the Author tab and click on the pipeline you want to trigger.
- Select New/Edit from the Trigger dropdown menu.
- Create a schedule trigger by selecting + New in the Choose trigger area.
- Enter a name for your trigger and update the Start date to a time in the past.
- Set the Recurrence to Every 1 Minute(s) and specify an end date.
- Review the warning and select Save to save your changes.
- Click Publish all on the pipeline.
- Go to the Monitor tab to see the triggered pipeline runs.
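Behind the UI, the trigger is stored as a JSON object next to your pipelines. This is a sketch of roughly what a one-minute schedule trigger named RunEveryMinute attached to CopyPipeline looks like; the start and end times are placeholders.

```json
{
  "name": "RunEveryMinute",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Minute",
        "interval": 1,
        "startTime": "2024-01-01T09:00:00Z",
        "endTime": "2024-01-01T09:30:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "type": "PipelineReference",
          "referenceName": "CopyPipeline"
        }
      }
    ]
  }
}
```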
Using Tokens as Dynamic Values
You can use token syntax to define expressions that get replaced by values after reading a CSV config file. This is especially useful when working with Azure DevOps pipeline variables.
PowerShell expressions for environment variables are supported, which means you can use $Env:VARIABLE or $($Env:VARIABLE) to reference environment variables.
Assuming you have an environment variable named USERDOMAIN with the value CONTOSO, you can reference it as $Env:USERDOMAIN inside a value in your config file.
When the file is read from disk, the token is replaced with CONTOSO, which is a big time-saver.
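For example, a hypothetical config line using the token would be rewritten like this when the file is loaded (the linked service and account names are made up):

```
Before (as stored in the config file):
linkedService,LS_SqlServer,typeProperties.userName,"$Env:USERDOMAIN\svc_adf"

After reading from disk (with USERDOMAIN = CONTOSO):
linkedService,LS_SqlServer,typeProperties.userName,"CONTOSO\svc_adf"
```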
By leveraging variables defined in an Azure DevOps pipeline, you can replace tokens without needing an extra task. This works because all of the pipeline's variables are exposed as environment variables on the agent.
How ADF Works
Azure Data Factory (ADF) is a powerful tool for managing and processing data. It can collect data from various sources, including cloud and on-premises data stores, and move it to a centralized location for processing.
ADF offers a range of activities for transforming and enriching data, including mapping data flows and external activities for executing transformations on compute services like Spark and HDInsight Hadoop.
A pipeline is a logical grouping of activities that together perform a unit of work.
Here are the main steps involved in ADF's data processing workflow:
- Collect data from various sources, including cloud and on-premises data stores.
- Move the data to a centralized location for processing.
- Transform and enrich the data using activities like mapping data flows and external activities.
- Load the refined data into a data warehouse or other destination.
ADF has built-in support for pipeline monitoring: through Azure Monitor you can track pipeline performance, set up alerts for failures or delays, and view detailed logs.
Deployment and Management
In Azure Data Factory, deployment is a crucial step that's handled efficiently by a smart mechanism that publishes all objects in the right order, so developers don't have to worry about dependencies between objects causing deployment failures.
The deployment process can be monitored through the log, where you'll see lines indicating the deployment of all ADF objects and updating the deployment state.
Beyond deployment, Azure Data Factory offers built-in monitoring and management capabilities, which are covered in the Monitoring and Management section below.
Selected Folder Only
To publish objects from a specific folder in your ADF, you'll need to follow a three-step process.
First, load all ADF objects from your local folder. This will give you a complete list of objects to work with.
Next, execute a function that returns a list of objects located in the selected folder in your ADF. This list will contain the objects you want to publish.
Finally, add the returned list of objects to the "Includes" in Publish Option.
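The end result of those three steps is a set of "type.name" entries in the Includes collection of the publish options. The sketch below shows that final pattern with the azure.datafactory.tools module; the object, folder, and resource names are hypothetical, and the helper that lists a folder's objects is omitted here.

```powershell
# Sketch: publish only selected objects by adding include patterns to the publish options.
# Object and resource names are hypothetical; build the list by hand or from a folder helper.
$opt = New-AdfPublishOption
$opt.Includes.Add('pipeline.PL_Ingest_Sales', '')
$opt.Includes.Add('dataset.DS_Sales_Blob', '')

Publish-AdfV2FromJson -RootFolder 'C:\adf\MyAdfCode' `
                      -ResourceGroupName 'rg-adf-demo' `
                      -DataFactoryName 'adf-demo-dev' `
                      -Location 'westeurope' `
                      -Option $opt
```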
Monitoring and Management
Monitoring and management is a crucial aspect of working with Azure Data Factory (ADF). You can use Azure Monitor to track pipeline performance, set up alerts for failures or delays, and view detailed logs, which is especially useful if you're working with complex pipelines or have multiple pipelines running concurrently.
To monitor a Copy Activity run, go to your ADF Author & Monitor UI and click on the Monitor tab page. From there, you can view a list of pipeline runs and click on a pipeline name to access the list of activity runs in the pipeline run.
Here's a summary of the key features for monitoring a Copy Activity run:
- Associate a pipeline with a trigger
- Monitor pipeline runs natively in the ADF user experience
- View a list of pipeline runs on the Monitor tab page
- Click on a pipeline name to access the list of activity runs in the pipeline run
By using these features, you can get a better understanding of your ADF pipeline's performance and make data-driven decisions to improve its efficiency.
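If you prefer scripting these checks, here is a sketch with the Az.DataFactory cmdlets; the resource names are placeholders.

```powershell
# Sketch: list recent pipeline runs, then drill into the activity runs of the latest one.
# Resource group and factory names are placeholders.
$runs = Get-AzDataFactoryV2PipelineRun -ResourceGroupName 'rg-adf-demo' `
                                       -DataFactoryName 'adf-demo-dev' `
                                       -LastUpdatedAfter (Get-Date).AddDays(-1) `
                                       -LastUpdatedBefore (Get-Date)

$runs | Select-Object PipelineName, RunId, Status, RunStart

Get-AzDataFactoryV2ActivityRun -ResourceGroupName 'rg-adf-demo' `
                               -DataFactoryName 'adf-demo-dev' `
                               -PipelineRunId ($runs | Select-Object -First 1).RunId `
                               -RunStartedAfter (Get-Date).AddDays(-1) `
                               -RunStartedBefore (Get-Date)
```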
Save Deployment State
The Save Deployment State step is a crucial part of the deployment process.
After the deployment, the tool prepares a list of deployed objects and their hashes using the MD5 algorithm.
This array is then wrapped up in JSON format and stored as a blob file with the name {ADF-Name}.adftools_deployment_state.json in the provided Storage.
This deployment state speeds up future deployments by identifying objects that have been changed since the last time.
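To get a feel for those fingerprints, the hash of a single object file can be computed like this; the exact input the tool hashes may differ, so treat it as illustrative only, and the file path is a placeholder.

```powershell
# Illustrative only: MD5 hash of one ADF object's JSON file.
Get-FileHash -Path '.\pipeline\PL_CopyData.json' -Algorithm MD5 | Select-Object Hash, Path
```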
If IncrementalDeployment is set to false in the publish options, this step is skipped.
You'll see a warning in the console log if IncrementalDeployment is true and IncrementalDeploymentStorageUri is empty.
Sources
- https://learn.microsoft.com/en-us/azure/data-factory/tutorial-copy-data-portal
- https://github.com/Azure-Player/azure.datafactory.tools
- https://www.mssqltips.com/sqlservertip/6510/using-azure-devops-ci-cd-to-deploy-azure-data-factory-environments/
- https://www.pluralsight.com/resources/blog/cloud/what-is-azure-data-factory-a-beginners-guide-to-adf
- https://k21academy.com/microsoft-azure/data-engineer/azure-data-factory/