Azure Data Factory CI/CD with Azure DevOps Pipelines is a game-changer for data integration and processing. By integrating Azure Data Factory with DevOps Pipelines, you can automate the build, test, and deployment of your data pipelines.
This integration enables continuous integration and continuous deployment (CI/CD), allowing you to release changes to your data pipelines faster and with greater confidence.
Azure Pipelines, the CI/CD service within Azure DevOps, provides a scalable and secure way to run these automated builds and deployments, making the combination a powerful tool for data engineers and analysts.
Prerequisites
To get started with Azure Data Factory CI/CD, you'll need an Azure subscription in which you can create resource groups and resources, with the "Owner" role assignment.
Some basic experience creating Azure Data Factory pipelines is also beneficial, as it will help you navigate the process more smoothly.
"Owner" privileges are required to create the service principal that gives Azure DevOps access to the Data Factories in your resource groups. Without this, you won't be able to proceed.
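If you need to create that service principal yourself, here's a minimal sketch using the Azure CLI; the principal name and scope are placeholders, and an Azure DevOps service connection can also create the principal for you automatically:

```bash
# Create a service principal scoped to a resource group with the Contributor role.
# "adf-cicd-sp" and the <...> values are placeholders, not names from this article.
az ad sp create-for-rbac \
  --name "adf-cicd-sp" \
  --role "Contributor" \
  --scopes "/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>"
# The output (appId, password, tenant) is what you supply when creating
# the service connection in Azure DevOps.
```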
Azure Data Factory CI/CD also assumes a solid understanding of Azure and its core tools, so make sure you have a good grasp of those concepts before diving in.
Setting Up
To set up your Azure environment, create three resource groups and three data factories through the Azure Portal. Each resource group and data factory pair will represent one of the three environments: Dev, UAT, and Prod.
Resource Group names must be unique within your subscription, while Data Factory names must be unique across all of Azure. Both can be created through the Azure Portal.
If you're already familiar with the Azure portal, you can skip this step by running a PowerShell script from the GitHub Repository. Just make sure to update the variables within the script accordingly.
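As a rough idea of what such a script does (this is a sketch, not the repository's exact script), here's a minimal PowerShell version using the Az module; the initials, region, and naming scheme are assumptions based on this article:

```powershell
# Sketch: create a resource group and a data factory for each environment.
# $initials and $location are placeholders; adjust them for your setup.
$initials = "jd42"      # your initials plus a random number for uniqueness
$location = "eastus"

foreach ($env in @("dev", "uat", "prod")) {
    $rgName = "$initials-warehouse-$env-rg"
    $dfName = "$initials-warehouse-$env-df"

    # New-AzResourceGroup creates the group; Set-AzDataFactoryV2 creates the factory.
    New-AzResourceGroup -Name $rgName -Location $location
    Set-AzDataFactoryV2 -ResourceGroupName $rgName -Name $dfName -Location $location
}
```

Otherwise, continue with the portal steps below.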
At the top of the Azure home page, click "Create a resource" to begin the process.
UAT and Prod Resource Groups
Creating UAT and Prod Resource Groups is a crucial step in setting up your Azure Data Factory CI/CD pipeline. To do this, you'll need to create two new Resource Groups, one for UAT and one for Production.
Follow the same steps you used to create your first Resource Group to create the UAT Resource Group, then repeat the process for the PROD Resource Group. Once created, both will appear in your Azure portal.
With all three Resource Groups - Dev, UAT, and Prod - in place, you'll be able to manage your Azure Data Factory pipeline more effectively.
Data Factories
Creating Azure Data Factories is a crucial step in building a robust Azure Data Factory CI/CD pipeline. To start, you'll need to create Data Factories in each respective Resource Group, using a naming scheme that includes your initials.
The naming scheme for Data Factories is "Initials-warehouse-dev-df". However, since Azure Data Factory names must be unique across all of Azure, you may need to append a random number to your initials to ensure uniqueness, for example "jd42-warehouse-dev-df".
Your Project
You've created an organization in DevOps and a project for Azure Data Factory. This project will contain your repository for Azure Data Factory.
The project is named "Azure Data Factory" and is set to private visibility. You've also selected Git for version control and Basic for work item process.
Microsoft has extensive documentation on different types of version controls and work item processes.
You should explore the options within the project, particularly the "Repos" and "Pipelines" services visible in the left menu.
Now that your project is set up, you can start creating pipelines and connecting your Azure Data Factory to your Git repository.
The YAML
The YAML file is a crucial component of Azure Pipelines: it contains the configuration for your pipeline and is where you declare the variables, stages, and tasks that will be executed during the pipeline run.
A Pipeline file starts with a variables block, where you declare variables that will be used throughout the pipeline. In this example, variables are stored in Azure DevOps variable groups, which correspond to different environments like dev and pre-prod.
Four variables live in those variable groups: azure_subscription_id, azure_service_connection_name, resource_group_name, and azure_data_factory_name. Two additional variables are declared in the pipeline itself: adf_code_path, the path in the repo where your ADF code is stored, and adf_package_file_path, the path in the repo where the package.json file lives.
Before creating the pipeline, you'll need to create a package.json file, which contains the details needed to obtain the ADFUtilities package. This file lives in a 'build' folder, and npm uses the code block inside it to find and run the ADFUtilities package.
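The file typically looks like the following, based on Microsoft's documented setup for the @microsoft/azure-data-factory-utilities npm package (pin whatever version is current for you):

```json
{
    "scripts": {
        "build": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index"
    },
    "dependencies": {
        "@microsoft/azure-data-factory-utilities": "^1.0.0"
    }
}
```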
Here are the key variables you'll need to declare in your Pipeline file:
- azure_subscription_id
- azure_service_connection_name
- resource_group_name
- azure_data_factory_name
- adf_code_path
- adf_package_file_path
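Putting that together, the top of the pipeline YAML might look like this sketch; the variable group name, branch name, and folder paths are illustrative assumptions, not values from the article:

```yaml
trigger:
  branches:
    include:
      - main                        # assumed collaboration branch name

variables:
  - group: adf-dev-variables        # assumed variable group holding the four values above
  - name: adf_code_path
    value: 'adf'                    # assumed repo folder containing the ADF JSON
  - name: adf_package_file_path
    value: 'build'                  # assumed folder containing package.json
```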
ADF Modes
ADF Modes are a crucial aspect of working with Data Factories. ADF has two modes: Live Mode and Git Mode.
In Git Mode, you'll need to select two branches: Collaboration branch and Publish branch. The Collaboration branch is where all feature branches are merged, while the Publish branch is where changes, including auto-generated ARM templates, get published.
The Publish branch is automatically created as 'adf_publish' by default. You'll also find a corresponding 'adf' folder in your repository, which contains all the ADF resources.
Here are the two modes in a nutshell:
- Live Mode: you author and publish directly against the live Data Factory service, with no source control behind your changes.
- Git Mode: you author in a Git repository; feature branches merge into the Collaboration branch, and publishing writes the auto-generated ARM templates to the Publish branch (adf_publish by default).
Adding Custom Parameters in ARM Templates
Adding custom parameters in ARM templates can be a bit tricky, but don't worry, I've got you covered. You can't override a parameter if it's not present in the ARM template parameters file.
To get the parameter in the ARM template parameters file, you need to edit the parameter configuration in the ADF Portal. This involves navigating to the Manage tab and clicking on Edit parameter configuration to load the JSON file.
You'll need to go to the specific section where your parameter is located, such as the Linked Service section. From there, you can add the parameter under the typeProperties section.
Here are the steps in more detail:
- Navigate to ADF Portal and go to the Manage tab.
- Under the ARM Template section, click Edit parameter configuration to load the JSON file.
- Go to the required section, for example, “Microsoft.DataFactory/factories/linkedServices”.
- Under typeProperties, add the parameter you want to appear in the ARM template parameters file (see the sketch after this list).
- Click OK, which generates a file called “arm-template-parameters-definition.json” in the repo where the ADF code is present.
- Run the Pipeline again, and you will see the new parameter in the template parameters file.
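For orientation, the linked-services portion of “arm-template-parameters-definition.json” might look like this after the edit; the property names shown ("connectionString", "url") are illustrative and depend on your linked service types:

```json
{
    "Microsoft.DataFactory/factories/linkedServices": {
        "*": {
            "properties": {
                "typeProperties": {
                    "connectionString": "|:-connectionString:secureString",
                    "url": "="
                }
            }
        }
    }
}
```

In this syntax, "=" keeps the current value as the parameter's default, while the "|" prefix flags a secret, such as a connection string, that should surface as a secureString parameter.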
This process is a bit tedious, but it's worth it in the end. The new parameter will be included in the ARM template parameters file, making it easier to deploy to new environments.
Create a Data Factory
To create a Data Factory, you'll need to follow these steps. First, ensure that Git is enabled while creating the new Data Factory, as described in the prerequisites section.
Navigate to the newly created DEV Data Factory in the desired resource group, and click Author & Monitor to launch the Data Factory authoring UI.
To create a pipeline, click the pencil icon, then click the plus icon, and finally click Pipeline from the list of options.
Here's a list of the necessary details to enter when creating a Data Factory:
- GIT account details
- Repo details
- Resource group details
Note that Azure Data Factory names must be unique across all of Azure, so you might need to append a random number to your initials for the name to be unique.
You can also create a feature branch in ADF Portal by following the development flow for ADF:
1. Create a feature branch from your collaboration branch.
2. Develop and manually test your changes in the feature branch.
3. Create a PR from the feature branch to the collaboration branch in your Git repository (Azure Repos in this setup).
4. Once the PR is merged, the changes appear in the ADF Dev environment.
To publish your changes, click the Publish button, and ADF will create a new branch called adf_publish inside your repository and publish the changes to ADF directly.
CI/CD
CI/CD is a crucial part of Azure Data Factory (ADF) development, ensuring that changes are deployed smoothly and efficiently. The recommended CI/CD flow for ADF involves creating a pipeline in Azure DevOps that builds and deploys ADF resources.
Each user makes changes in their private branches, and then creates a pull request to merge the changes into the master branch. The Azure DevOps pipeline build is triggered every time a new commit is made to master, validating the resources and generating an ARM template as an artifact if validation succeeds.
The DevOps Release pipeline is configured to create a new release and deploy the ARM template each time a new build is available. This ensures that changes are automatically deployed to the development environment.
Here's a step-by-step breakdown of the CI/CD process, with a sketch of a deployment stage after the list:
- Create a pipeline in Azure DevOps that builds and deploys ADF resources.
- Each user makes changes in their private branches and creates a pull request to merge the changes into the master branch.
- The Azure DevOps pipeline build is triggered every time a new commit is made to master.
- The pipeline validates the resources and generates an ARM template as an artifact if validation succeeds.
- The DevOps Release pipeline is configured to create a new release and deploy the ARM template each time a new build is available.
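To make the deployment half concrete, here's a hedged sketch of a deployment stage using the built-in AzureResourceManagerTemplateDeployment task. The stage and environment names, artifact path, and region are assumptions; the template file names match what the ADFUtilities export step generates:

```yaml
- stage: DeployUAT
  jobs:
    - deployment: DeployADF
      environment: uat                          # assumed DevOps environment name
      pool:
        vmImage: ubuntu-latest
      strategy:
        runOnce:
          deploy:
            steps:
              - task: AzureResourceManagerTemplateDeployment@3
                inputs:
                  deploymentScope: 'Resource Group'
                  azureResourceManagerConnection: '$(azure_service_connection_name)'
                  subscriptionId: '$(azure_subscription_id)'
                  resourceGroupName: '$(resource_group_name)'
                  location: 'East US'           # assumed region
                  templateLocation: 'Linked artifact'
                  # Artifact path below is an assumption about how the build publishes it.
                  csmFile: '$(Pipeline.Workspace)/adf-artifact/ARMTemplateForFactory.json'
                  csmParametersFile: '$(Pipeline.Workspace)/adf-artifact/ARMTemplateParametersForFactory.json'
                  overrideParameters: '-factoryName $(azure_data_factory_name)'
                  deploymentMode: 'Incremental'
```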
By following this CI/CD flow, developers can ensure that changes are deployed smoothly and efficiently, reducing the risk of errors and improving overall productivity.
Sources
- GitHub Repository (github.com)
- Azure Data Factory documentation (microsoft.com)
- Quickstart: Create an Azure data factory using the Azure Data Factory UI (microsoft.com)
- Management Hub in Azure Data Factory (microsoft.com)
- azure.datafactory.tools (github.com)
- Microsoft docs (microsoft.com)
- npm package (npmjs.com)
- Git integration (microsoft.com)
- Full YAML Pipeline (github.com)
- Azure Data Factory – CI/CD [Part 1] (linkedin.com)