Azure Data Factory is a cloud-based data integration service that allows you to create, schedule, and manage data pipelines. It enables you to combine data from various sources, transform it, and load it into a data warehouse or other destinations.
Data Factory supports a wide range of data sources, including Azure Blob Storage, Azure SQL Database, and on-premises SQL Server databases. You can also use it to integrate with other Azure services, such as Azure Databricks and Azure Synapse Analytics.
One of the key benefits of Azure Data Factory is its ability to handle large volumes of data and scale automatically to meet changing demands. This makes it an ideal choice for big data and analytics workloads.
With Data Factory, you can create complex data pipelines using a visual interface or code, giving you flexibility and control over your data integration processes.
Get Started
Getting started with Azure Data Factory is easier than you might think, and the Azure Docs offer plenty of resources to help you on your way.
Azure Data Factory has its own YouTube Channel where you can learn from the technical community. The channel is a great place to find tutorials and guides.
The Azure Data Factory Learning Path is part of Azure Learn, which offers a structured way to learn about Azure Data Factory. This learning path is a great resource for beginners.
ACG's Developing a Pipeline in Azure Data Factory Hands-on Lab is another great resource to get hands-on experience with Azure Data Factory.
Architecture
Azure Data Factory's mapping data flows run on a managed Apache Spark service that takes care of code generation and cluster maintenance, allowing data engineers to focus on transformation logic rather than infrastructure. This simplifies the process of building ETL and ELT processes, making it easier to transform data.
With Azure Data Factory, you can prepare data and construct processes without writing code, thanks to its code-free ETL capabilities. This feature, combined with intelligent intent-driven mapping, automates copy activities and speeds up the transformation process.
Azure Data Factory provides more than 90 built-in connectors to acquire data from various sources, including big data sources, enterprise data warehouses, SaaS apps, and Azure data services. This means you can easily ingest data from diverse and multiple sources, without the need for multiple solutions.
Here are some of the built-in connectors available in Azure Data Factory:
- Amazon Redshift
- Google BigQuery
- Hadoop Distributed File System (HDFS)
- Oracle Exadata
- Teradata
- Salesforce
- Marketo
- ServiceNow
- Azure data services
By using Azure Data Factory, you can also integrate and transform data in the familiar Data Factory experience within Azure Synapse Pipelines. This allows you to work with data flows within the Azure Synapse studio, transforming and analyzing data code-free.
Pipeline Creation
To create a pipeline in Azure Data Factory, you can navigate to the Author tab in Data Factory Studio and click the plus sign to create a new pipeline. You can also create a pipeline in Synapse Studio by navigating to the Integrate tab and clicking the plus sign. Either way, the pipeline editor opens, and its Activities pane lists all available activities that can be used within the pipeline.
The pipeline editor is divided into four main areas: the Activities pane, the editor canvas, the pipeline configurations pane, and the pipeline properties pane. The Activities pane is where you find and add individual activities, such as Copy, Append Variable, and Execute Pipeline.
Here are the main areas of the pipeline editor:
- Activities pane: where you browse activities and drag them onto the canvas
- Editor canvas: where activities appear and are connected once added to the pipeline
- Pipeline configurations pane: where you configure parameters, variables, general settings (such as concurrency), and view output
- Pipeline properties pane: where you configure the pipeline name, optional description, and annotations
Features
As you start creating your pipeline, you'll want to know about the features that make Azure Data Factory a powerful tool. Data Compression lets the Copy activity compress data and write it to the destination in compressed form, which optimizes bandwidth usage during data copying.
This means you can transfer large amounts of data more efficiently, saving you time and resources.
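As a rough sketch, compression is typically configured on the dataset that the Copy activity reads or writes. The dataset and linked service names below are hypothetical, and the exact properties depend on the dataset format, but a gzip-compressed delimited-text dataset in Blob Storage might look like this:

```json
{
  "name": "CompressedCsvDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "sales",
        "fileName": "daily-sales.csv.gz"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true,
      "compressionCodec": "gzip",
      "compressionLevel": "Optimal"
    }
  }
}
```

Copying to or from a dataset defined this way tells the Copy activity to handle the data in compressed form, which is where the bandwidth savings come from.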
Linked services are used to represent a data store, such as a SQL Server database or Azure blob storage account. For a list of supported data stores, see the copy activity article.
You can also use linked services to represent a compute resource that hosts the execution of an activity, like an HDInsight Hadoop cluster.
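For illustration, here is the shape of a linked service definition for the hypothetical Azure Blob Storage account referenced in the dataset sketch above; the name and connection string are placeholders. A compute linked service (for example, an HDInsight cluster) follows the same pattern with a different type and typeProperties:

```json
{
  "name": "AzureBlobStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage-account>;AccountKey=<account-key>"
    }
  }
}
```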
Data Preview and Validation is another feature that helps you ensure data is copied correctly. This feature provides tools for previewing and validating data during the Data Copy activity.
Custom Event Triggers allow you to automate data processing by executing a certain action when a certain event occurs.
Customizable Data Flows allow you to create custom actions or steps for data processing. This feature gives you flexibility and control over your data pipeline.
Integrated Security features, such as Entra ID integration and role-based access control, help protect your data by controlling access to dataflows.
Triggers
Triggers are the heartbeat of your pipelines: they represent the unit of processing that determines when a pipeline execution needs to be kicked off. They come in several types for different kinds of events, including schedule triggers, tumbling window triggers, event-based triggers (storage events and custom events), and manual, on-demand runs.
A schedule trigger runs pipelines on a wall-clock schedule, while a manual trigger runs them on demand. To have a trigger kick off a pipeline run, you must include a pipeline reference to the particular pipeline in the trigger definition.
Pipelines and triggers have a many-to-many relationship (except for tumbling window triggers, which reference a single pipeline): multiple triggers can kick off a single pipeline, and the same trigger can kick off multiple pipelines. Once a trigger is defined, you must start it before it begins triggering the pipeline.
Custom event triggers let you automate data processing by executing a pipeline automatically when a specific event occurs, such as an event published to an Azure Event Grid custom topic.
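As a rough sketch of what a custom event trigger definition looks like, the trigger below subscribes to a hypothetical Event Grid custom topic and starts a (likewise hypothetical) pipeline whenever a matching event arrives; the scope, event name, subject filter, and pipeline name are all placeholders:

```json
{
  "name": "NewFileUploadedTrigger",
  "properties": {
    "type": "CustomEventsTrigger",
    "typeProperties": {
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.EventGrid/topics/<topic-name>",
      "subjectBeginsWith": "sales/",
      "events": ["FileUploaded"]
    },
    "pipelines": [
      {
        "pipelineReference": {
          "type": "PipelineReference",
          "referenceName": "ProcessNewFilePipeline"
        }
      }
    ]
  }
}
```

Like any other trigger, it only begins firing once you start it.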
Creating a Pipeline with UI
Creating a pipeline with UI is a straightforward process. In Data Factory Studio, you can create a new pipeline by navigating to the Author tab, clicking the plus sign, and choosing Pipeline from the menu, then Pipeline again from the submenu.
Data Factory then opens the pipeline editor canvas, where activities appear as they're added to the pipeline. The pipeline configurations pane is also essential, as it includes parameters, variables, general settings, and output. Additionally, the pipeline properties pane allows you to configure the pipeline name, optional description, and annotations.
Here are the key steps to create a pipeline in Data Factory Studio:
- Navigate to the Author tab and click the plus sign.
- Choose Pipeline from the menu, then Pipeline again from the submenu.
- Access the pipeline editor canvas and pipeline configurations pane.
- Configure the pipeline properties pane with the pipeline name, description, and annotations.
In Synapse Studio, the process is similar. You can create a new pipeline by navigating to the Integrate tab, clicking the plus sign, and choosing Pipeline from the menu. Synapse will then display the pipeline editor, where you can find the pipeline configurations pane and pipeline properties pane.
Creating a pipeline with the UI is a user-friendly process that lets you visually design and configure your pipeline. By following these steps, you can create a pipeline that meets your needs and automates your data processing tasks.
Using SSIS
You can deploy, manage, and run SSIS packages in managed Azure SSIS Integration Runtimes through Azure Data Factory. This allows you to leverage the strengths of both tools.
Azure Data Factory offers the ability to run SSIS packages, making it a great option for hybrid on-premises and Azure solutions. In fact, it's a recommended choice for such scenarios.
If you're working with big data, SSIS isn't the best tool for the job. It's limited to small to medium-sized data sets and can't handle big data volumes.
In short, SSIS is a good fit for smaller, largely on-premises workloads, while Azure Data Factory is the better choice for cloud-scale and big data scenarios.
In addition, SSIS is limited to running once per minute, which might not be suitable for close to real-time data processing.
Pipeline Management
To create a new pipeline in Azure Data Factory, you can navigate to the Author tab in Data Factory Studio and click the plus sign to create a new pipeline. Alternatively, you can go to the Integrate tab in Synapse Studio and click the plus sign to create a pipeline.
The pipeline editor canvas is where activities will appear when added to the pipeline. You can also configure pipeline properties, such as the pipeline name, description, and annotations, in the pipeline properties pane.
Here are the key components of a pipeline:
- Name: The name of the pipeline, which should represent the action it performs.
- Description: A text description of what the pipeline is used for.
- Activities: The activities section can have one or more activities defined within it.
- Parameters: The parameters section can have one or more parameters defined within the pipeline, making it flexible for reuse.
Pipeline runs are instances of pipeline execution, typically instantiated by passing arguments to the parameters defined in the pipeline.
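Putting those components together, a minimal pipeline definition in JSON might look like the sketch below. The pipeline, dataset, and parameter names are illustrative, not part of any existing factory:

```json
{
  "name": "MyCopyPipeline",
  "properties": {
    "description": "Copy daily sales files from Blob Storage into Azure SQL Database",
    "parameters": {
      "sourceFolder": {
        "type": "String",
        "defaultValue": "sales/incoming"
      }
    },
    "activities": [
      {
        "name": "CopySalesData",
        "type": "Copy",
        "inputs": [
          { "referenceName": "CompressedCsvDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "SalesSqlTableDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ],
    "annotations": ["sales", "daily-load"]
  }
}
```

A pipeline run would then supply a value for sourceFolder (or fall back to the default) when the pipeline is executed.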
Scheduling Pipelines
Scheduling pipelines is an essential aspect of pipeline management. You can schedule pipelines using triggers, which are essentially the engine that drives pipeline execution. The two kinds you'll use most often are schedule triggers, which run pipelines on a wall-clock schedule, and manual triggers, which run pipelines on demand.
To have your trigger kick off a pipeline run, you must include a pipeline reference to the particular pipeline in the trigger definition. This works because pipelines and triggers have a many-to-many relationship: multiple triggers can kick off a single pipeline, and the same trigger can kick off multiple pipelines.
For example, if you have a schedule trigger called "Trigger A" that you want to kick off your pipeline "MyCopyPipeline", you would define the trigger as follows (a JSON sketch of this trigger appears after the steps below):
- Define the trigger with the pipeline reference "MyCopyPipeline"
- Start the trigger to have it start triggering the pipeline
Here are the key steps to schedule a pipeline:
1. Define the trigger with the pipeline reference
2. Choose the type of trigger (schedule or manual)
3. Set the trigger schedule (if using a schedule trigger)
4. Start the trigger
By following these steps, you can schedule your pipelines to run automatically at set intervals or on-demand, depending on your needs.
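For instance, a schedule trigger definition for the "Trigger A" example above might look like the following sketch; the recurrence and parameter values are illustrative:

```json
{
  "name": "TriggerA",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Hour",
        "interval": 1,
        "startTime": "2025-01-01T00:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "type": "PipelineReference",
          "referenceName": "MyCopyPipeline"
        },
        "parameters": {
          "sourceFolder": "sales/incoming"
        }
      }
    ]
  }
}
```

The pipelines array is the pipeline reference mentioned in step 1; until the trigger is started, it sits idle and no runs are created.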
Activity Policy JSON
Activity Policy JSON is a crucial aspect of pipeline management. It allows you to define the behavior of your activities in a JSON file.
The timeout value specifies the maximum amount of time an activity can run, with a default of 12 hours and a minimum of 10 minutes. This ensures that activities don't run indefinitely and can be properly monitored.
The retry policy can be customized to suit your needs. You can set the maximum number of retry attempts to a specific integer value, with a default of 0 (no retries). This lets transient failures be retried automatically instead of failing the activity outright.
The retry interval can be adjusted in seconds, with a default of 30 seconds. This allows you to fine-tune the timing of your retries.
If you have secure output, it won't be logged for monitoring purposes. This can be useful for sensitive data. The secureOutput property is a boolean value that defaults to false.
Here's a summary of the key properties:
- timeout: maximum amount of time an activity can run (default 12 hours, minimum 10 minutes)
- retry: maximum number of retry attempts (default 0)
- retryIntervalInSeconds: delay between retry attempts (default 30 seconds)
- secureOutput: when true, the activity's output is not logged for monitoring (default false)
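As a sketch, the policy block sits inside an execution activity's definition in the pipeline JSON; the activity and dataset names below are placeholders:

```json
{
  "name": "CopySalesData",
  "type": "Copy",
  "policy": {
    "timeout": "0.12:00:00",
    "retry": 3,
    "retryIntervalInSeconds": 60,
    "secureOutput": false
  },
  "inputs": [
    { "referenceName": "CompressedCsvDataset", "type": "DatasetReference" }
  ],
  "outputs": [
    { "referenceName": "SalesSqlTableDataset", "type": "DatasetReference" }
  ],
  "typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "AzureSqlSink" }
  }
}
```

Here the timeout uses the D.HH:MM:SS timespan format, so 0.12:00:00 is the 12-hour default mentioned above. Note that the policy block applies to execution activities such as Copy; control activities such as Wait don't take one.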
Control Flow Activities
Control flow activities are the building blocks of a pipeline's logic. They allow you to chain activities in a sequence, branch out, and define parameters at the pipeline level.
You can use the If Condition Activity to branch based on a condition that evaluates to true or false. This activity provides the same functionality as an if statement in programming languages, evaluating a set of activities when the condition is true and another set of activities when the condition is false.
The For Each Activity defines a repeating control flow in your pipeline, allowing you to iterate over a collection and execute specified activities in a loop. This activity is similar to the Foreach looping structure in programming languages.
You can use the Wait Activity to pause the pipeline's execution for a specified time before continuing with subsequent activities. This is useful when you need to wait for a certain condition to be met or for a specific time period to pass.
Other supported control flow activities include Execute Pipeline, Append Variable, Set Variable, Filter, Lookup, Get Metadata, Switch, Until, and Web.
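As an illustration of how branching looks in pipeline JSON, here is a sketch of an If Condition activity that checks a hypothetical pipeline parameter and only runs a copy when it evaluates to true; the parameter, dataset, and activity names are assumptions:

```json
{
  "name": "CheckRunMode",
  "type": "IfCondition",
  "typeProperties": {
    "expression": {
      "value": "@equals(pipeline().parameters.runMode, 'full')",
      "type": "Expression"
    },
    "ifTrueActivities": [
      {
        "name": "CopyFullHistory",
        "type": "Copy",
        "inputs": [
          { "referenceName": "CompressedCsvDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "SalesSqlTableDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ],
    "ifFalseActivities": [
      {
        "name": "SkipWithShortWait",
        "type": "Wait",
        "typeProperties": { "waitTimeInSeconds": 1 }
      }
    ]
  }
}
```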
CI/CD and Publish
CI/CD and Publish is a game-changer for pipeline management. Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub.
With this feature, you can incrementally develop and deliver your ETL processes before publishing the finished product. This allows you to test and refine your pipelines in a controlled environment, reducing the risk of errors and downtime.
You can load the refined data into Azure Synapse Analytics (formerly Azure SQL Data Warehouse), Azure SQL Database, Azure Cosmos DB, or any other analytics engine your business users can point to from their business intelligence tools.
Frequently Asked Questions
Is Azure Data Factory an ETL?
Yes. Azure Data Factory is Microsoft's cloud-based ETL and data integration service. It supports both ETL and ELT patterns and goes beyond a classic ETL tool, enabling more complex data integration, orchestration, and workflow scenarios.
What is Azure Databricks vs data Factory?
Azure Databricks is ideal for big data processing, advanced analytics, and machine learning, while Azure Data Factory focuses on data integration, migration, and orchestration. Choose Databricks for complex data analysis and Data Factory for streamlined data movement and management.
What is the equivalent of Azure Data Factory?
There is no single exact equivalent of Azure Data Factory, but comparable data integration platforms include AWS Glue, SnapLogic Intelligent Integration Platform (IIP), Matillion, IBM DataStage, and, for the transformation layer, dbt.
What is Azure Data Factory vs data Lake?
Azure Data Factory is a data integration service that moves and transforms data, while Azure Data Lake Storage is a secure and scalable storage service for large analytics workloads. Together, they help you manage and store data for optimal performance and security.
What is Azure Data Factory Studio?
Azure Data Factory Studio is the web-based interface for authoring, monitoring, and managing Data Factory resources. Its code-free design tools, such as mapping data flows, let developers design and build data transformations without writing code, which accelerates development.
Sources
- Azure Data Factory - Data Integration Service (microsoft.com)
- Pipelines and activities - Azure Data Factory & ... (microsoft.com)
- Introduction to Azure Data Factory (microsoft.com)
- Developing a Pipeline in Azure Data Factory Hands-on Lab (acloudguru.com)
- Azure Data Factory Learning Path (aka.ms)
- Azure Data Factory YouTube Channel (aka.ms)
- Azure Data Factory (element61.be)