
Azure PolyBase is a tool for loading data into Azure Synapse Analytics with ordinary T-SQL. It reads delimited text files such as CSV, as well as the Hadoop file formats RC File, ORC, and Parquet; nested formats such as JSON are not supported.
To get started with Azure PolyBase, you'll need a dedicated SQL pool in your Azure Synapse Analytics workspace. This gives you a dedicated space to load and process your data.
PolyBase supports a wide range of data types, including integer, string, and date/time fields. As long as the values in each file match the column types you declare, you can load them without writing separate conversion logic.
Preparing Data
You might need to prepare and clean the data in your storage account before loading it into a dedicated SQL pool. You can do this while the data is still in the source, as you export it to text files, or after it's in Azure Storage.
It's easiest to work with the data as early in the process as possible, making it a good idea to perform data preparation while exporting the data to text files. Getting data out of your source system depends on the storage location, and the goal is to move the data into PolyBase supported delimited text files.
Extract the source data into delimited text files. PolyBase loads data from UTF-8 and UTF-16 encoded delimited text files, so make sure your exports use one of these encodings.
To format the text files, align each row with the external table and file format definition: the fields in every row must appear in the same order, and with compatible types, as the columns in the table definition.
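As a minimal sketch of this alignment, assume a hypothetical text file whose rows look like the one in the comment below. The external table declares four columns in the same order with compatible types. The names SalesExternal, AzureStorageSource, and CsvFormat are placeholders, and the data source and file format are assumed to exist already. The REJECT_TYPE and REJECT_VALUE options set how many misaligned rows are tolerated before a query against the table fails.

```sql
-- Hypothetical row in the text file:
--   1001,Contoso,2024-01-15,249.99
CREATE EXTERNAL TABLE dbo.SalesExternal
(
    SaleId       INT            NOT NULL,  -- field 1
    CustomerName NVARCHAR(100)  NULL,      -- field 2
    SaleDate     DATE           NULL,      -- field 3
    Amount       DECIMAL(10, 2) NULL       -- field 4
)
WITH
(
    LOCATION     = '/sales/',              -- placeholder folder in the storage container
    DATA_SOURCE  = AzureStorageSource,     -- assumed to exist
    FILE_FORMAT  = CsvFormat,              -- assumed to exist
    REJECT_TYPE  = VALUE,                  -- count rejected rows as an absolute number
    REJECT_VALUE = 0                       -- fail as soon as one row does not match
);
```

Setting REJECT_VALUE to 0 is a strict choice that surfaces alignment problems early; a higher value tolerates some bad rows at the cost of silently skipping them.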
Loading Data
PolyBase can load data into Azure Synapse Analytics from Azure Blob storage or Azure Data Lake Store.
To load data into dedicated SQL pool staging tables, use PolyBase with T-SQL, which gives you the most control over the loading process. However, this method requires you to define the external data objects yourself, whereas the other methods define them behind the scenes as you map source tables to destination tables. To orchestrate T-SQL loads, you can use Azure Data Factory, SSIS, or Azure Functions.
You can also use PolyBase with SSIS, which works well when your source data is in SQL Server, or with Azure Data Factory (ADF), which lets you define a pipeline and schedule jobs. Another option is PolyBase with Azure Databricks, which can read an Azure Synapse Analytics table into a Databricks dataframe and write a Databricks dataframe to an Azure Synapse Analytics table.
Before loading data, you need to define external tables in your data warehouse using the CREATE EXTERNAL TABLE syntax, specifying the data source, file format, and table definitions.
Here are the tools and services you can use to move data to Azure Storage:
- Azure ExpressRoute service enhances network throughput, performance, and predictability.
- AzCopy utility moves data to Azure Storage over the public internet.
- Azure Data Factory (ADF) has a gateway that you can install on your local server.
Load into Dedicated SQL Pool Staging Tables
Loading data into dedicated SQL pool staging tables is an essential step in the data loading process. It allows you to handle errors without interfering with production tables.
You should load data into a staging table, as it gives you the opportunity to use SQL pool's built-in distributed query processing capabilities for data transformations before inserting the data into production tables.
Staging tables are designed to hold data temporarily before it's loaded into production tables. They're a crucial part of the data loading process, and using them can help you avoid errors and improve data quality.
To load data into a staging table, you can use PolyBase, which is a feature of dedicated SQL pool that allows you to load data from various sources, including Azure Blob storage and Azure Data Lake Store.
Here are some options for loading data with PolyBase:
- PolyBase with T-SQL works well when your data is in Azure Blob storage or Azure Data Lake Store.
- PolyBase with SSIS works well when your source data is in SQL Server.
- PolyBase with Azure Data Factory (ADF) is another orchestration tool.
- PolyBase with Azure Databricks transfers data from an Azure Synapse Analytics table to a Databricks dataframe and/or writes data from a Databricks dataframe to an Azure Synapse Analytics table using PolyBase.
These options give you flexibility and control over the loading process, and can be used depending on your specific needs and requirements.
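The staging pattern described in this section can be sketched in T-SQL as follows. This is an illustration under assumptions, not a prescribed implementation: dbo.SalesExternal is a hypothetical external table that already exists, dbo.Sales is a hypothetical production table, and the round-robin heap layout is one common choice for staging tables rather than a requirement.

```sql
-- Step 1: load from the external table into a staging table with CTAS.
CREATE TABLE dbo.Sales_Staging
WITH
(
    DISTRIBUTION = ROUND_ROBIN,  -- assumption: spread rows evenly, no distribution key needed
    HEAP                         -- assumption: no index to maintain during the load
)
AS
SELECT SaleId, CustomerName, SaleDate, Amount
FROM dbo.SalesExternal;

-- Step 2: transform inside the pool, then insert into the production table.
INSERT INTO dbo.Sales (SaleId, CustomerName, SaleDate, Amount)
SELECT SaleId,
       UPPER(LTRIM(RTRIM(CustomerName))),  -- example cleanup only
       SaleDate,
       Amount
FROM dbo.Sales_Staging
WHERE Amount IS NOT NULL;                   -- example filter only
```

If the load in step 1 fails or the data looks wrong, the production table is untouched and the staging table can simply be dropped and reloaded.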
CSV in Azure Blob Storage
Loading CSV data from Azure Blob Storage into Azure Synapse Analytics is a straightforward process. You can use PolyBase to query your CSV data as if it were within a local table.
To start, you need to configure an external file format for CSV. This involves specifying how your CSV files are read and interpreted, such as the field terminator and string delimiter. You can do this using the CREATE EXTERNAL FILE FORMAT command, as shown in this example:

    CREATE EXTERNAL FILE FORMAT CSVFileFormat
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
            FIELD_TERMINATOR = ',',
            STRING_DELIMITER = '"',
            FIRST_ROW = 2  -- Assuming the first row contains headers
        )
    );

The field terminator is the character that marks the boundary between each field in a single record. For CSV files, this is typically a comma. However, it could be another character if you're working with a different delimiter.
You can also specify the string delimiters, which are used to encapsulate field values that contain the field terminator character or other special characters. In CSV files, string delimiters are usually double quotes.
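Here are the settings used in the CREATE EXTERNAL FILE FORMAT command above:
- FIELD_TERMINATOR: the character that separates fields, a comma for CSV.
- STRING_DELIMITER: the character that encloses field values, usually a double quote.
- FIRST_ROW: the first row to read; a value of 2 skips a single header row.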
Once you've configured the external file format, you can create an external table that mirrors your CSV file's schema, linking to both the data source and file format. This will allow you to query your CSV data as if it were within a local table.
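A minimal sketch of the remaining objects might look like the following. It assumes the CSVFileFormat file format from earlier in this section, access to Blob storage with a storage account key, and placeholder values in angle brackets. The object names BlobStorageCredential, AzureBlobCsv, and dbo.CustomersCsvExternal, and the column list, are hypothetical.

```sql
-- A database master key is required before creating a credential.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- With a storage account key, IDENTITY can be any string; SECRET holds the key.
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';

CREATE EXTERNAL DATA SOURCE AzureBlobCsv
WITH
(
    TYPE       = HADOOP,
    LOCATION   = 'wasbs://<container>@<account>.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

-- Columns must mirror the CSV layout, in the same order.
CREATE EXTERNAL TABLE dbo.CustomersCsvExternal
(
    CustomerId   INT           NOT NULL,
    CustomerName NVARCHAR(100) NULL,
    SignupDate   DATE          NULL
)
WITH
(
    LOCATION    = '/customers/',   -- folder or file path inside the container
    DATA_SOURCE = AzureBlobCsv,
    FILE_FORMAT = CSVFileFormat
);

-- The CSV data can now be queried like a local table.
SELECT TOP 10 CustomerId, CustomerName, SignupDate
FROM dbo.CustomersCsvExternal;
```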
Azure PolyBase Features
PolyBase is a feature that lets users integrate and analyze structured and unstructured data from various external sources.
It lets users create external data sources and external tables to access and query data in external stores, such as Hadoop-based data stores, and it loads data in parallel for improved efficiency.
PolyBase integrates with Azure services such as Azure Data Lake Store, Azure Blob Storage, and Azure HDInsight to provide a more comprehensive view of data and analysis capabilities.
Here are some key features of PolyBase:
- Simplified data integration: PolyBase lets users access and query data from disparate systems without the need for complex ETL processes.
- Support for multiple input formats: PolyBase reads delimited text files as well as the Hadoop file formats Hive RCFile, Hive ORC, and Parquet.
- Query external data sources: PolyBase lets users query external data sources such as Hadoop and other data stores.
- Distributed queries: PolyBase lets users run distributed queries across both structured and unstructured data sources.
- Query pushdown: PolyBase can push processing down to the source system rather than executing every task itself.
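To illustrate the pushdown feature from the list above: in SQL Server, a query hint can request or suppress pushdown for a single query. The sketch below assumes a hypothetical external table dbo.SalesExternal over a source that supports pushdown; whether computation is actually pushed down depends on the data source and its configuration.

```sql
-- Ask PolyBase to push the filter and aggregation to the external source.
SELECT CustomerName, SUM(Amount) AS TotalAmount
FROM dbo.SalesExternal
WHERE SaleDate >= '2024-01-01'
GROUP BY CustomerName
OPTION (FORCE EXTERNALPUSHDOWN);

-- Run the same query with pushdown switched off, for comparison.
SELECT CustomerName, SUM(Amount) AS TotalAmount
FROM dbo.SalesExternal
WHERE SaleDate >= '2024-01-01'
GROUP BY CustomerName
OPTION (DISABLE EXTERNALPUSHDOWN);
```

Comparing the two runs is a simple way to check whether pushdown helps for a given query shape.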
Non-PolyBase Loading Options
If your data is not compatible with PolyBase, you can use bcp or the SqlBulkCopy API. bcp loads directly into the dedicated SQL pool without going through Azure Blob storage.
Both options load more slowly than PolyBase, so they are intended only for small loads.
Enabled Instance
An enabled instance is a prerequisite for using PolyBase. It refers to a SQL Server instance, or an Azure SQL Data Warehouse instance (now dedicated SQL pool in Azure Synapse Analytics), on which the PolyBase feature is available.
On Azure SQL Data Warehouse, PolyBase is built into the service, so there is nothing to install before you create external data sources and external tables. On SQL Server, PolyBase is an optional feature: select it during SQL Server setup and, in SQL Server 2019 (15.x) and later, enable it on the instance with the sp_configure option 'polybase enabled'.
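On a SQL Server instance, a short sketch of checking for and enabling the feature could look like this. It assumes SQL Server 2019 (15.x) or later, that the PolyBase feature was selected during setup, and that the login has permission to change server configuration.

```sql
-- Returns 1 when the PolyBase feature is installed on this instance.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

-- Enable PolyBase on the instance and apply the setting.
EXEC sp_configure @configname = 'polybase enabled', @configvalue = 1;
RECONFIGURE;
```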
Clusters vs. Instances
PolyBase capabilities differ slightly between SQL Server 2019 (15.x) big data clusters and stand-alone instances.
In SQL Server 2019 (15.x) big data clusters, you can create external data sources for SQL Server, Oracle, Teradata, and MongoDB, as well as Hadoop data sources and Azure Blob Storage.
SQL Server 2019 (15.x) big data clusters also support scale-out query execution, which can be a significant performance boost for large-scale data processing.
In stand-alone instances, you can also create external data sources using a compatible third-party ODBC driver, which is useful for connecting to databases beyond those listed above.
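As an illustration of one of the external data source types mentioned here, the sketch below connects one SQL Server 2019 instance to another SQL Server. The host name, login, database, and table (remote-host, remote_login, SalesDb.dbo.Orders) are hypothetical, and a database master key is assumed to exist already.

```sql
-- Credential used to sign in to the remote SQL Server (placeholder values).
CREATE DATABASE SCOPED CREDENTIAL RemoteSqlCredential
WITH IDENTITY = 'remote_login', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE RemoteSqlServer
WITH
(
    LOCATION   = 'sqlserver://remote-host:1433',
    CREDENTIAL = RemoteSqlCredential
);

-- LOCATION is the three-part name of the remote table.
CREATE EXTERNAL TABLE dbo.RemoteOrders
(
    OrderId   INT  NOT NULL,
    OrderDate DATE NULL
)
WITH
(
    LOCATION    = 'SalesDb.dbo.Orders',
    DATA_SOURCE = RemoteSqlServer
);
```

Other source types follow the same pattern with a different LOCATION prefix and, for some sources, additional connection options.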
The Power of PolyBase
PolyBase simplifies managing and analyzing data across different storage systems. It lets you query data without moving it from where it's stored, saving both time and storage space.
With PolyBase, you can access and combine data from disparate systems at query time, without the need for complex ETL processes.
One of the key benefits of PolyBase is its support for multiple input formats, including delimited text files and the Hadoop file formats Hive RCFile, Hive ORC, and Parquet, which makes it easier to work with data from numerous external sources.
PolyBase also lets users run distributed queries across both structured and unstructured data sources. This means you can query data in Azure Data Lake Store or Hadoop Distributed File System (HDFS) alongside data stored in traditional SQL Server databases, all within a single query.
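A single query of this kind might look like the sketch below, which joins a local table with an external table. Both dbo.Customers (local) and dbo.SalesExternal (external, backed by files in a data lake or HDFS) are hypothetical and assumed to exist with the columns shown.

```sql
-- Join local relational data with externally stored data in one statement.
SELECT c.CustomerId,
       c.Region,
       SUM(s.Amount) AS TotalAmount
FROM dbo.Customers AS c            -- local table
JOIN dbo.SalesExternal AS s        -- external table
    ON s.CustomerName = c.CustomerName
GROUP BY c.CustomerId, c.Region;
```

From the query author's point of view the external table behaves like any other table; the engine reads the external files when the query runs.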
Here are some of the key features of PolyBase:
- Query external data sources
- Supports multiple input formats
- Distributed queries
- Supports query pushdown
These features make PolyBase a useful tool for managing and analyzing data across different storage systems.
Tables
Tables are a fundamental concept in Azure PolyBase. PolyBase uses external tables to define and access data in Azure Storage.
An external table is similar to a database view: it contains the table schema and points to data stored outside the data warehouse.
To create an external table, use the CREATE EXTERNAL TABLE syntax, specifying the data source, the file format, and the table definition.
External tables support a wide range of data types, including string, integer, decimal, and date/time data types. This makes it easy to work with data from different sources.
Here's a quick rundown of the steps involved in creating an external table:
- CREATE EXTERNAL DATA SOURCE
- CREATE EXTERNAL FILE FORMAT
- CREATE EXTERNAL TABLE
These steps will help you create an external table that references data stored in an external data source. With external tables, you can access and manipulate data without physically moving it, which can save time and resources.
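The three steps listed above can be sketched end to end as follows, here for Parquet files in Azure Data Lake Storage. All object names (WorkspaceIdentity, DataLakeSource, ParquetFormat, dbo.EventsExternal), the column list, and the angle-bracket placeholders are hypothetical. The sketch also assumes a dedicated SQL pool with an existing database master key and a managed identity that has been granted access to the storage account.

```sql
-- Credential that authenticates with the managed identity (assumption).
CREATE DATABASE SCOPED CREDENTIAL WorkspaceIdentity
WITH IDENTITY = 'Managed Service Identity';

-- Step 1: where the data lives.
CREATE EXTERNAL DATA SOURCE DataLakeSource
WITH
(
    TYPE       = HADOOP,
    LOCATION   = 'abfss://<container>@<account>.dfs.core.windows.net',
    CREDENTIAL = WorkspaceIdentity
);

-- Step 2: how the files are encoded.
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH
(
    FORMAT_TYPE      = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

-- Step 3: the schema, tied to the data source and file format.
CREATE EXTERNAL TABLE dbo.EventsExternal
(
    EventId   BIGINT        NOT NULL,
    EventType NVARCHAR(50)  NULL,
    EventTime DATETIME2     NULL
)
WITH
(
    LOCATION    = '/events/',
    DATA_SOURCE = DataLakeSource,
    FILE_FORMAT = ParquetFormat
);
```

The order matters because the external table refers to the other two objects by name.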
Frequently Asked Questions
What is the difference between a linked server and PolyBase?
PolyBase generally outperforms a linked server for complex queries and large data volumes, but the two serve different purposes in data integration. To determine which is best for your needs, consider the complexity of your queries and the size of your data.
What is the difference between an external table and PolyBase?
An external table defines the structure and location of the data, while PolyBase is the technology that retrieves the data behind external tables, reading various file formats. PolyBase builds on external tables to provide a more comprehensive data access solution.