Data lakes are massive storage systems that hold vast amounts of raw, unprocessed data from various sources.
This data is often scattered and disorganized, making it difficult to access and utilize.
Data lake indexing is a technique that helps to organize and structure this data, making it more accessible and easier to query.
By creating an index, you can dramatically reduce the time it takes to search for and retrieve specific data, often from hours down to seconds.
Data lake indexing can be applied to various types of data, including structured, semi-structured, and unstructured data.
This flexibility is one of the key benefits of data lake indexing, allowing you to manage a wide range of data sources and formats in a single system.
Prerequisites
To get started with data lake indexing, you'll need to meet some prerequisites. ADLS Gen2 with hierarchical namespace enabled is a must-have, which you can set up through Azure Storage.
To enable hierarchical namespace, simply select the option when setting up your storage account. This will allow you to organize your files into a hierarchy of directories and nested subdirectories.
ADLS Gen2 access tiers include hot, cool, and archive, but only hot and cool can be accessed by search indexers. If you have blobs containing text, you're good to go, but if you have binary data, you can include AI enrichment for image analysis.
However, blob content can't exceed the indexer limits for your search service tier. To access your Azure Storage content, you'll need read permissions, which can be granted through a "full access" connection string or Azure roles.
To use Azure roles, make sure the search service managed identity has Storage Blob Data Reader permissions. Formulating REST calls can be a bit tricky, but using a REST client can help.
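If you're formulating those REST calls yourself, here is a minimal sketch of creating an ADLS Gen2 data source with Python's requests library. The service name, admin key, container name, and connection string are placeholders, and the API version shown is one of the generally available versions.

```python
# A minimal sketch (not a drop-in script): creates an ADLS Gen2 data source over REST.
# <search-service>, <admin-key>, the container name, and the connection string are placeholders.
import requests

datasource = {
    "name": "adlsgen2-ds",
    "type": "adlsgen2",
    "credentials": {
        # Full-access connection string. For a managed identity, use the
        # "ResourceId=/subscriptions/.../storageAccounts/<account>;" form instead
        # and grant the identity Storage Blob Data Reader on the account.
        "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
    },
    "container": {"name": "docs"},
}

resp = requests.put(
    "https://<search-service>.search.windows.net/datasources/adlsgen2-ds?api-version=2023-11-01",
    headers={"api-key": "<admin-key>", "Content-Type": "application/json"},
    json=datasource,
)
resp.raise_for_status()
```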
Data Lake Indexing Basics
A data lake is a central repository that stores and processes raw data in any format, allowing for flexibility and scalability. Data lakes are optimized for flexibility, enabling you to store data in its original format and run various types of analyses without expensive pre-processing.
To get the most out of your data lake, you'll want to catalog and index your data to gain an overview of what's available. This involves tracking the different data streams flowing into your data lake and understanding what kind of data is being stored.
Indexing in a data lake involves specifying the inputs, parameters, and properties that control run-time behavior. This includes setting the batch size, which can be adjusted to balance throughput against resource utilization. For example, if the default batch size of 10 documents is overwhelming your resources, you can lower it to ease the load, or raise it to improve throughput when resources allow.
Here are some key factors to consider when configuring your indexer (a configuration sketch follows the list):
- indexedFileNameExtensions: specify the file extensions to index, separated by commas (e.g., .pdf,.docx)
- excludedFileNameExtensions: specify the file extensions to exclude from indexing, separated by commas (e.g., .png,.jpeg)
- dataToExtract: control which parts of the blobs are indexed (e.g., contentAndMetadata)
- parsingMode: specify how blobs should be mapped to search documents (e.g., default, plain text, JSON documents, or CSV files)
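As a rough illustration, here is how those settings might appear in an indexer definition sent with a REST client. The indexer, data source, and index names are hypothetical placeholders.

```python
# A sketch of an indexer definition that applies the settings above.
# The indexer, data source, and index names are hypothetical placeholders.
import requests

indexer = {
    "name": "adlsgen2-indexer",
    "dataSourceName": "adlsgen2-ds",
    "targetIndexName": "docs-index",
    "parameters": {
        "batchSize": 10,  # the default; lower it if resources are strained, raise it for throughput
        "configuration": {
            "indexedFileNameExtensions": ".pdf,.docx",
            "excludedFileNameExtensions": ".png,.jpeg",
            "dataToExtract": "contentAndMetadata",
            "parsingMode": "default",
        },
    },
}

requests.put(
    "https://<search-service>.search.windows.net/indexers/adlsgen2-indexer?api-version=2023-11-01",
    headers={"api-key": "<admin-key>", "Content-Type": "application/json"},
    json=indexer,
).raise_for_status()
```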
By understanding these basics of data lake indexing, you can set up an efficient and effective indexing strategy that meets your specific needs.
What Is a Data Lake?
A data lake is a central repository that stores and processes raw data of any size and format. It's like a lake with many rivers flowing into it, where data streams from various sources are stored together without strict separation or pre-defined structuring.
Data lakes are optimized for flexibility, allowing you to store data in any format and run various types of analyses without expensive pre-processing. They're a reaction to data warehouses, which require data to be in a specific structured format and often store data in proprietary formats, leading to vendor lock-in.
A data lake gives you four key benefits:
- Free data movement from multiple sources, in its original format, possibly in real-time
- Cataloging and indexing to give you an overview of what kind of data is in your data lake
- Access from different teams and roles for using the data downstream
- The ability to perform data science and machine learning analyses
A local directory of Parquet files can be considered a data lake, as can an S3 bucket containing many different file types with data formatted as tables, JSON, free text, images, video, etc.
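As a minimal illustration of that idea, the sketch below treats a local directory of Parquet files as a tiny data lake and queries it in place with pyarrow. The directory path and the user_id/country columns are hypothetical.

```python
# A toy "data lake": a local directory of Parquet files queried in place with pyarrow.
# The directory path and the country/user_id columns are hypothetical.
import pyarrow.dataset as ds

dataset = ds.dataset("./lake/raw/", format="parquet")

# Read only the columns and rows we need, without any pre-processing of the files.
table = dataset.to_table(
    columns=["user_id", "country"],
    filter=ds.field("country") == "DE",
)
print(table.num_rows)
```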
Table Basics
Delta tables are a type of data storage that supports operations like insert, update, delete, and merge, making them a versatile choice for data lake management.
Delta tables maintain the advantages of traditional Parquet-based data lakes, ensuring atomicity, consistency, isolation, and durability (ACID) properties in data operations through delta commits.
Delta tables address the lack of transactional support in traditional data lakes, providing a unified data management platform that seamlessly integrates batch and streaming data processing.
Delta tables allow you to evolve schemas over time, making them a robust solution for modern data lake architectures.
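Here is a minimal sketch of those operations using the Python Delta Lake API, assuming a Spark session already configured for Delta Lake (for example via the delta-spark package). The table path, incoming data path, and event_id column are hypothetical.

```python
# A sketch of Delta table DML, assuming a Spark session already configured for Delta Lake
# (e.g. via the delta-spark package). Paths and the event_id column are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-dml-sketch").getOrCreate()

target = DeltaTable.forPath(spark, "/lake/events")      # existing Delta table
updates = spark.read.parquet("/incoming/events")        # new batch of raw Parquet data

# Upsert: update matching rows, insert the rest -- committed atomically to the Delta log.
(target.alias("t")
 .merge(updates.alias("u"), "t.event_id = u.event_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Schema evolution: an append whose DataFrame carries new columns can merge them into the schema.
# (Shown separately from the merge above purely to illustrate the option.)
(updates.write.format("delta")
 .mode("append")
 .option("mergeSchema", "true")
 .save("/lake/events"))
```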
Configuring Indexing
Configuring indexing is a crucial step in setting up a data lake indexing solution. You have several options for selective processing, including placing blobs in a virtual folder, including or excluding blobs by file type, and including or excluding arbitrary blobs.
To determine which blobs to index, review your source data and consider whether any changes should be made up front. You can use the "query" parameter in an indexer data source definition to specify a virtual folder, which will only index blobs in that folder.
To exclude blobs by file type, use the supported document formats list to determine which blobs to exclude. For example, you might want to exclude image or audio files that don't provide searchable text. This capability is controlled through configuration settings in the indexer.
You can also exclude arbitrary blobs by adding metadata properties and values to blobs in Blob Storage. The "AzureSearch_Skip" property, for example, instructs the blob indexer to completely skip the blob.
If you don't set up inclusion or exclusion criteria, the indexer reports an ineligible blob as an error and moves on. You can raise the error tolerance in the indexer configuration settings so that a run isn't stopped by too many of these errors.
To configure indexing, create or update an indexer by giving it a name and referencing the data source and target index. You can specify the inputs, parameters, and properties controlling run-time behaviors, including which parts of a blob to index.
The key configuration settings to consider are the ones listed earlier: indexedFileNameExtensions, excludedFileNameExtensions, dataToExtract, and parsingMode.
You can also specify field mappings to control how data is mapped from the source to the target index. For example, you might want to map a specific metadata property to a field in the target index.
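For example, the fieldMappings portion of an indexer definition might look like the sketch below. The target field names are hypothetical, and base64-encoding the storage path is a common way to make it usable as a document key.

```python
# A sketch of the fieldMappings portion of an indexer definition; target field names are hypothetical.
indexer_field_mappings = {
    "fieldMappings": [
        # Base64-encode the blob path so it can safely serve as the index key.
        {"sourceFieldName": "metadata_storage_path", "targetFieldName": "id",
         "mappingFunction": {"name": "base64Encode"}},
        # Surface the blob's file name as its own searchable field.
        {"sourceFieldName": "metadata_storage_name", "targetFieldName": "fileName"},
    ]
}
```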
By carefully configuring indexing, you can ensure that your data lake indexing solution is efficient, effective, and scalable.
Indexable Data
Indexable data refers to the types of files and formats that can be indexed in a data lake. You can index blobs in a container, but it's essential to determine which blobs to index first. This can be done by reviewing your source data and making changes up front, such as placing blobs in a virtual folder or excluding certain file types.
The ADLS Gen2 indexer can extract text from various document formats, including CSV, EML, EPUB, GZ, HTML, JSON, KML, Microsoft Office formats, Open Document formats, PDF, Plain text files, RTF, XML, and ZIP. These formats are supported out of the box, and you don't need to configure anything extra.
To make the most of indexing, it's crucial to understand how open table formats work. These formats use a layer of metadata over the data files, which can be used to reconstruct the table and control the schema. This abstraction creates a more efficient way of storing and retrieving data, but it still requires a compute engine to interpret the metadata files and update the data files accordingly.
Determine Indexable Blobs
To determine which blobs to index, you should review your source data first. This will help you decide if any changes need to be made up front. An indexer can only process content from one container at a time by default.
You have several options for more selective processing. One way is to place blobs in a virtual folder. An indexer data source definition includes a "query" parameter that can take a virtual folder. If you specify a virtual folder, only those blobs in the folder are indexed.
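As a sketch, the container portion of a data source definition with a virtual folder might look like this; the container and folder names are placeholders.

```python
# A sketch of the container portion of a data source definition; names are placeholders.
# Only blobs under the "reports/2024" virtual folder would be indexed.
container_definition = {
    "container": {
        "name": "docs",
        "query": "reports/2024",
    }
}
```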
You can also include or exclude blobs by file type. The supported document formats list can help you determine which blobs to exclude. For example, you might want to exclude image or audio files that don't provide searchable text.
Here's a list of ways to include or exclude blobs:
- Place blobs in a virtual folder.
- Include or exclude blobs by file type.
- Include or exclude arbitrary blobs by adding metadata properties and values to blobs in Blob Storage.
If you don't set up inclusion or exclusion criteria, the indexer will report an ineligible blob as an error and move on. This can lead to processing stopping if enough errors occur. You can specify error tolerance in the indexer configuration settings.
Open Table Formats
Open table formats store data in a columnar file format, such as Parquet, and manage a layer of metadata over the data files. This metadata layer records the changes applied to the data files over time, allowing the table to be reconstructed as it exists at the current point in time.
The metadata layer is made up of metadata files that describe the data files and can be used to control the schema of the table. For example, if a commit deletes a few rows, the metadata files record that change.
Most open table formats started with Apache Spark as the compute engine, but other compute engines have since added support by implementing the logic for reading and interpreting those metadata files.
A metastore can make open table formats more accessible by registering each table in a catalog database that exposes a SQL interface. This allows you to work with open table formats like Hudi, Delta, or Iceberg through plain SQL.
Here are some key benefits of using a metastore with open table formats:
- Provides an abstraction of a table
- Allows for schema control
- Enables SQL interface
- Supports multiple open table formats
By using a metastore with open table formats, you gain a common SQL interface and can work with different formats in a consistent way.
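As a sketch of that workflow, the snippet below registers a Delta table in a Hive-compatible metastore through Spark SQL and then queries it with plain SQL. The database, table, path, and column names are hypothetical, and the Spark session is assumed to be configured for Delta Lake.

```python
# A sketch: register a Delta table in a Hive-compatible metastore and query it with SQL.
# Assumes a Spark session configured for Delta Lake; database, table, path, and column are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("metastore-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events
    USING DELTA
    LOCATION '/lake/events'
""")

# Once registered, the table is just SQL to downstream users, whatever the underlying format.
spark.sql("SELECT event_type, COUNT(*) AS n FROM analytics.events GROUP BY event_type").show()
```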
Indexing Process
An indexer can index content from one container at a time, and by default, all blobs in the container are processed.
You have several options for more selective processing, such as placing blobs in a virtual folder or including or excluding blobs by file type.
To exclude blobs by file type, you can specify a comma-separated list of file extensions under the "configuration" settings in the indexer.
For example, you can specify ".pdf,.docx" under "indexedFileNameExtensions" to include only PDF and Word documents.
You can also exclude image or audio files that don't provide searchable text by specifying ".png,.jpeg" under "excludedFileNameExtensions".
If you want to skip a specific blob for whatever reason, you can add the metadata property "AzureSearch_Skip" with a value of "true" to the blob.
This instructs the blob indexer to completely skip the blob and neither metadata nor content extraction is attempted.
An indexer typically creates one search document per blob, where the text content and metadata are captured as searchable fields in an index.
If blobs are whole files, you can potentially parse them into multiple search documents, such as parsing rows in a CSV file to create one search document per row.
Here are some common blob types and the parsing modes that map them to search documents:
- Plain text files: "parsingMode": "text" (or "default" for general content extraction)
- CSV files: "parsingMode": "delimitedText", producing one search document per row
- JSON blobs: "parsingMode": "json", "jsonArray", or "jsonLines", depending on the shape of the data
You can also set "dataToExtract" to control which parts of the blobs are indexed, such as setting it to "contentAndMetadata" to index both content and metadata.
You can also set "parsingMode" to control how blobs are parsed into search documents, such as setting it to "default" to use the default parsing mode.
An indexer runs automatically when it's created, but you can prevent this by setting "disabled" to true.
To control indexer execution, you can run an indexer on demand or put it on a schedule.
Execution history contains up to 50 of the most recently completed executions, sorted in reverse chronological order so that the latest execution comes first.
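Here is a brief sketch of triggering an on-demand run and reading the execution history over REST with Python; the service name, admin key, and indexer name are placeholders.

```python
# A sketch of running an indexer on demand and reading its execution history over REST.
# The service name, admin key, and indexer name are placeholders.
import requests

SEARCH = "https://<search-service>.search.windows.net"
HEADERS = {"api-key": "<admin-key>"}

# Trigger an on-demand run.
requests.post(f"{SEARCH}/indexers/adlsgen2-indexer/run?api-version=2023-11-01", headers=HEADERS)

# Inspect status; executionHistory lists the most recent runs first.
status = requests.get(
    f"{SEARCH}/indexers/adlsgen2-indexer/status?api-version=2023-11-01", headers=HEADERS
).json()
for run in status.get("executionHistory", [])[:5]:
    print(run["status"], run.get("itemsProcessed"), run.get("itemsFailed"))
```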
Indexing Options
Indexing options are crucial in data lake indexing. You have several options to choose from when deciding which blobs to index.
You can place blobs in a virtual folder to index only those blobs. This is done by specifying a virtual folder in the indexer data source definition. Alternatively, you can include or exclude blobs by file type. The supported document formats list can help you determine which blobs to exclude.
For example, you might want to exclude image or audio files that don't provide searchable text. This capability is controlled through configuration settings in the indexer.
If you want to skip a specific blob for whatever reason, you can add the "AzureSearch_Skip" property with a value of "true" to the blob in Blob Storage. This will instruct the blob indexer to completely skip the blob.
You can also use the "AzureSearch_SkipContent" property to skip content and extract just the metadata. This is equivalent to setting the "dataToExtract" to "allMetadata" in the configuration settings.
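As a sketch, you could tag a blob with either property using the Azure Storage Python SDK. The connection string, container, and blob names are placeholders, and note that setting metadata replaces the blob's existing metadata.

```python
# A sketch of tagging a blob so the indexer skips it (or skips only its content).
# The connection string, container, and blob names are placeholders.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="docs",
    blob_name="archive/old-report.pdf",
)

# Note: set_blob_metadata replaces the blob's existing metadata.
blob.set_blob_metadata({"AzureSearch_Skip": "true"})
# or, to index metadata only:
# blob.set_blob_metadata({"AzureSearch_SkipContent": "true"})
```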
Here are the settings that control which blobs are indexed based on file type:
- "indexedFileNameExtensions": a comma-separated list of file extensions to include (for example, ".pdf,.docx")
- "excludedFileNameExtensions": a comma-separated list of file extensions to exclude (for example, ".png,.jpeg")
You can specify error tolerance in the indexer configuration settings to prevent processing from stopping if errors occur.
Performance
Delta Lake optimizes your queries by keeping file paths and file-level metadata in a separate transaction log, so the engine doesn't have to list and open every file in the lake. This allows for faster query execution than a regular data lake.
Delta Lake's performance advantage lies in its ability to prioritize partial reads via file-skipping and co-locating similar data to enable better file skipping. This means that queries can skip over unnecessary files and focus on the relevant data, reducing execution times.
Implementing indexes is a crucial step in improving query performance, but it's essential to conduct performance tests to evaluate the impact on query execution times. This involves considering factors such as query complexity, dataset size, and concurrent user loads.
Some vendors also offer autonomous indexing technology that accelerates queries through smart indexing and caching, automatically adjusting indexing strategies as the data changes.
Delta Lake's performance features can be summarized as follows:
- Storing file paths in a separate transaction log
- Storing metadata in the transaction log
- Prioritizing partial reads via file-skipping
- Co-locating similar data to allow for better file skipping
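Co-locating similar data is typically triggered with Delta Lake's OPTIMIZE ... ZORDER BY command. The sketch below assumes Delta Lake 2.0 or later with a Spark session configured for Delta, and the table path and column are hypothetical.

```python
# A sketch of co-locating similar data with Delta Lake's OPTIMIZE ... ZORDER BY.
# Assumes Delta Lake 2.0+ and a Spark session configured for Delta; path and column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zorder-sketch").getOrCreate()

# Compact small files and cluster rows by event_type so per-file statistics
# let queries that filter on event_type skip irrelevant files.
spark.sql("OPTIMIZE delta.`/lake/events` ZORDER BY (event_type)")
```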
Limitations and Considerations
When working with data lake indexing, it's essential to understand how errors are handled and where ADLS Gen2 indexers fall short. The subsections below cover the properties that control the indexer's response to errors, the specific limitations of ADLS Gen2 indexers, and the broader trade-offs to weigh.
Handle Errors
Handling errors is a crucial aspect of indexing, and Azure's blob indexer provides several properties to control its response to errors. By default, the indexer stops as soon as it encounters a blob with an unsupported content type.
The five indexer properties that control the indexer's response to errors are "maxFailedItems", "maxFailedItemsPerBatch", "failOnUnsupportedContentType", "failOnUnprocessableDocument", and "indexStorageMetadataOnlyForOversizedDocuments". These properties allow you to customize the indexer's behavior when errors occur.
You can set "maxFailedItems" to the number of acceptable failures, with a value of -1 allowing processing no matter how many errors occur. For example, if you set "maxFailedItems" to 10, the indexer will continue processing even if it encounters 10 errors. This is useful if you have a large dataset and want to allow for some errors without stopping the entire indexing process.
Here are the five indexer properties that control the indexer's response to errors, along with their valid values and descriptions:
By understanding and using these properties, you can customize the indexer's behavior and ensure that it continues processing even in the presence of errors.
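Put together, the error-tolerance portion of an indexer definition might look like this sketch; the specific thresholds are arbitrary examples.

```python
# A sketch of the error-tolerance portion of an indexer definition; the thresholds are arbitrary examples.
error_settings = {
    "parameters": {
        "maxFailedItems": 10,            # stop only after more than 10 blobs fail overall (-1 = never stop)
        "maxFailedItemsPerBatch": 5,     # same idea, but per batch
        "configuration": {
            "failOnUnsupportedContentType": False,
            "failOnUnprocessableDocument": False,
            "indexStorageMetadataOnlyForOversizedDocuments": True,
        },
    }
}
```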
Limitations
ADLS Gen2 indexers have some limitations you should be aware of. ADLS Gen2 indexers cannot use container level SAS tokens for enumerating and indexing content from a storage account.
This is because they make a check to see if the storage account has hierarchical namespaces enabled by calling the Filesystem - Get properties API. If this isn't enabled, you're better off using blob indexers for performant enumeration of blobs.
You might also find that blobs aren't guaranteed to get reindexed if you rename a directory and the property metadata_storage_path is mapped to be the index key field.
To fix this, you can update the LastModified timestamps for all the affected blobs.
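One possible way to do that with the Azure Storage Python SDK is to rewrite each affected blob's metadata unchanged, which bumps its Last-Modified timestamp so the indexer picks it up on the next run. The connection string, container name, and folder prefix are placeholders.

```python
# One possible approach: rewriting each blob's metadata unchanged bumps its Last-Modified
# timestamp, prompting the indexer to re-process it on the next run.
# The connection string, container name, and folder prefix are placeholders.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string("<storage-connection-string>", "docs")

for item in container.list_blobs(name_starts_with="renamed-folder/"):
    blob = container.get_blob_client(item.name)
    existing = blob.get_blob_properties().metadata or {}
    blob.set_blob_metadata(dict(existing))  # no metadata change, but Last-Modified is updated
```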
Gains and Losses
As we explore the limitations and considerations of this topic, it's essential to acknowledge the gains and losses involved.
One significant gain is the increased efficiency of processes, as seen in the example of streamlined workflows that can save up to 30% of time and resources.
However, this efficiency often comes at the cost of job losses, as automation replaces human labor in various industries.
A notable example is the reduction of manual labor in manufacturing, where machines can produce goods faster and more accurately.
On the other hand, the increased efficiency can also lead to cost savings, which can be reinvested in the business or passed on to customers.
For instance, companies that adopt efficient production methods can reduce their overhead costs by up to 25%.
While gains in efficiency are significant, it's crucial to consider the potential losses, including the negative impact on local communities and the environment.
The loss of jobs in one area can lead to a ripple effect, causing economic instability in surrounding communities.
In some cases, the increased efficiency can also result in a decrease in product quality, as machines may not be able to replicate the nuances of human craftsmanship.
However, it's worth noting that many companies are working to mitigate these losses by implementing measures to support workers who have lost their jobs due to automation.
Frequently Asked Questions
What is an index in data processing?
An index in data processing is a fast-access tool that helps quickly locate and retrieve specific data. It's essentially a shortcut to your data, making it easier to find what you need.
Sources
- https://learn.microsoft.com/en-us/azure/search/search-howto-index-azure-data-lake-storage
- https://delta.io/blog/delta-lake-vs-data-lake/
- https://lakefs.io/blog/data-lake-architecture-components/
- https://medium.com/event-driven-utopia/enhancing-delta-lake-performance-with-indexing-a-comprehensive-guide-099455821334
- https://www.starburst.io/blog/delivering-text-search-capabilities-directly-on-the-data-lake-with-starburst/