Migrating to Azure Cassandra for Scalable Storage

Azure Cassandra is designed to handle large amounts of data and scale horizontally, making it a great choice for businesses with growing storage needs.

It offers high availability and fault tolerance, ensuring your data is always accessible and safe.

With Azure Cassandra, you can easily scale up or down as needed, without having to worry about downtime or data loss.

This flexibility makes it an ideal solution for businesses that experience sudden spikes in data usage.

Migrating to Azure Cassandra

There are two main methods for migrating data to Azure Cassandra: Azure Managed Instance hybrid cluster and the Cassandra Migrator.

The Azure Managed Instance hybrid cluster method joins your existing cluster with Azure Managed Instance for Apache Cassandra, a managed Cassandra cluster that comes with specific default configurations and built-in automation for monitoring, backup, and repair.

The Cassandra Migrator method involves running a Spark job that copies data from an existing Cassandra instance to an Azure Cassandra instance.

This method requires a Spark cluster in addition to two Cassandra clusters: the pre-existing on-premises or cloud cluster and the target cluster in Azure.

Migration Overview

Migrating to Azure Cassandra can be a complex process, but understanding the basics can make it more manageable.

Azure Managed Instance for Apache Cassandra is a Cassandra cluster with specific default configurations and built-in automation for monitoring, backup, and repair.

Migrating data to Azure Managed Instance from an existing Cassandra database can be done in a variety of ways, each with its own advantages and disadvantages.

Azure Managed Instance and Cosmos DB are the two main options for migrating Cassandra data to Azure, with the latter being a cloud-native NoSQL database that uses its own underlying data model to mimic the way data is managed in other databases, including Cassandra.

To migrate data to Azure Managed Instance, you can use the Azure Managed Instance hybrid cluster or the Cassandra Migrator, each interacting with the system differently and requiring different external systems to work.

Cosmos DB's Cassandra API can interact with Cassandra drivers and CQLSH, making it a viable option for migrating data to Azure.

Migrator

The Azure Cassandra Migrator is a powerful tool for migrating data from an existing Cassandra instance to an Azure Cassandra instance.

It involves running a Spark job that copies data from the existing Cassandra instance to the Azure Cassandra instance, requiring a Spark cluster in addition to two Cassandra clusters.

This method can be run using Azure Databricks for Spark with a Scala notebook to manage the process, or on any Spark cluster using a JAR file and a config file.

The process starts with creating both Cassandra clusters and a Spark cluster, then setting the configuration, and finally kicking off the migration by running the notebook or using spark-submit.

One major drawback of this method is that it only migrates data that exists at the time of the migration, and does not continue to mirror changes over time.

In comparison, the Azure Managed Instance hybrid cluster method continues to mirror changes over time, making it a more suitable option for ongoing data synchronization.

Azure Cassandra Setup

To set up VNet Peering, which allows you to route traffic between two virtual networks privately, navigate to the Azure VNet Peering tab of your cluster and click on Add New VNet Connection.

For VNet Peering to work, your cluster and the VNet must be on the same subscription; in that case, the peering request will automatically be accepted.

To test the peering, you can use netcat or telnet; an exit status of 0 when connecting to port 9042, the exposed port for CQL, indicates success.

Here are the steps to create a VNet Peering request:

  1. Once your cluster has been provisioned, you can create a VNet Peering request through the Instaclustr console.
  2. Fill in the required information on the VNet Peering Connections and click the Submit Virtual Network Peering Request button.
  3. The peering request is then submitted.

If you intend to connect from both the peered VNet and other sources, you can refer to the Instaclustr support article for your options.

Set Up Hosted Cqlsh

To set up hosted CQLSH, click Open Cassandra Shell to launch the hosted shell. This gives you access to the Cassandra database.

You can then explore data modeling basics, partitioning, and secondary indexes. These are essential concepts to understand when working with Cassandra.

The sections that follow cover these topics:

  • Data modeling (basics)
  • Partitioning
  • Secondary Indexes
  • Throughput management

Create Keyspace, Table and Seed Data

To create a Keyspace, Table, and seed some data in Azure Cassandra, you'll need to define the compound primary key. The combination of station_id and ts forms the compound primary key, where station_id is the partition key and ts is the clustering column.

station_id is the partition key, which means all data for a specific station resides within a single partition, keeping a station's rows together so they can be read efficiently.

The clustering column is ts, a timestamp type, which sorts the rows within each partition in descending order so the most recent data always comes first.

Here's a breakdown of the compound primary key:

  • station_id: partition key; determines which partition a row is stored in
  • ts: clustering column; orders rows within the partition, newest first

By defining the compound primary key, you'll be able to efficiently store and retrieve data in your Azure Cassandra database.
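
As a rough sketch, assuming the weather keyspace and the station_id, ts, and state columns referenced elsewhere in this article (the temperature column is added purely for illustration), the keyspace, table, and a seed row might look like this in CQLSH:

```cql
-- Keyspace for the example; replication settings shown for completeness.
CREATE KEYSPACE IF NOT EXISTS weather
  WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Compound primary key: station_id is the partition key,
-- ts is the clustering column, sorted newest-first.
CREATE TABLE IF NOT EXISTS weather.data (
  station_id text,
  ts timestamp,
  state text,
  temperature double,
  PRIMARY KEY (station_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- Seed a sample row.
INSERT INTO weather.data (station_id, ts, state, temperature)
VALUES ('station1', '2023-01-01 00:00:00', 'state1', 21.5);
```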

VNet Peering Setup

VNet Peering Setup is a method for routing traffic between two virtual networks privately. This allows you to access your cluster via private IP and makes for a much more secure network setup.

Note that VNet Peering is only supported when running in your own Azure account, and you should contact Instaclustr Support if you're interested in setting this up.

To set up VNet Peering, you'll need to create a VNet Peering request through the Instaclustr console. Here's a step-by-step guide:

  1. Create a VNet Peering request by navigating to the Azure VNet Peering tab of your cluster and clicking on Add New VNet Connection.
  2. Fill in the required information on the VNet Peering Connections and click the Submit Virtual Network Peering Request button.
  3. The peering request is then submitted. If the cluster and the VNet are both on the same subscription, then the request will automatically be accepted.

Note that if you only intend to connect to your cluster from a peered VNet, you should enable Use private IPs to broadcast for auto-discovery under Cassandra Setup when you create a new cluster, or create a private network cluster. If you're peering into an existing public IP cluster, contact support to change the nodes' broadcast address over to the private IP.

Azure Cassandra Configuration

You can manage throughput in the Azure Cosmos DB Cassandra API either manually or using Autoscale, giving you flexibility in how you configure your database.

Throughput can be configured at the keyspace or table level using CQL ALTER commands, allowing for fine-grained control over resource usage.

Request Units (RU/s) can be provisioned to match your workload, which helps keep performance predictable and avoid throttling as your database scales.

Filter by Non-PK

Filtering data in Azure Cassandra can be a bit tricky, especially when it comes to non-primary key columns.

You should avoid executing filter queries on columns that aren't part of the partition key, as they can result in poor performance.

You can allow such a query explicitly with the ALLOW FILTERING keyword, but the resulting operation may not perform well, so it's generally not recommended.

Inserting data is a good time to think about partitioning, as it can help with query performance later on.
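
To make this concrete, here is a small sketch against the weather.data table used elsewhere in this article; without a secondary index, a filter on the non-partition-key state column is rejected unless ALLOW FILTERING is added:

```cql
-- Rejected: state is not part of the partition key and has no index yet.
SELECT * FROM weather.data WHERE state = 'state1';

-- Accepted, but may scan many partitions and perform poorly.
SELECT * FROM weather.data WHERE state = 'state1' ALLOW FILTERING;
```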

Secondary Index

In the Cassandra API, Azure Cosmos DB doesn't index all attributes by default, unlike the core SQL API. Instead, it supports secondary indexing to create an index on specific attributes.

Once an index is created, a default name with a specific format is used, which you can check by describing your table. The last line of the output will show the index, like data_state_idx in our case.

You can query by state using select * from weather.data where state = 'state1' without adding ALLOW FILTERING.

It's not a good idea to create an index on a frequently updated column, as it can cause a performance penalty for constantly updating the index.

Creating an index when you define the table ensures that data and indexes are in a consistent state.

To remove an index, you simply use the DROP INDEX command.

Here are some key considerations for secondary indexing:

  • It’s not advisable to create an index on a frequently updated column since there is a performance penalty for constantly updating the index
  • It is prudent to create an index when you define the table. This ensures that data and indexes are in a consistent state
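
Here is a brief sketch of the full create/query/drop cycle against the weather.data table; the index name data_state_idx follows the article's example, but the default name generated for your own table may differ:

```cql
-- Create a secondary index on the state column (name is auto-generated).
CREATE INDEX ON weather.data (state);

-- Describing weather.data will now list the index, e.g. data_state_idx,
-- on the last line of its output.

-- With the index in place, this query no longer needs ALLOW FILTERING.
SELECT * FROM weather.data WHERE state = 'state1';

-- Remove the index when it is no longer needed.
DROP INDEX weather.data_state_idx;
```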

Partitioning

Partitioning is a key concept in Azure Cassandra configuration, and it's essential to understand how to optimize your database for querying.

You can create a separate table for queries by state, like data_by_state, which has a compound primary key with state as the partition key and ts as the clustering column.

Inserting data into this table is straightforward, and you can execute queries that filter by state and date/time.

To further optimize queries, you can create a composite partition key, which limits the partition size to a day's worth of data for a particular state.

The composite partition key consists of state and day, while ts remains the clustering column and contributes to the uniqueness of each record.

You'll need to specify both partition keys, state and day, to query records in the new table, data_state_per_day.
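
As a minimal sketch of the two tables described above, reusing the column names from the earlier weather example (the day column's type and the temperature column are assumptions):

```cql
-- Query by state: state is the partition key, ts the clustering column.
CREATE TABLE IF NOT EXISTS weather.data_by_state (
  state text,
  ts timestamp,
  station_id text,
  temperature double,
  PRIMARY KEY (state, ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- Composite partition key (state, day) limits each partition to one
-- day's worth of data per state; ts still keeps rows unique and ordered.
CREATE TABLE IF NOT EXISTS weather.data_state_per_day (
  state text,
  day date,
  ts timestamp,
  station_id text,
  temperature double,
  PRIMARY KEY ((state, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- Both partition key columns must be supplied when querying.
SELECT * FROM weather.data_state_per_day
WHERE state = 'state1' AND day = '2023-01-01';
```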

Throughput Management

You can manage throughput in Azure Cosmos DB either manually or using Autoscale. In both cases, you can use CQL ALTER commands to configure Request Units (RU/s) at the keyspace or table level.

There are two ways to manage throughput: manual mode and Autoscale mode. In manual mode, you specify the throughput for a keyspace or table yourself; with Autoscale, the provisioned RU/s scale automatically up to a maximum value you set.

In manual mode, you can create a keyspace with a specific value for cosmosdb_provisioned_throughput, which will be shared with all tables in that keyspace. This value defaults to 400 RU/s if not specified.
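
For example, here is a sketch of creating keyspaces with shared throughput; the cosmosdb_provisioned_throughput and cosmosdb_autoscale_max_throughput options follow Azure's documented CQL extensions for the Cosmos DB Cassandra API, but check the current documentation for exact names and limits:

```cql
-- Shared manual throughput (RU/s) for every table in the keyspace.
CREATE KEYSPACE IF NOT EXISTS weather_shared
  WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
  AND cosmosdb_provisioned_throughput = 2000;

-- Shared Autoscale throughput, scaling up to the given maximum RU/s.
CREATE KEYSPACE IF NOT EXISTS weather_autoscale
  WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
  AND cosmosdb_autoscale_max_throughput = 4000;
```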

You can update the throughput of an existing keyspace, but you'll get an error if the keyspace doesn't have a shared throughput configuration. This is what happened when we tried to update the weather keyspace earlier.

To update the throughput of an existing table, you can use the CQL ALTER command, but you'll get an error if the table doesn't have its own throughput configuration and instead shares the keyspace's throughput.
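
Again as a sketch, the corresponding ALTER commands look like this; they only succeed where the keyspace or table actually has its own throughput configuration, as noted above:

```cql
-- Change the shared throughput of a keyspace created in manual mode.
ALTER KEYSPACE weather_shared
  WITH cosmosdb_provisioned_throughput = 3000;

-- Change the dedicated throughput of a single table.
ALTER TABLE weather.data
  WITH cosmosdb_provisioned_throughput = 1000;
```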

Frequently Asked Questions

What is the difference between Cassandra database and Azure Cosmos DB?

Apache Cassandra and Azure Cosmos DB differ in their default write configurations, with Cassandra being a multi-master system and Cosmos DB offering flexible write options, including single-region writes. This difference impacts how data is written and replicated across regions in each database.
