DocumentDB Change Streams Simplify Data Synchronization


Credit: pexels.com, A laptop displaying an analytics dashboard with real-time data tracking and analysis tools.

DocumentDB change streams are a game-changer for data synchronization, allowing you to easily replicate data across multiple systems and applications.

With change streams, you can capture every change made to your data in real-time, including inserts, updates, and deletes.

This means you can ensure data consistency across all your systems, eliminating the need for manual data synchronization.

Change streams also enable you to build real-time data pipelines, making it easier to integrate with other services and applications.

Setup and Configuration

To set up a DocumentDB cluster, you create it with CDK's DatabaseCluster construct. The engineVersion is set to 4.0.0, which supports change streams (they are available in DocumentDB 3.6 and later).

The DatabaseCluster construct creates a master user secret for you and stores it in Secrets Manager. This secret is stored under a name defined in masterUser.secretName.

To ensure the cluster is launched in a private subnet, set vpcSubnets.subnetType to SubnetType.PRIVATE_WITH_EGRESS. Subnets of this type have outbound internet access but accept no inbound connections from the internet.

The DatabaseCluster will automatically select private subnets for you. You don't need to worry about selecting the right subnets manually.

To avoid unexpected costs in a demo, set the removalPolicy to RemovalPolicy.DESTROY. This ensures the cluster is deleted along with the stack; in production you would normally retain a stateful resource like a database cluster.
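Putting the settings above together, a minimal CDK sketch might look like the following. The stack name, username, secret name, and instance size are illustrative assumptions, not values from the demo project:

```typescript
import * as cdk from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as docdb from "aws-cdk-lib/aws-docdb";
import { Construct } from "constructs";

export class DocumentDbStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, "Vpc");

    new docdb.DatabaseCluster(this, "Cluster", {
      vpc,
      // Engine version that supports change streams.
      engineVersion: "4.0.0",
      masterUser: {
        username: "demoadmin", // assumed name
        // The generated password is stored in Secrets Manager under this name.
        secretName: "change-streams-demo/master", // assumed name
      },
      instanceType: ec2.InstanceType.of(
        ec2.InstanceClass.T3,
        ec2.InstanceSize.MEDIUM
      ),
      // Private subnets with outbound-only internet access.
      vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
      // Demo only: delete the cluster together with the stack.
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });
  }
}
```

The construct picks suitable private subnets from the VPC on its own, so nothing beyond the subnetType hint is needed.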

Enabling Change Streams


To enable change streams in DocumentDB, you need to configure them for a specific collection or the entire database. In a real-world scenario, this task would typically be performed through a script during deployment or manually on the cluster, but for this demo, a Lambda function is the simplest solution.

Direct access to the DocumentDB cluster is not possible when it's deployed in a private subnet of the VPC, so a Lambda function is used to configure change streams on the demo collection. This Lambda function is deployed within the VPC and exposed through API Gateway, enabling invocation from outside the VPC.

Change streams must be explicitly enabled for every collection you want to sync. You can do this by following Amazon DocumentDB's Enabling Change Streams instructions.

Change streams are disabled by default and can be enabled at the individual collection level, database level, or cluster level. To enable them on your collections, you can use the mongo shell to connect to Amazon DocumentDB and execute an admin command.


Here are the steps to enable change streams on a collection:

  1. Connect to Amazon DocumentDB using mongo shell.
  2. Enable change streams on your collection with the following code: db.adminCommand({modifyChangeStreams: 1, database: "inventory", collection: "product", enable: true});

It's recommended to enable change streams for only the required collections to avoid unnecessary data streaming.
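In the Lambda-based setup described above, the same admin command can be issued with the Node.js MongoDB driver. A sketch under stated assumptions — the connection string is supplied by the caller, and credential handling and TLS options are omitted for brevity:

```typescript
import { MongoClient } from "mongodb";

// Enables change streams on the inventory.product collection.
// In the demo, the URI would be built from the cluster endpoint and the
// credentials stored in Secrets Manager (an assumption, not shown here).
export async function enableChangeStreams(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    // Equivalent to the mongo shell command shown above.
    await client.db("admin").command({
      modifyChangeStreams: 1,
      database: "inventory",
      collection: "product",
      enable: true,
    });
  } finally {
    await client.close();
  }
}
```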

Event Source Mapping

Event Source Mapping (ESM) is the AWS Lambda feature that connects a DocumentDB change stream to a Lambda function. Lambda polls the stream on your behalf and invokes your function with batches of change events, so you don't have to run a long-lived consumer of your own.

Each change event describes a single modification: the operation type (insert, update, or delete), the namespace (database and collection), and, depending on the mapping's configuration, the full document.

Every event also carries a resume token in its _id field. Lambda uses this token to track its position in the stream, so processing can resume from where it left off even after a restart or failure.

When you create the mapping, you point it at the cluster, specify the database and collection to watch, and reference the Secrets Manager secret that Lambda uses to authenticate to the cluster.
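A hedged CDK sketch of such a mapping is shown below. The property names follow the CloudFormation AWS::Lambda::EventSourceMapping resource; the cluster ARN, secret ARN, batch size, and database/collection names are assumptions supplied by the surrounding stacks:

```typescript
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Construct } from "constructs";

// Wires a DocumentDB change stream to a Lambda function via an event
// source mapping. clusterArn and secretArn are assumed inputs.
export function addChangeStreamTrigger(
  scope: Construct,
  fn: lambda.IFunction,
  clusterArn: string,
  secretArn: string
): void {
  new lambda.CfnEventSourceMapping(scope, "DocDbEsm", {
    functionName: fn.functionName,
    eventSourceArn: clusterArn,
    startingPosition: "LATEST",
    batchSize: 10,
    // Watch only the demo collection.
    documentDbEventSourceConfig: {
      databaseName: "inventory",
      collectionName: "product",
      fullDocument: "UpdateLookup",
    },
    // Lambda authenticates to the cluster with this Secrets Manager secret.
    sourceAccessConfigurations: [{ type: "BASIC_AUTH", uri: secretArn }],
  });
}
```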


Optional Pipeline


You can pass an aggregation pipeline to watch() to filter or transform the change stream. This is useful when you care only about specific kinds of changes, such as inserted documents.

Only certain modifications are valid in a change stream pipeline, so be sure to check the server documentation for details.

To preserve the presence and value of the resume token in the change stream documents, a pipeline must not change the shape of the documents or remove their _id field.
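For example, a pipeline that keeps only inserted documents could look like this. It is plain data, shown here as it would be passed to the driver's watch() method:

```typescript
// A change stream pipeline that matches only insert events.
// Passing it to the driver would look like:
//   collection.watch(changeStreamPipeline, { fullDocument: "updateLookup" });
const changeStreamPipeline = [
  // Keep only inserted documents.
  { $match: { operationType: "insert" } },
  // Narrow the event, but keep _id intact: it carries the resume token.
  { $project: { _id: 1, fullDocument: 1, ns: 1 } },
];

console.log(JSON.stringify(changeStreamPipeline));
```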

Synchronization Deployment and Testing

To deploy and test the synchronization, you'll need to enable the service-linked role for the OpenSearch service, which can be done using the AWS CLI command.

The entire CDK code is organized into four stacks: change-streams-demo-vpc-stack, change-streams-demo-documentdb-stack, change-streams-demo-opensearch-stack, and change-streams-demo-lambda-stack. Each stack contains a specific component of the solution.

To deploy the entire solution, run the npm command, which will use your default AWS profile's account, region, and credentials.


After deployment, retrieve the URL of the API Gateway and invoke the config endpoint to create the demo collection and enable change streams.

To enable the ESM, execute the command, or open the sync Lambda function in the AWS console and enable the event source mapping in its list of triggers.

Add data to the DocumentDB cluster by invoking the POST method of the demo-data endpoint; it will be synchronized to the OpenSearch domain, and you can retrieve the synchronized documents by invoking the GET method of the same endpoint.

Monitor the execution and logs of the Lambda function using CloudWatch.

After testing the synchronization, delete the resources by invoking the command, which will destroy stateful resources like the DocumentDB cluster and OpenSearch domain.

All created resources are tagged with the Application tag, which has the value change-streams-demo, and you can double-check if all resources have been deleted using the Tag Editor of the AWS Resource Groups service.

Logging and Retention

Logging and retention are crucial aspects of DocumentDB change streams. Set the change stream log retention duration long enough that downstream consumers never fall behind the retention window; retaining at least 48 hours' worth of changes is a common recommendation.


Increasing the duration further, to retain seven days' worth of data, is also recommended. To adjust your change stream log retention duration, follow Amazon DocumentDB's Modifying the Change Stream Log Retention Duration instructions.

By default, change stream data is retained for only three hours, and the maximum retention duration is seven days. A longer window gives downstream consumers more time to catch up after an outage without losing changes.
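The duration is controlled by the change_stream_log_retention_duration parameter (in seconds) of the cluster parameter group. A sketch using the AWS SDK for JavaScript; the parameter group name is an assumption:

```typescript
import {
  DocDBClient,
  ModifyDBClusterParameterGroupCommand,
} from "@aws-sdk/client-docdb";

// Raises the change stream log retention to 7 days (604800 seconds).
// "change-streams-demo-params" is an assumed parameter group name.
export async function setChangeStreamRetention(): Promise<void> {
  const client = new DocDBClient({});
  await client.send(
    new ModifyDBClusterParameterGroupCommand({
      DBClusterParameterGroupName: "change-streams-demo-params",
      Parameters: [
        {
          ParameterName: "change_stream_log_retention_duration",
          ParameterValue: "604800",
          ApplyMethod: "immediate",
        },
      ],
    })
  );
}
```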

Solution Overview

To implement this solution, you follow a short series of steps, starting with enabling change streams on the Amazon DocumentDB collections.

Enabling change streams on the collections allows real-time data changes to be captured.

The next step is to create an OpenSearch Ingestion pipeline, which processes the change events and loads the data into OpenSearch Service.

This pipeline is a crucial component of the solution, and it's created separately from the Amazon DocumentDB collections.

Once the pipeline is set up, you can load sample data onto the Amazon DocumentDB cluster.


Loading sample data is an important step, as it lets you exercise the pipeline end to end before moving forward.

After loading it, verify the data in OpenSearch Service to ensure everything is working as expected: check the indexed documents for accuracy and completeness, and adjust the pipeline if anything is missing.

Here are the steps outlined in a concise list:

  1. Enable change streams on the Amazon DocumentDB collections.
  2. Create the OpenSearch Ingestion pipeline.
  3. Load sample data on the Amazon DocumentDB cluster.
  4. Verify the data in OpenSearch Service.

Frequently Asked Questions

What is a change stream?

A change stream is a feature that enables real-time access to data changes, simplifying the process of tracking updates. It eliminates the need for manual oplog tailing, reducing complexity and risk.

Calvin Connelly

Senior Writer

Calvin Connelly is a seasoned writer with a passion for crafting engaging content on a wide range of topics. With a keen eye for detail and a knack for storytelling, Calvin has established himself as a versatile and reliable voice in the world of writing. In addition to his general writing expertise, Calvin has developed a particular interest in covering important and timely subjects that impact society.
