Migrating standalone Mongo to Atlas

Some time ago I worked on a very interesting project, which was a bit different from everything else I was used to doing. It was about migrating our production database from a standalone self-hosted MongoDB instance to one of the cloud providers.

Evaluating our options

At first, I summarized the different options we could go with. They included continuing to use a self-hosted cluster as well as moving to one of the well-known cloud DBaaS providers. We chose Mongo Atlas as it suited our needs and the price was acceptable.

One of the most important things I learned at this stage was that it is a real pain to run your own MongoDB cluster, especially without having a dedicated DevOps engineer.

Preparing the application

Long before the actual migration began, I started figuring out how the process should go to achieve close to zero downtime and as few restarted jobs as possible.

The application was deployed in Docker containers on AWS ECS. Most of its components were very easy to restart without losing anything, but there were also some long-running tasks (jobs) that we wanted to keep running no matter what. A sudden disconnect from the database could mean failing some of them if the new replica set did not become available within a very short time. To give connections enough time to recover, the system was adjusted to retry database connections for much longer before failing than it did before. This should not be the case in a stable production environment, since problems should surface as quickly as possible.
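The exact mechanism depends on the driver, but as a rough sketch the same effect can be achieved with standard connection string options; the host, credentials, and values below are illustrative, not the ones we actually used:

# Illustrative values only: let the driver spend up to two minutes looking
# for a healthy server before failing a job, instead of giving up right away
mongodb://app-user:***@db.internal:27017/appdb?serverSelectionTimeoutMS=120000&connectTimeoutMS=30000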

Testing in staging environment

It was very clear that such a migration had to be rehearsed. I learned a lot from migrating a standalone MongoDB instance in the staging environment to Atlas, and I surely felt more confident doing the same things again in production. You should become very familiar with the process before you touch your production database.

Cluster preparation

The migration process could be time-consuming, especially if something went wrong. That’s why any task that could be completed separately in advance should have been done as early as possible. Setting up a cluster in Atlas is a pretty easy task, but there is a lot to test. Did we add all the addresses to the IP whitelist in AWS? Did we create a user with enough permissions for the Atlas Migration Tool? Are the new instances accessible from our app? Is VPC peering working properly?

Once the answers to all these, and some other questions, were positive, the process could be started.
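A quick way to answer the connectivity questions is to probe the new cluster from an application host. A minimal check along these lines (hostnames and user are placeholders, not our real ones) could look like:

# Is the new cluster reachable from an app host over the peered VPC?
nc -zv cluster0-shard-00-00.example.mongodb.net 27017

# Can the application user actually authenticate against it?
mongo "mongodb+srv://cluster0.example.mongodb.net/admin" -u app-user --password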

Migration from a standalone instance to a Mongo Atlas replica set

First, I needed to convert our standalone MongoDB deployment to a single-node replica set, since it was impossible to migrate a standalone instance to Atlas directly. This step is covered in most of the guides:

# /etc/mongod.conf
replication:
  replSetName: rs0

followed by a service restart. This part is quick and causes very little downtime. Our app knew how to handle short database hiccups, so that was fine. Since it was a new replica set, one more command was required: rs.initiate(), run from the MongoDB shell as admin (mongo -u admin --authenticationDatabase admin admin --password). It takes very little time as well, but you need to initiate the new replica set as soon after the service restart as possible.
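Assuming mongod runs under systemd, the whole step boils down to something like this (the admin user is whatever you use on your own instance):

# Restart mongod so it picks up the replication settings
sudo systemctl restart mongod

# Connect as admin (prompts for the password) ...
mongo -u admin --authenticationDatabase admin admin --password

# ... and inside the shell, initiate the single-node replica set
rs.initiate()
rs.status()    # the node should report itself as PRIMARY shortly after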

At this point, everything was ready for Live Import. Our app had a fairly small database, so this step did not take too much time. While it was running, we constantly monitored the application and made sure it worked properly.

Another test we decided to run, since we had some time during the data import, was to create a very small application using the same stack as our main app and make sure the exact code we use in production was able to communicate with the new cluster. That’s how we ensured that the drivers we used were compatible with the MongoDB version Atlas was running.
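Independent of that driver test, a quick round trip from the shell already tells you whether basic reads and writes work against the new cluster (the URI and user are placeholders):

mongo "mongodb+srv://cluster0.example.mongodb.net/smoketest" -u app-user --password

# inside the shell: write a document and read it back
db.ping.insertOne({ checkedAt: new Date() })
db.ping.find().pretty()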

Cutover

When the cutover option became available in Atlas, we updated the application to use the new connection string. All the tasks that started after this point were using Atlas; all the other tasks were still using our old database. That was OK, since Live Import was still running and was able to catch up with all the changes on the legacy instance almost instantly.
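In practice this was just a configuration change: the old standalone URI was swapped for the Atlas replica set URI. A sketch with a made-up variable name, hosts, and credentials:

# Before: the legacy instance (by now a single-node replica set)
MONGO_URL=mongodb://app-user:***@10.0.1.15:27017/appdb

# After: the Atlas replica set, SRV-style connection string
MONGO_URL=mongodb+srv://app-user:***@cluster0.example.mongodb.net/appdb?retryWrites=true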

We restarted almost all containers with the new code. ECS made this part easy. There were still some jobs using the old database though. We knew that some of them were supposed to take a long time to complete; those jobs were killed and restarted. The other jobs were given some time to finish.
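On ECS the restart itself is just a forced new deployment of each service; one way to do it, with placeholder cluster and service names (the exact flow depends on how the image and configuration are delivered):

# Replace all running tasks in the service so they pick up the new connection settings
aws ecs update-service \
  --cluster production \
  --service app-web \
  --force-new-deployment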

To tell how many connections were left to the legacy instance, we used the following command:

netstat -tn 2>/dev/null | grep :27017 | awk '{print $5}' | \
cut -d: -f1 | sort | uniq -c | sort -nr | head

We waited for a while until we saw only Atlas IP addresses in the output. That was the point where everything was connected to Atlas, and we shut the old node down.
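Shutting the node down was the usual service stop (again assuming systemd), plus disabling it so it would not come back after a reboot:

sudo systemctl stop mongod
sudo systemctl disable mongod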

The final steps were to set up alarms in Atlas and create user credentials for everyone who was granted access to the database. In the following days we rolled back the longer retry intervals that had been added to deal with short database hiccups during the migration.

This operation was successful mostly because of the extensive testing in staging and a very methodical and cautious process in production.