At mx51, we have a team of engineers dedicated to building robust bank-grade platforms that process millions of events and payment transactions on a daily basis. Maintaining platform and data integrity is critical given our platform is needed to support the day-to-day operations of the merchants that use our payment solutions.
Platform migration may seem like a simple task, but it carries risks, and keeping the platform and its data secure and intact throughout poses significant challenges. Moreover, timing is everything when staging the steps to extract, transform and load data onto the destination so that disruption to the business is minimal.
Following our recent migration from ECS to Kubernetes, Nikola Sepentulevski, Senior Engineer at mx51, shares our platform migration journey.
Before we embark on the migration journey, I would like to highlight the following points that helped build the new platform. To give some context on mx51's move to Kubernetes, we had prepared the following:
A decision register - This is a spreadsheet listing the decisions we made and why we made them. This made decisions much easier to track for future reference.
New GitOps tools - We were already using a GitOps approach on ECS; however, some of the tooling was limited. We decided to freshen up and use:
New documentation - As we were building the platform we would store our documentation in Git. Even though the source of truth can be found in our IaC (Infrastructure as Code), we wanted to provide a greater level of detail especially for those who are non-technical.
Developer Resource - Developers are one of the main stakeholders of Kubernetes. Since a lot of the tooling around our ecosystem was going to change, we had a developer resource dedicated to helping build the developer experience.
Security Resource - Security comes first! Our security architect was heavily involved in our migration to ensure we were building a secure platform. As a result, a new Kubernetes security tool, Kubestrike, was born.
Performance Testing - To ensure Kubernetes would meet our performance requirements, we provisioned a dedicated account just for performance testing. This pushed Kubernetes to its limits to ensure that what we were building was robust.
New DNS - To test any public endpoints, a new DNS zone was provisioned. This enabled our engineers to test the new platform in parallel.
Compute Migration Strategy selection
Our migration from ECS to Kubernetes was split into two stages: the first was the migration of compute, and the second was the migration of data.
By migration of compute, we mean the orchestration platform running our microservices. Staging the transfer of compute and putting contingency plans in place is critical so that any abnormalities can be identified without affecting platform and data integrity.
To start sending traffic to Kubernetes, the first phase of the migration was to distribute traffic between the two platforms. This is referred to as the “Strangler Pattern”, which is used when testing new components or when moving away from a legacy platform. It provides the ability to start testing with a small amount of traffic. We used Route 53 weighted records with a low TTL. A low TTL limits how long resolvers cache a DNS answer, which helps if a rollback is required.
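As a rough illustration of what a weighted record pair might look like, here is a minimal sketch that builds the change payload for two weighted records. The record name, DNS targets, set identifiers and the 60-second TTL are all hypothetical, and the use of CNAME records is an assumption; this is not the configuration mx51 actually ran.

```python
from typing import Any


def weighted_change_batch(
    record_name: str,
    targets: dict[str, tuple[str, int]],
    ttl: int = 60,
) -> dict[str, Any]:
    """Build a Route 53 change batch with one weighted CNAME per platform.

    `targets` maps a set identifier (e.g. "ecs") to (DNS target, weight).
    Route 53 answers with each record in proportion to weight / sum(weights).
    """
    changes = []
    for set_id, (dns_target, weight) in targets.items():
        # Route 53 accepts weights from 0 to 255 inclusive.
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {set_id!r} must be between 0 and 255")
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,  # distinguishes records sharing a name
                "Weight": weight,
                "TTL": ttl,               # low TTL keeps rollback fast
                "ResourceRecords": [{"Value": dns_target}],
            },
        })
    return {"Comment": "Strangler pattern traffic split", "Changes": changes}


if __name__ == "__main__":
    # Hypothetical names: 25% of lookups resolve to Kubernetes, 75% to ECS.
    batch = weighted_change_batch(
        "api.example.test.",
        {"ecs": ("ecs-lb.example.test", 75), "kubernetes": ("k8s-lb.example.test", 25)},
    )
    print(batch)
```

A payload of this shape can be passed as the `ChangeBatch` argument of boto3's `change_resource_record_sets` call, together with the hosted zone ID.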
As part of the preparation for the Strangler Pattern, Kubernetes had to access the data sources of the legacy platform (ECS), which were RDS and ElastiCache. For the applications in Kubernetes to reach the RDS and ElastiCache instances, we set up a VPC peering connection between the two networks, only allowing Kubernetes to access the ECS private network.
We divided the “Strangler Pattern” into 4 stages:
Stage 1 - Send 25% of traffic to Kubernetes and 75% to ECS
Stage 2 - Send 50% of traffic to Kubernetes and 50% to ECS
Stage 3 - Send 75% of traffic to Kubernetes and 25% to ECS
Stage 4 - Send 100% of traffic to Kubernetes and 0% to ECS
Fig 1. Stage 1 of Strangler Pattern
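The four stages above can be expressed as a small lookup from stage number to record weights. The sketch below is illustrative only: the stage percentages come from the list above, while stage 0 (all traffic on ECS) and the one-step rollback helper are assumptions added for the example.

```python
# Percentage of traffic sent to Kubernetes at each stage.
# Stages 1-4 are taken from the text; stage 0 is a hypothetical
# "fully rolled back" state added for illustration.
STAGES = {0: 0, 1: 25, 2: 50, 3: 75, 4: 100}


def stage_weights(stage: int) -> dict[str, int]:
    """Return the DNS weights for each platform at a given stage."""
    kubernetes = STAGES[stage]
    return {"kubernetes": kubernetes, "ecs": 100 - kubernetes}


def traffic_share(weights: dict[str, int]) -> dict[str, float]:
    """Weighted routing sends each target weight / sum(weights) of the traffic."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be greater than zero")
    return {name: weight / total for name, weight in weights.items()}


def rollback_stage(stage: int) -> int:
    """Step back one stage, never going below stage 0 (assumed policy)."""
    return max(stage - 1, 0)


if __name__ == "__main__":
    for stage in range(1, 5):
        print(stage, stage_weights(stage), traffic_share(stage_weights(stage)))
```

Keeping the stages in one table makes each step, and any step backwards, a one-line change to the weights rather than a manual edit of the records.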
We phased the traffic over to Kubernetes across a couple of days, to give us enough time to monitor and review logs. A handful of engineers were on standby for those couple of days to make this possible.
Database Migration Strategy selection
Selecting the right database migration strategy is a critical first step when embarking on your migration journey. We considered numerous factors when choosing our strategy, including data integrity, expected downtime and ease of transfer. We had 16 databases of varying size to transfer and, to keep the migration time as short as possible, we assessed a number of transfer options.
For our homogeneous database migration, we considered the following options:
Option 1: Full data load and CDC (Change Data Capture) using DMS
This is a bleeding-edge solution with no downtime. It would have allowed replication between two RDS instances to run over a period of a week, giving us time to test the new database and build our confidence for the final switch. However, we did not pursue this option due to the following limitations:
Our databases used data types that are a known limitation of CDC with DMS
There were scenarios that could compromise data integrity
DMS was not recommended for homogeneous migrations such as ours
Fig 2. Option 1 of database migration
Option 2: Data dump and restore without any downtime
We assessed this option using native database tools without any downtime. The benefit of this option was the ability to switch traffic to the new RDS instance while we backfilled data. However, we did not go with this option for the following reasons:
Backfilling the data took too long
There would be a period during which some data was missing for our end users
Fig 3. Option 2 of database migration
Option 3: Data dump and data restore with scheduled downtime
This option uses native tools to migrate data from one RDS instance to another. Even though we had to schedule downtime, we selected this option because it kept our confidence in data integrity high. After running some tests to estimate how much downtime we needed to schedule, we implemented a robust process for the day we kicked off the migration.
Fig 4. Option 3 of database migration
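To make the dump-and-restore approach concrete, here is a minimal sketch of how such a step could be scripted. The article does not name the database engine, so this assumes PostgreSQL and its native `pg_dump` and `pg_restore` tools; the host names, user and database names are hypothetical, and credentials are assumed to come from the environment (for example a `.pgpass` file).

```python
import subprocess
from pathlib import Path


def dump_command(host: str, user: str, database: str, dump_file: Path) -> list[str]:
    """Build a pg_dump command that writes a custom-format archive."""
    return [
        "pg_dump",
        "--host", host,
        "--username", user,
        "--dbname", database,
        "--format=custom",  # archive format readable by pg_restore
        "--file", str(dump_file),
    ]


def restore_command(host: str, user: str, database: str, dump_file: Path) -> list[str]:
    """Build a pg_restore command; the target database must already exist."""
    return [
        "pg_restore",
        "--host", host,
        "--username", user,
        "--dbname", database,
        "--no-owner",        # do not try to restore original object ownership
        "--exit-on-error",   # stop at the first failure instead of continuing
        str(dump_file),
    ]


def migrate_database(database: str, source_host: str, target_host: str,
                     user: str, work_dir: Path) -> None:
    """Dump one database from the source and restore it into the target."""
    dump_file = work_dir / f"{database}.dump"
    # check=True raises CalledProcessError if either tool exits non-zero.
    subprocess.run(dump_command(source_host, user, database, dump_file), check=True)
    subprocess.run(restore_command(target_host, user, database, dump_file), check=True)


if __name__ == "__main__":
    # Hypothetical values; print the commands rather than running them.
    print(dump_command("old-rds.example.test", "migrator", "merchants", Path("merchants.dump")))
    print(restore_command("new-rds.example.test", "migrator", "merchants", Path("merchants.dump")))
```

With writes stopped for the scheduled window, a loop over the database names calling `migrate_database` would cover each database in turn, and failing fast on any error helps protect data integrity.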
We settled on Kubernetes, with everything running remotely through a continuous pipeline. Like a house on wheels moving to another street without anything changing, we were able to switch platforms from our legacy container service to Kubernetes with platform data integrity maintained. We were able to complete the migration to another AWS account, maintaining integrity with less than 40 minutes of scheduled downtime.
In total, the process of transferring millions of records including terminal, merchant and user entry data took three hours between two resources and downtime was compressed to 40 minutes.
By following this process, we were able to maintain platform and data integrity, minimise downtime and achieve consistency in the pipeline. This was possible because we were given the freedom to test and prototype different options and were able to automate steps that could be scheduled in real time.
As a result of this migration, we can now onboard new tenants onto the platform faster, delivering a better experience for our customers.
If you’re interested in learning more about our white-label platform solution, get in touch.