How we’ve been learning to live in clouds

Our client develops an integrated system for managing non-profit organizations: a web-based SaaS product that helps automate the daily workflow of various organizations around the world. Currently more than 15,000 companies and 6,000,000 professionals are using the product, and the numbers keep growing.

In other words, our customer offers an American membership management system for creating communities and clubs. Each club gets a separate site with its own template.


Task

The company decided to migrate to the Amazon Web Services (AWS) cloud platform. The move was meant to reduce infrastructure costs, ensure 100% availability and make the system easier to scale.

After the migration they needed to make sure the system had not degraded: measure its performance, run synthetic tests, and set up online load monitoring that would allow test results to be compared.

Here are the problems that needed a solution:

At the start of the project the customer sent us logs amounting to 50 GB, despite covering only one month. We had to process them somehow in order to build the load profile, but our servers didn’t have enough resources. So we had to ask ourselves: what do we do? The answer came right away – we had to develop our own way to process the logs.

We’ve decided to split the log processing into two steps.

01

In the first step we removed the unnecessary information using PowerShell.

02

In the second step we combined the data, calculated the parameters and built diagrams using Python and its pandas library (a minimal sketch of this step is shown below).
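For illustration, here is a minimal sketch of what the second step can look like in pandas. The file name and column names ("timestamp", "url") are assumptions made for the example; the actual log layout is not reproduced here.

```python
# A minimal sketch of step 2, assuming a pre-filtered CSV with one request
# per line and "timestamp" / "url" columns (assumed names, not the real ones).
import pandas as pd

logs = pd.read_csv("filtered_requests.csv", parse_dates=["timestamp"])

# Count requests per operation per hour to build the load profile.
profile = (logs
           .assign(hour=logs["timestamp"].dt.floor("h"))
           .groupby(["hour", "url"])
           .size()
           .rename("requests")
           .reset_index())

# Peak requests per second for each operation, used to size the load.
peak_rps = (profile.groupby("url")["requests"].max() / 3600).sort_values(ascending=False)
print(peak_rps.head(10))

# A quick diagram of the total hourly load (requires matplotlib).
profile.groupby("hour")["requests"].sum().plot(title="Requests per hour")
```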

In short, analyzing the first log batch took us about 2 days.

As the system lives in the cloud, servers not only get replaced often, but new ones are also spun up whenever more resources are needed. That’s why everything should work out of the box. So we’ve baked Telegraf into all virtual machine images. We’ve distributed the monitoring across 29 continuously used servers, plus 3 performance testing servers and 1 server that collects the monitoring data. We’ve created 7 Telegraf configuration files that collect hardware usage information and monitor the servers.

[Figure: learning-to-live-in-clouds steps]

In order to retrieve more precise data, we’ve optimized the database queries (accumulating more information in a single query), hidden rarely used diagrams, configured notifications and displayed them on a separate dashboard. We’ve monitored the alert list only partially, so as not to cause an overload.

As a result of the monitoring work we’ve got 11 dashboards in Grafana, each filled with diagrams. We’ve also monitored JMeter itself: error percentage, TPS and response times.
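For reference, the same JMeter metrics can be pulled out of a JTL results file with a few lines of pandas. This is a sketch that assumes the default CSV JTL columns (timeStamp, elapsed, label, success), not a copy of the dashboards we actually built.

```python
# Sketch: deriving error %, TPS and response times from a CSV JTL file.
# Assumes the default JMeter columns: timeStamp, elapsed, label, success.
import pandas as pd

jtl = pd.read_csv("results.jtl")
jtl["timeStamp"] = pd.to_datetime(jtl["timeStamp"], unit="ms")

failed = jtl["success"].astype(str).str.lower() != "true"
error_pct = 100 * failed.mean()

# Transactions per second: samples bucketed into one-second intervals.
tps = jtl.set_index("timeStamp").resample("1s")["label"].count()

# Response-time statistics per request label.
resp = jtl.groupby("label")["elapsed"].describe(percentiles=[0.9, 0.95])

print(f"Error rate: {error_pct:.2f}%, average TPS: {tps.mean():.1f}")
print(resp[["mean", "90%", "95%", "max"]])
```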

As we approached the synthetic tests, the customer asked us to measure practically every parameter. Again there was a question: how? So we analyzed the relevant tools that would let us test quickly and to the highest possible standard.

The key selection criterion was that the tools had to be cross-platform.

We’ve chosen the following instruments for synthetic tests:

  • CPU – LINPACK;
  • Memory – pmbw;
  • Network – iperf3;
  • Disk – the well-known fio.

After we had settled on the list of tools, we started determining the metrics.

We’ve chosen the following metrics: for disk – the block size; for memory – block sizes again, as well as the access types, such as read/write and random access. The customer wanted to run the synthetic tests before starting each machine, to make sure it performed as well as the others. Without a chosen set of metrics, testing a single server lasted 8 hours! With them it took only half an hour.
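To give an idea of how such a pre-start check can be scripted, here is a hedged sketch that runs fio and iperf3 with one fixed set of parameters instead of a full sweep. The block size, durations, file paths and the iperf3 server address are illustrative assumptions, not the customer’s actual configuration.

```python
# Sketch of a pre-start synthetic check: fixed block size and short runs
# instead of exhaustive sweeps. All parameter values are illustrative.
import json
import subprocess

def run_fio(block_size="4k", mode="randread", runtime_s=60):
    """One fio run for a single block size / access pattern."""
    out = subprocess.run(
        ["fio", "--name=precheck", f"--rw={mode}", f"--bs={block_size}",
         "--size=1G", "--time_based", f"--runtime={runtime_s}",
         "--filename=/tmp/fio_precheck", "--output-format=json"],
        capture_output=True, text=True, check=True)
    job = json.loads(out.stdout)["jobs"][0]
    return job["read"]["iops"] if "read" in mode else job["write"]["iops"]

def run_iperf3(server_host, duration_s=30):
    """Network throughput against an iperf3 server (address is an assumption)."""
    out = subprocess.run(
        ["iperf3", "-c", server_host, "-t", str(duration_s), "-J"],
        capture_output=True, text=True, check=True)
    bps = json.loads(out.stdout)["end"]["sum_received"]["bits_per_second"]
    return bps / 1e9  # Gbit/s

if __name__ == "__main__":
    print("disk IOPS (4k randread):", run_fio())
    print("network Gbit/s:", run_iperf3("10.0.0.10"))
```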

It should also be taken into account that all testing data may be incorrect, because AWS hard disks experience BURST. Each hard disk comes with a credit of 5 million input/output operations, roughly 30 minutes’ worth. This credit allows the disk to perform more input/output operations than purchased, so that all applications can start quickly and simultaneously. After the credit has been used up, the input/output rate drops back to the previous level. If we go below the baseline (about 3,000), the credit starts accumulating again. It all depends on the hard disk type, its volume, etc.

Additionally, many disks are optimized for a certain block size, which should also be taken into account. The formulas for calculating the BURST also depend on the hard disk type. So before we could start the synthetic monitoring and load tests, we had to load the system heavily enough to exhaust the BURST; otherwise the performance testing results were absolutely unrealistic.
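As a rough illustration of why this matters, the time to drain the burst bucket can be estimated from the credit balance and the gap between the driven and baseline I/O rates. The sketch below reuses the figures mentioned above; the driven IOPS value is an assumption.

```python
# Rough estimate of how long the burst credit lasts once the load
# exceeds the baseline. Credit and baseline figures are taken from the
# text above; the driven IOPS value is an assumption.
CREDITS = 5_000_000      # I/O credit balance
BASELINE_IOPS = 3_000    # baseline rate below which the credit refills
DRIVEN_IOPS = 6_000      # assumed I/O rate during the warm-up load

# Credits drain at (driven - baseline) operations per second.
drain_seconds = CREDITS / (DRIVEN_IOPS - BASELINE_IOPS)
print(f"Burst credit exhausted after ~{drain_seconds / 60:.0f} minutes")  # ~28 min
```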

Eventually we’ve started the load testing. After the first runs we’ve discovered the following problems:

01

There were no available connections: we were testing a content management system with 5,800 separate sites, so it was unlikely that a virtual user would return to the same site while the connection to it was still alive. As a result the server ran out of established connections and the load kept falling.

02

Memory leaks – JMeter was leaking around 6 GB per hour, so under a heavy load a test could run for only 6 hours instead of 12. And don’t forget that we work in AWS, where we fight for every gigabyte.

We’ve changed the following parameters on one machine in order to increase the number of available connections (a registry sketch of these settings follows the list):

  • MaxUserPort – the number of available (ephemeral) ports;
  • TcpTimedWaitDelay – how long a closed socket stays in the TIME_WAIT state;
  • TcpFinWait2Delay – how long a half-closed connection is kept.
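For completeness, here is a hedged sketch of how these TCP parameters can be set programmatically on a Windows load generator using Python’s winreg module. We most likely changed them by hand through the registry editor; the concrete values below are illustrative, not the ones we actually used.

```python
# Sketch: setting the TCP registry parameters on a Windows load generator.
# Values are illustrative; requires administrator rights and a reboot.
import winreg

TCP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
values = {
    "MaxUserPort": 65534,     # enlarge the ephemeral port range
    "TcpTimedWaitDelay": 30,  # seconds a closed socket stays in TIME_WAIT
    "TcpFinWait2Delay": 30,   # seconds to keep a half-closed connection
}

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, TCP_PARAMS, 0,
                        winreg.KEY_SET_VALUE) as key:
    for name, value in values.items():
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, value)
```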

It helped, but after some time the test crashed again. Then we’ve decided to switch off keep-alive in the scripts, because the tests rarely return to the same site. The problem was solved.

To find the cause of the memory leak, we’ve configured monitoring for the Java Virtual Machine: we attached a Java agent to JMeter, retrieved the memory dumps and analyzed them. The Java agent data were sent directly to the database and monitored in real time in Grafana. This allowed us to immediately evaluate the changes we made in the Java Virtual Machine and in the script.

Analyzing the memory dumps, we found out that the HTTP request classes were taking up most of the memory: they accumulate all the request information but are not reclaimed by the garbage collector. That’s where the memory leaks came from.


As we had a heap of 7 GB, we’ve solved the problem by migrating to BlazeMeter’s plugins for JMeter and the Free-Form Arrivals Thread Group; it was a process of trial and error that took a lot of time.

Results

All parameters and configurations in the cloud differ greatly from the same hardware outside the cloud, and there are many details to take care of. The system didn’t work on either the first or the second run, but by the eighth iteration we succeeded.

We’ve applied new approaches and used new JMeter features that made our work much easier. During testing we’ve selected the best AWS hardware configuration for the customer’s system. The AWS migration was a success!
