Lessons learned: Performing load tests on a spike-oriented, CPU-intensive JVM application

Krzysztof Owczarek
8 min read · Aug 13, 2024


Some time ago, I was working on designing and performing load tests of a backend application providing a completely new service for our clients.

The application had to perform flawlessly under heavy traffic, which was expected to arrive mostly in the form of sudden spikes. Load testing was supposed to uncover potential problems in the code, help with the optimization process, and, most importantly, prove that the service could meet all of its performance goals.

It was a very successful endeavor. It helped us discover performance bottlenecks, led to major cost cuts on Kubernetes, and, above all, played a major role in building trust in our code.

Thorough, repeatable testing lets you embrace change without fear, providing a fast feedback loop whenever you need it.

Load tests are an important step in finding flaws in the code, as well as in balancing resource assignments (memory and CPU), finding a proper JVM configuration (garbage collector, Xms/Xmx), and choosing a replica count (with and without HPA) that achieves the required performance with as few resources as possible.

Here are some thoughts, observations, and lessons learned that may help you find your path to proper load testing.

Why bother load testing? Pick your reasons; there are many!

The application under test was an orchestrator of many integration calls. The majority of those calls were synchronous, which effectively meant that each call was fed with data from the previous ones. This had major consequences (a rough sketch follows the list):

  • the application's response time was the sum of all intermediate round-trip times plus the latency added by the orchestrator itself,
  • every call to an external service took time that could not be spent idly waiting; the CPU had to switch its focus to other threads that were ready to run,
  • every operation that prevented the CPU from switching to other tasks, such as synchronized blocks or thread pinning (with virtual threads), affected the whole system,
  • tasks that did not end quickly kept intermediate objects alive, preventing the GC from removing them from memory efficiently, leaving a considerable memory footprint and possibly forcing the GC to stop the world for a major cleanup, or at least to spend more CPU time running cleanups more frequently,
  • thread blocking, pinning, and tasks that take too long, combined with a steady flow of new requests, would result in CPU throttling.
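To make the latency arithmetic concrete, here is a rough, hypothetical sketch of such a synchronous orchestration (the service names and endpoints are made up, not taken from the real application). Each call feeds the next one, so the total response time is roughly the sum of all round trips, and the handling thread stays blocked for that whole duration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical orchestrator: three dependent, synchronous calls.
// Total response time is roughly call1 + call2 + call3 + local processing,
// and the handling thread is blocked for that entire period.
public class SequentialOrchestrator {
    private final HttpClient client = HttpClient.newHttpClient();

    public String handle(String orderId) throws Exception {
        String customer = call("https://customers.internal/by-order/" + orderId);
        String quote = call("https://pricing.internal/quote?customer=" + customer);
        return call("https://fulfilment.internal/submit?quote=" + quote);
    }

    private String call(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // Blocking send: the thread waits for the full round trip before moving on.
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```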

Load testing will not only assure you that your application can handle the required load, it can also help you find bottlenecks in both code and configuration. It can assist you in tuning the JVM and in finding the right resource assignments and replica count for your Kubernetes deployment. It may also exercise the performance of your downstream and upstream services.

Eliminate external services in the initial phase of testing

In the initial load-testing phase, you should focus only on testing the performance of your code. Introducing external systems’ latencies, deployment quirks, performance issues, and possibly optimization problems will cloud your view and distract you from spotting issues within your codebase.

Introduce mocks for the external services and configure your application to use them instead of the real ones. Remember to add artificial latencies to the responses, with values similar to the average response time of the mocked service.

To make the mocks behave even more realistically, let the latencies vary a little on each call, for example by drawing each one from the range <avg − X, avg + X> milliseconds, where X is a small random jitter.
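For illustration only, here is a minimal, self-contained sketch of such a mock using the JDK's built-in HTTP server; the endpoint, port, average latency, and jitter values are assumptions made up for the example:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;

// Minimal mock of an external service that adds randomized artificial latency.
// AVG_MS and JITTER_MS should roughly mimic the real dependency's response times.
public class LatencyMock {
    private static final long AVG_MS = 120;   // assumed average latency of the real service
    private static final long JITTER_MS = 30; // each response varies by up to +/- 30 ms

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8089), 0);
        server.createContext("/customers", exchange -> {
            long delay = AVG_MS + ThreadLocalRandom.current().nextLong(-JITTER_MS, JITTER_MS + 1);
            try {
                Thread.sleep(delay); // the artificial latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = "{\"id\":\"42\",\"name\":\"mocked\"}".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.setExecutor(Executors.newFixedThreadPool(16)); // handle concurrent test traffic
        server.start();
    }
}
```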

Find an initial, reasonable static configuration for your deployment. Favor bigger JVMs with fewer replicas at the beginning; more requests flowing through a single instance makes it easier to spot potential problems in your codebase.

Introduce integrations with external services one by one

After your application has passed the previous phase with flying colors and you have found an initial deployment configuration that handles the load generated by the test scenario, it is time to replace the mocks with real services.

  • introduce external integrations one by one,
  • rerun load tests each time, making sure the test is still passing,
  • in case of any problems, resolve them immediately in cooperation with the owners of the service in question,
  • rerun load tests after each change.

At this point, you may feel that finding bottlenecks in someone else's service and pushing them to work on a more robust solution is a rude thing to do, but you would be mistaken!

Finding such problems at an early stage, especially before going live with an important service, gives the team time to prepare a fix without haste, helps them build up knowledge of and trust in their service, and saves all parties from busy on-calls in the future.

Optimizing resource assignments and replica count for the K8s deployment

This is an important step that is often neglected or even skipped in many projects. Organizations of all shapes and sizes tend to over-provision their deployments with values that have nothing to do with their actual needs.

Over-provisioning is not always a bad thing, provided you know why you are doing it. In most cases, though, it is just a way of covering performance problems or design flaws with money.

Putting your code under strain, observing how it behaves, and understanding what you see are crucial for proper optimization and for introducing further improvements:

  • perform load testing on various deployment configurations, observe how the application behaves after the changes,
  • you can begin this phase using mocks of the external services from the previous section, but you don’t necessarily have to,
  • start with a reasonable setup you discovered in previous sections and use it as a base for further experimentation,
  • organize your testing and do not change many things at once. Focus on a single configuration value, rerun the load tests every time, observe the monitoring, and take notes.

Finding an optimal setup is not easy and can take a long time. When you feel you have already optimized a lot and the test is still passing, deploy the configuration and observe how it behaves in production.

Further optimizations — configuring Horizontal Pod Autoscaler (HPA) for the deployment

Our test case was special because the traffic we had to deal with was expected to appear in spikes. The spikes we prepared for were rare but demanding: we wanted the service to handle surges of hundreds of requests per second.

We already had a well-balanced, working static deployment with optimized resources and replica count. Nevertheless, the application was running idle most of the time, so there was still room for further optimization.

Introducing the Horizontal Pod Autoscaler (HPA) can go wrong, and here's why

As simple as it may look, reducing the replica count by some amount and configuring the HPA to bring that number back up under load will not always work as expected.

I am not a Kubernetes expert and most of the following points are based on my observations and assumptions, but some of them have proven to have merit, so I would like to share them with you:

  • using a static deployment for your application makes the cluster reserve its resources on the available nodes upfront.

When the HPA starts to scale out pods, the cluster has to allocate a significant amount of additional resources on the available nodes. It may happen that the existing nodes are not enough and the whole cluster has to scale out too.

As a result, the load test will likely not pass, since provisioning new nodes usually has a huge negative effect on the performance of the service during the spike,

  • fresh JVMs (in most versions) use many more CPU cycles until they warm up (class loading, JIT compilation), so if you have trimmed your CPU assignments very tightly, you might experience CPU throttling due to that additional, temporary usage.

Combine this unaccounted-for CPU usage with a constant flow of incoming requests, and performance will degrade over time until the app starts to time out on its liveness/readiness probes and the pod is killed. A new pod will, of course, be spawned in its place, but it will quickly share the same fate.

Increasing the CPU assignment in your deployment configuration may help you mitigate this issue, but it will also leave you with over-provisioned CPU once the JVM has warmed up, as the additional cycles will sit unused most of the time.

Do not limit CPU assignments!

When configuring CPU assignments for your deployment, do not set a limit; the request/limit concept works differently for CPU than it does for memory.

It is not common knowledge that the CPU request alone guarantees that your pod will be assigned the requested CPU cycles (as long as there are physical CPU resources available on the cluster). It is therefore the requests of the other pods, not your limit, that keep them from being robbed of their CPU cycles by your greedy deployment.

When you do not set a CPU limit for your pods, they are allowed to use more cycles than they requested, but only when there are idle CPU cycles available on the node.

It makes a huge difference because it not only solves the JVM warmup problem described in the previous section but also lets the initial replicas keep up with the incoming requests until HPA spawns more instances.

Here is a link to the article that I found very informative: https://home.robusta.dev/blog/stop-using-cpu-limits

Here is a link to the K8s design proposal that describes resource guarantees: https://github.com/kubernetes/design-proposals-archive/blob/8da1442ea29adccea40693357d04727127e045ed/node/resource-qos.md#compressible-resource-guarantees

Scale out quickly, but scale down slowly

When your application has to deal with intense incoming traffic, you should scale out as soon as you notice the first signs of it, and scale back down cautiously. In the worst case, you will run too many replicas for a few minutes until the cluster cleans them up, but in most cases, you will be ready to handle the traffic before it gets heavy.

Find the right Garbage Collector for your use case

There are several GCs available for the JVM, each with its own requirements and performance characteristics.

Choosing the right GC can have a significant impact on your CPU and memory usage. Depending on its memory cleanup strategy, it can also affect the latency of your service.

Many sources on the internet will help you choose GC candidates that may work for you, so I will not describe them here. Your options vary depending on each GC's requirements for available CPUs and heap size (e.g. parallel GCs require two or more CPUs, some collectors work best with small heaps, others are suited to bigger heaps).

Test the candidates using your load test scenarios to see which one achieves the best possible performance while using as few resources as possible.
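As a small aid (my own addition, not part of the original setup), a snippet like the one below prints what the JVM actually sees and which collector it ended up with under a given container configuration, which is handy when comparing GC candidates across deployment configurations:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints the CPU count, max heap, and active GC as seen from inside the container,
// so each tested deployment configuration can be verified before running the load test.
public class JvmRuntimeReport {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("Available processors: " + rt.availableProcessors());
        System.out.println("Max heap (MiB): " + rt.maxMemory() / (1024 * 1024));
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // e.g. "G1 Young Generation" / "G1 Old Generation" when G1 is active
            System.out.println("GC: " + gc.getName());
        }
    }
}
```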

Final notes

  • special tasks require special tools. Design a solid test scenario and aim for simplicity, both in the code and in the way the test is run; you should be able to start the test at any time in just a few seconds! A sketch of such a scenario follows these notes.

Remember to run the test from within the cluster, especially when your Kubernetes cluster is hosted in the cloud; providers like AWS may charge extra for incoming traffic from external sources.

  • do not neglect the value of performance testing. It can be an invaluable tool for finding performance bottlenecks and deployment misconfigurations, and for building trust in the developed solution,
  • do not over-provision resources and replicas to cover your problems with money. When you suspect that your application is leaking memory or using CPU inefficiently, use your load tests to understand why it happens and seek a proper fix.
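To show what such a scenario can look like, here is a minimal spike-shaped load test sketch using the Gatling Java DSL (the tool choice, service address, endpoint, and numbers are my assumptions; the article does not prescribe a specific tool):

```java
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;
import java.time.Duration;

// Hypothetical spike scenario: a sudden burst of users followed by a short ramp,
// meant to be run from inside the cluster against the service's internal address.
public class SpikeLoadSimulation extends Simulation {

    HttpProtocolBuilder httpProtocol = http
            .baseUrl("http://orchestrator.internal:8080"); // assumed in-cluster address

    ScenarioBuilder spike = scenario("sudden spike")
            .exec(http("create order")
                    .post("/orders")
                    .body(StringBody("{\"clientId\":\"load-test\"}"))
                    .asJson()
                    .check(status().is(200)));

    {
        setUp(
                spike.injectOpen(
                        atOnceUsers(300), // the spike itself
                        rampUsersPerSec(50).to(200).during(Duration.ofMinutes(2))
                )
        ).protocols(httpProtocol);
    }
}
```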
