
No such thing as Nirvana – A classic – Reaching the maximum http alive connections


Case behind the story

Dirk, one of the guys on our team, comes to me on a Thursday afternoon, asking me why our system cannot sustain one million transactions over a specified unit of time.

Hmmm, actually this was only his second question, if I think about it. The more important question was:

“Alex, regardless of what I do, and no matter how many virtual users I use, I cannot get past a limit of 15 transactions per second. Any idea why?”

The first thing that pops into my mind is looking back at the performance results I delivered a couple of months ago. I could clearly remember marking a threshold of 170 transactions per second with about 500 concurrent virtual users in the system.

So for a second I think that something went wrong with our application, that maybe something this simple simply slipped past my test scenarios. So I turn to my JMeter tests, run a load test with a small user load configuration, check the output on the application server, and delightedly watch a throughput of more than 50 transactions per second, with a CPU usage of less than 30 % on the application server.

So where is the problem?

Well, basically we were talking about two different test harnesses. The one I was using, JMeter, was clearly able to achieve the desired throughput with acceptable response times.

The other one, an in-house developed test tool using the same approach (the same application workflow), was not.

Lesson one

Your role as a performance tester is to make sure a system “delivers”. “Errare humanum est”, goes an old Latin saying. And that applies to you, and to the developers programming the software.

However much trust has grown between you and the developers, keep in mind that they make mistakes too. It is your role to find them; it is your role to be the pessimist on the team.

Trust only what you can see: response times and throughput. They are the two most useful indicators before digging into any performance problem.

Starting point

So there I was, running my JMeter test and achieving 50 transactions per second with about 12 virtual users. So first of all, I assured our team that the application itself, never mind the application server, was perfectly able to cope with more than they expected.

Knowing that Dirk was using a self-written Perl script (a nice script, by the way) to remotely manage virtual users (basically running the same Java test application, using multiple servers across the network as agents), I tried running the test application, first locally.

I was quite surprised to see that the test application was consuming about 50 % of my computer’s CPU. Bearing in mind that I have a dual-core PC, I started a second client, only to see the second CPU being fully used as well, basically exhausting all of my CPU capacity…

Wow!!!

These were just two virtual users, basically sending some requests to the server with no wait times involved (the test application was built for throughput)… but that was no excuse for such bad performance.

[Figure: CPU usage as shown in VisualVM]

Time for sampling

Before I start digging into profiling, my rule is to always sample first. I covered the differences between sampling and profiling in an earlier post: http://qants.wordpress.com/2010/08/26/cpu-profiling-sampling-versus-instrumentation/

I use the open source tool VisualVM for monitoring, sampling, dumping and so on. The configuration I use takes samples every 20 milliseconds and refreshes the result set every two seconds. I do that for about 10 to 20 seconds, taking a snapshot at the end, in order to see the biggest consumers over that period of time.
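Conceptually, a sampling profiler does little more than capture thread stacks at a fixed interval and count which methods show up most often. Here is a minimal, purely illustrative sketch of that idea using the standard ThreadMXBean API; it samples in-process, unlike VisualVM, which attaches to the target JVM from the outside, and the class name and intervals are just placeholders matching the setup described above:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Poor man's CPU sampler: every 20 ms, record the top stack frame of every
// live thread and count how often each method appears. The methods with the
// highest counts are the likely CPU hotspots.
public class PoorMansSampler {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<String, Integer> hits = new HashMap<>();

        long end = System.currentTimeMillis() + 20_000; // sample for ~20 seconds
        while (System.currentTimeMillis() < end) {
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                StackTraceElement[] stack = info.getStackTrace();
                if (stack.length > 0) {
                    String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                    hits.merge(frame, 1, Integer::sum);
                }
            }
            Thread.sleep(20); // 20 ms sampling interval, as in the VisualVM setup above
        }

        // Print the ten most frequently sampled frames.
        hits.entrySet().stream()
            .sorted((a, b) -> b.getValue() - a.getValue())
            .limit(10)
            .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}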

So after identifying my biggest consumer and analyzing the stack trace, I was pointed to the following call, which was taking most of the CPU:

java.lang.Thread.isAlive()

Why would a call checking whether a thread is alive take so much of the CPU? I needed some invocation counts; I could not really believe that this could be the problem.

Time for profiling

As much as I like using VisualVM for sampling and other managing and monitoring activities, when it comes to profiling, try investing in a serious tool.

I use the YourKit Java Profiler; it is a great tool that points you to the right bottlenecks quite quickly. As profiling adds overhead to the application, you do not want to profile during a load test, but rather with a small load configuration.

So I profiled the application for about 30 seconds.

[Figure: Java CPU tracing with the YourKit Java Profiler]

Taking a snapshot at the end of the tracing session showed the following hotspots and their invocation counts:

[Figure: hot spots as reported by the YourKit Java Profiler]

Wow… there was a call with an invocation count of 18 million over 30 seconds, responsible for almost 50 % of the CPU usage. Something was being called far too often, and something was definitely wrong.

So the next step was to analyse the code, only to find a loop checking the aliveness of the thread for as long as the thread was running, something like:

while (true) {
    if (thread.isAlive()) {
        // do something
    }
}

Time for solving the problem

We easily changed the code to

thread.join()

and gave up actively checking the aliveness via thread.isAlive(). That gave us such a boost that the agents were consuming less than 5 % CPU on the machines hosting them.
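A minimal sketch of the change (the worker thread and its workload are illustrative placeholders, not the original test client code):

public class JoinInsteadOfPolling {
    public static void main(String[] args) throws InterruptedException {
        // Illustrative worker thread standing in for the real test client logic.
        Thread worker = new Thread(() -> {
            // ... send requests to the server, measure response times, etc. ...
        });
        worker.start();

        // Old approach: a tight polling loop calling isAlive() millions of times,
        // burning a whole CPU core while doing essentially nothing:
        // while (worker.isAlive()) { /* poll */ }

        // New approach: block until the worker terminates, consuming no CPU
        // while waiting.
        worker.join();
    }
}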

Time for re-test

There we were, building the test application with the latest improvements, running the test again on 8 remote servers, and hoping for a throughput increase that would at least bring us closer to the JMeter results.

We were surprised to see that, although the CPU consumption dramatically decreased, the number of transactions was not anywhere near the JMeter results. So we decided to start the analysis all over again.

Time for CPU Monitoring

Starting with a simple CPU monitor such as VisualVM usually gives useful and reliable results. So that’s what we did. We connected VisualVM to the GlassFish Application Server via JMX and ran a CPU monitoring session. The following pattern was the one that set us on the path to the next steps:

[Figure: monitoring CPU usage with VisualVM]

Although we were expecting a constant load on the server, the monitoring session revealed a pattern I called “batches of work”. The interesting point in this graph was actually the pause between the batches, which was almost always 30 seconds long. Usually a pattern like this points to either a constant in the application or a configuration parameter of the web or application server.

Since checking the code had not revealed any such constant, we were left with a web or application server parameter. But what could cause such a pattern? And even more, why was JMeter achieving the throughput under a constant load while we weren’t?

It was now clear to me that somewhere a queue was building up, holding our requests back and thereby limiting our throughput.

Switch your brain off

OK guys, so if JMeter can do it, why are we not able to? What is so different between us and JMeter?

Once more:

What is so different between our test application and JMeter?

Well, in this case we got there quite fast. The single most important difference was the HTTP library we were using compared to JMeter. But still, we were using mature libraries, so how could something like this turn into such a problem?

Well, it’s always the small things that count (according to my girlfriend :) ). Eventually it hit us to run a simple “netstat” in order to compare the connections our application was opening with JMeter’s.

Well, surprise. JMeter was easily reaching the desired throughput with no more than 10-15 open connections. We, by comparison, were steadily using 150, and we never crossed that border.

So now we had two constants:

The border of 150 open connections and the 30-second pause between batches of work.

We went our separate ways, only to return on Monday morning with different and amazing results.

Point one – my point

I remembered seeing some parameters in the HTTP listener of the GlassFish Application Server, limiting the number of open connections to 200 and the keep-alive timeout to 30 seconds.

As we were using two different test configurations, one running directly against the application server and the other one going through Apache and mod_jk, I needed to prove my theory by checking the default Apache configuration.

I was hardly surprised to see that Apache comes with a default “maximum open connections” of 150 and a “keep alive timeout” of 30 seconds.

So, there it was… we were using up all available connections, and then the server made us wait for 30 seconds, the time it needed to reclaim the idle keep-alive connections and take some more work from the queue…
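For reference, the relevant directives in an Apache 2.2-era httpd.conf look roughly like this; the values mirror what we found on our setup, and defaults vary by Apache version and distribution:

# httpd.conf (prefork MPM) -- illustrative values matching the behaviour described above

# Hard cap on the number of connections served simultaneously
<IfModule mpm_prefork_module>
    MaxClients 150
</IfModule>

# Hold idle persistent (keep-alive) connections open for 30 seconds
KeepAlive On
KeepAliveTimeout 30

Setting KeepAlive to Off, as we later did for the verification run described at the end, makes the server close each connection after the response, so exhausted keep-alive slots can no longer stall the queue for 30 seconds at a time.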

So my question was… why in the world were we using that many connections?

Point two – Dirk’s point

I guess Dirk’s eagerness was at least equal to mine, so I was actually expecting him on Monday morning with his analysis results :)

Unlike me, Dirk analysed the code, only to find that we were opening a new HTTP connection for every single request we were sending. A simple TCPView session while running the test revealed what we had feared:

[Figure: open connections shown in TCPView]
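For contrast, here is a minimal sketch of what connection reuse can look like with plain java.net.HttpURLConnection; the endpoint URL is hypothetical and this is not the original test client’s code. As long as each response is fully read and the connection is not explicitly disconnected, the JVM’s keep-alive cache can reuse the underlying TCP connection instead of opening a new socket per request:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ReusingConnections {
    public static void main(String[] args) throws Exception {
        // Hypothetical test endpoint; in the real setup this would be the
        // application URL behind Apache / mod_jk.
        URL url = new URL("http://localhost:8080/app/ping");

        for (int i = 0; i < 100; i++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            // Drain the response completely so the socket can be returned to
            // the keep-alive pool and reused by the next iteration.
            try (InputStream in = conn.getInputStream()) {
                byte[] buffer = new byte[4096];
                while (in.read(buffer) != -1) {
                    // discard the payload; we only care about connection reuse
                }
            }
            // Note: no conn.disconnect() here -- that would close the socket
            // and defeat keep-alive reuse.
        }
    }
}

Watching a loop like this with netstat or TCPView should show a handful of ESTABLISHED connections being reused, rather than one new connection per request.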

No such thing as Nirvana

This was an amazing example of teamwork, drilling down, and problem solving. I am still amazed to this day that we managed to identify and fix this problem in less than two days, working through somebody else’s code.

It showed me once more that there is no such thing as “out of nowhere” in an environment as deterministic as software.

It was a perfect opportunity to put most of our performance tools and skills into action, from CPU monitoring, sampling and profiling to detecting network limitations, analyzing network usage, and so on.

By the end of the day, we had successfully proved our theory by setting the keep-alive parameter on the Apache server to off and re-running the tests.

By the end of the day, Lars, a great developer on our team, had jumped in to solve our HTTP connection problem.

By the end of the day we were able to achieve 40 transactions per second, and that was only because I did not have access to Dirk’s Perl script.

Tomorrow will be a great day for science

