J2EE – WELD vs EJB injection in JSF Managed Beans – Getting to the bottom
This is an interesting story of how an application server upgrade, combined with a switch to new technologies designed to make programming easier, turned into a real production problem. A classic story of how things that used to work stopped working after the upgrade, and, once again, of a typical, shorter way to identify the problem.
Stage 1 – Using direct EJB injection in JSF managed beans – Glassfish 2.1
Before upgrading to Glassfish 3.0.1, we encountered all sorts of performance problems, but pretty much all of them were in our own application, so we could fix them in our own realm instead of waiting for a patch or an official component update.
The last problem I remember having before the upgrade was that we did not clear the entity manager once our entities were no longer used. That led to a steady increase in heap usage, and eventually to a classic OutOfMemoryError. So I knew I had to restart the application server periodically to keep the problem at bay. The trouble with a memory leak is that it can mask other memory leaks, which you will only detect once you get rid of the one on the surface.
Well, before we had even concluded the entity manager problem, we decided to upgrade to Glassfish 3.0.1. One of the reasons for doing so was the integration of WELD, which basically made EJB injection much easier. Little did we know where we were heading.
Stage 2 – Using WELD for CDI (Contexts and Dependency Injection) – Glassfish 3.0.1
After upgrading to Glassfish 3.0.1, it took us less than a week or two to discover that the biggest memory leak was caused by not clearing the entity manager. We had always known who was occupying the heap space (of course it was us), we just had not known why.
Once we had the problem cleared, we were once again ready to conquer the world. Time for a regression test, with a relatively high load configuration (three JMeter agents, running 50 virtual users each, producing some 100–120 business transactions per second).
As mentioned in a previous post, there is only one king when it comes to performance testing: RESPONSE TIMES.
So I decided to give it a go. I tested on the same application server configuration (heap, thread pool, garbage collection strategy), only with a newer version of the server (3.0.1 instead of 2.1) and a new version of our application, with the entity manager problem solved.
After running some baseline tests and gradually increasing the load, I came up with the following results:
The first thing that draws attention here is the decreasing number of transactions per second (blue lines), in almost exact inverse proportion to the increasing response times (the colored lines). Something was wrong; I needed to see what and why.
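That inverse proportion is no coincidence. In a closed load test with a fixed number of virtual users, Little's law ties the quantities together (here N would be the 150 virtual users from the three JMeter agents, ignoring think time):

```latex
N = X \cdot R
\quad\Longrightarrow\quad
X = \frac{N}{R}
```

With the number of users N fixed, throughput X can only drop when response time R rises, which is exactly the pattern in the chart.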
Now, let’s start the analysis.
Was it the system (context switching, I/O, network, thread locking) or was it the application (garbage collection, code)? This should always be your starting point. It takes no more than a “top” command to see where the CPU is spending its time: a high user-space percentage (%us) points at the application, a high system percentage (%sy) at the kernel.
In my case, it was spending its time in user space, so I could rule out the thread locking and context switching scenarios.
So, the dominating consumer was in the user space. What next?
Well, the garbage collection logs tell SO much about the behavior of the JVM running your application. So I decided to take a look at the log, with the following results:
One can see that after the test had finished, the next full garbage collection was not able to free any more resources. Even if we are talking about loitering objects with very long lifetimes, a full garbage collection after all the users have logged out should free up any used space, bringing heap usage back down to roughly where it was when the test started.
The blue lines represent the young-generation collections while running the test, whereas the yellow and pink ones represent full garbage collections.
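For the record, GC logging of this kind was typically enabled on the HotSpot JVMs of that era (Java 6) with flags along these lines (the log path is, of course, up to you):

```
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Xloggc:/path/to/gc.log
```

The resulting log can then be fed into a visualizer or simply read by hand, looking at the heap occupancy after each full collection.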
So this was a clear sign that somehow, somewhere, there was a memory leak. BUT, at this moment, my biggest question was not why there was a memory leak, but how a memory leak could degrade response times and throughput in such a dramatic way.
I needed to know who was hanging on to those objects, and how that was affecting the overall performance of the application.
One more thing I noticed was that, over time, the CPU usage was also constantly increasing. I mean, a memory leak is one thing, but why in the world would the CPU usage increase?
So basically I had the following information:
- Decreasing throughput combined with increasing response times
- Memory leak
- Increasing CPU usage as the load test progressed
Did I know anything about WELD? NO!
Did I have to know anything about WELD? NO!
I needed to stick to the facts and see the real problems, independent of the application or technology used. There is no such thing as NIRVANA, remember?
I needed to sample / profile
Sampling / Profiling
I repeated the test with the same configuration and, as usual in my analysis sessions, I started sampling with a period of 20 ms and a refresh rate of two seconds.
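Conceptually, a sampler does something very simple: it periodically snapshots all thread stacks and tallies the topmost frames, so that “hot” methods surface as the most frequent entries. A minimal, hypothetical sketch (real profilers such as VisualVM are far more sophisticated, and the names below are my own):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of sampling, NOT the tool used in this post:
// take periodic snapshots of all thread stacks and count top-of-stack frames.
public class TinySampler {

    // Take `samples` snapshots, `intervalMs` apart, and tally the topmost frames.
    public static Map<String, Integer> sample(int samples, long intervalMs) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                if (stack.length > 0) {
                    String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                    hits.merge(top, 1, Integer::sum);
                }
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample 10 times, 2 ms apart, and print how often each frame was on top.
        sample(10, 2).forEach((frame, count) -> System.out.println(count + "  " + frame));
    }
}
```

Methods that keep showing up at the top of stacks across many samples are, statistically, where the CPU time goes.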
After a while, I could see that most of the time was consumed operating on hash maps. Hmmm. I know that hash maps get pretty expensive once they grow too big, but … why were they growing big? The sampling session returned the following consumers:
Wow… that was a lot of time consumed searching the hash maps! Taking a snapshot revealed the following hotspots:
So there it was, clear as the dawn of a summer day: weld.servlet.BeanProvider.httpSessionManager
It was now clear to me that every session that was created and accessed the Weld bean manager was holding references and not letting go, even after the users logged out, even after the server session timeout.
I did not need to restart the test, I just needed to take a look at the heap histogram. If my theory was right, there should still be one session left for each and every one of the virtual users I had used.
Analyzing the Heap Histogram
You can easily obtain a histogram of the current heap by calling the “jmap” utility, part of the JDK. The syntax is:
jmap -histo java_pid
In order to prove my theory, I grepped for EJB31:
jmap -histo java_pid | grep EJB31
150 entries of applicationname.__EJB31_Generated__UserSession__Intf____Bean__
150 entries of applicationname.__EJB31_Generated__WebSession__Intf____Bean__
(I replaced the internal names with dummy names.)
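The pattern behind those 150 leftover entries can be sketched in a few lines. This is a hypothetical simulation of the failure mode, not Weld's actual implementation: a global registry (analogous to the Weld session/bean manager) keeps a strong reference per session and never removes it on logout, so a full GC can never reclaim the beans; with direct injection, only the session itself references the bean.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the leak pattern, not Weld's real code:
// a global map pins one bean per session forever.
public class ProxyLeakSketch {

    // Global registry, analogous to weld.servlet.BeanProvider.httpSessionManager.
    public static final Map<String, Object> proxyRegistry = new HashMap<>();

    // "Proxied" style: the registry keeps a strong reference that is
    // never removed on logout, so the bean can never be collected.
    public static Object createProxiedBean(String sessionId) {
        Object bean = new Object();
        proxyRegistry.put(sessionId, bean);
        return bean;
    }

    // "Direct" style: only the caller (the session itself) holds the
    // reference, so logging out makes the bean collectible.
    public static Object createDirectBean(String sessionId) {
        return new Object();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 150; i++) {
            createProxiedBean("session-" + i);
        }
        // All 150 virtual users have "logged out", yet the registry still
        // pins 150 beans -- the same pattern the jmap histogram showed.
        System.out.println(proxyRegistry.size()); // prints 150
    }
}
```

Replace 150 with the number of virtual users in the test and you get exactly the histogram counts above.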
Digging for the problem
Well, the disadvantage of working with new components and technologies is that, most of the time, you are a pioneer. And when you combine that with the newest application server, well, you will surely get into trouble. I decided to dig into the problem and see what we could do about it. I will not go into much detail. The important thing is that the bug had already been reported in WELD, and was caused by the WELD proxies, which were leaking memory. Further information in the links below.
https://jira.jboss.org/browse/WELD-483 – Client proxies are using the wrong CreationalContexts sometimes
https://jira.jboss.org/browse/JASSIST-97 – Javassist doesn’t serialize the handler along with the proxy
http://forums.java.net/jive/thread.jspa?messageID=480423&tstart=0 – 90 % of memory was used by org.jboss.weld.* instances
https://jira.jboss.org/browse/WELD-570 – Partial export of javassist from Weld breaks other javassist clients
http://forums.java.net/jive/thread.jspa?messageID=479092 – glassfish v3.01 CDI with jersey – memory leak ?
http://markmail.org/search/?q=list%3Anet.java.dev.jersey.users#query:list%3Anet.java.dev.jersey.users+page:2+mid:mivoffy5wawwdcev+state:results – memory leak with cdi
Dealing with the problem
Now that we knew for sure that this was a component problem (WELD in our case), Lars thought of going back to plain old JSF for EJB injection, thereby bypassing the WELD component altogether. So yesterday he came to me with a redesigned version of our application, completely WELD free.
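For context, the difference between the two injection styles is tiny in code but large in runtime behavior. A container-dependent sketch with hypothetical bean names (it needs the javax.* Java EE 6 APIs and a container, so it is not standalone-runnable): the CDI variant hands you a Weld client proxy, while the plain JSF variant lets the container inject the EJB reference directly into the managed bean.

```java
// CDI / WELD style (what we moved away from) -- the injected reference
// is a Weld client proxy, and it was these proxies that leaked:
//
// @Named
// @SessionScoped                  // javax.enterprise.context.SessionScoped
// public class WebSession {
//     @Inject                     // javax.inject.Inject
//     private UserService userService;
// }

// Plain JSF + EJB style (the WELD-free redesign) -- direct container
// injection, no Weld proxy in between. Names are hypothetical.
@ManagedBean                       // javax.faces.bean.ManagedBean (JSF 2.0)
@SessionScoped                     // javax.faces.bean.SessionScoped
public class WebSession {
    @EJB                           // javax.ejb.EJB
    private UserService userService;
}
```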
These were the results (with pretty much the same load configuration):
Visual VM Monitoring of CPU and Heap Usage
And this was the memory footprint after analyzing the garbage collection log:
Just great. After the users log out, the next full garbage collection frees up the space completely!
Again, a typical case where one does not need to know much about the application. One must stick to what one can see: CPU, heap usage, network, I/O, consumers.
As for us, we will probably wait for an update of Glassfish to release 3.1, where the WELD problem has been reported as fixed.
Tomorrow will be a great day for science, again.