Monday, 5 October 2009

Fighting with a JVM bug

Had a really annoying issue on one of the servers we host. We had GlassFish v2 b58g running behind a load balancer on Java 1.6.0_12. Everything seemed to work until I noticed in our monitoring system, Zabbix, that the connection was really unstable. In some cases the system was not accessible at all.
Using wget I could see the following behavior:

[user@jump ~]$ time wget --timeout=30 --tries=1 https://service.com/somesite
--08:30:07-- https://service.com/somesite
=> `service.3'
Resolving service.com... 199.44.48.112
Connecting to service.com|199.44.48.112|:443... connected.
Unable to establish SSL connection.

real 0m10.109s
user 0m0.009s
sys 0m0.025s
[user@jump ~]$

(some values have been changed :-)

The interesting thing was that it always took 10 seconds before the error happened.
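Since the failure was intermittent, a single request does not say much. A rough sketch of how the same wget check could be repeated while recording the elapsed time of each attempt (the function name probe and the output format are my own, not something from the original session):

```shell
#!/bin/sh
# probe URL COUNT: hypothetical helper that fetches URL COUNT times with wget
# and prints one line per attempt: attempt number, ok/FAIL, elapsed seconds.
probe() {
    url=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        start=$(date +%s)
        # Same timeout and retry settings as the manual check above.
        if wget --quiet --timeout=30 --tries=1 -O /dev/null "$url"; then
            status=ok
        else
            status=FAIL
        fi
        end=$(date +%s)
        echo "$i $status $((end - start))s"
        i=$((i + 1))
    done
}
```

Calling it as probe https://service.com/somesite 20 would print twenty lines, which makes it easy to see how often the handshake fails and whether the failures always take about the same time.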
The next step was to use tcpdump to identify what was happening. It showed that a lot of connections were closed with TCP RST. GlassFish seemed to throw an exception too, as it did not know why the connections died either. Anyway, after some googling I first found this ticket talking about a possible error.
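For reference, a minimal sketch of a tcpdump invocation that shows only reset packets for one host and port. The exact command I ran is not recorded here, so the wrapper name watch_rst, the interface "any" (Linux only) and the default port are assumptions:

```shell
#!/bin/sh
# watch_rst HOST [PORT]: hypothetical wrapper that prints only TCP segments
# with the RST flag set, to or from HOST on PORT (default 443).
# Needs root privileges; "-i any" captures on all interfaces on Linux.
watch_rst() {
    host=$1
    port=${2:-443}
    tcpdump -n -i any "host $host and tcp port $port and tcp[tcpflags] & tcp-rst != 0"
}
```

Running watch_rst 199.44.48.112 on the jump host while the wget check is going would show whether a reset arrives at the moment the SSL connection fails.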
It led me to Java bug 6403933, which is still not solved in the latest JREs. The problem did not exist in Java 5, which explains why I had not run into it before.

So, thanks to all my new knowledge, I thought that upgrading GlassFish to v2.1 b60g would help. And it did :-)