Friday, October 29, 2010

53734 NFSv4 ops/sec

Not bad.

[Screenshot: performance graph]

Still more than enough headroom left...

Friday, October 22, 2010

Adventures in Application Performance Management: Part II

Firing up AppDynamics in the browser shows a list of application agents, as well as the external systems called by our application.

The nice thing is that AppDynamics automatically detects calls to external systems, such as web services.

After grouping the agent and surrounding systems a bit, AppDynamics presents a nice dashboard with the most important information.

[Screenshot: AppDynamics application dashboard]

The large area shows the calls to other systems.

At the bottom, we see the load (calls/minute) and the average response time:

[Screenshot: calls per minute and average response time over time]

As you can see, the number of calls goes down while the response time goes up at the same time: a clear case of a bottleneck...

To find the reason for this, we look at the right side of the dashboard.

[Screenshot: request classification on the right side of the dashboard]

AppDynamics automatically classifies requests into categories (which can be adjusted). We can clearly see that we have 1.2% Stalls for this time period.

We can also see the top transactions by load and by response time.

[Screenshot: top transactions by load and by response time]

Transactions are auto-detected by AppDynamics and monitored automatically.

You can drill into suspicious transactions or look at stalled requests; AppDynamics automatically takes request snapshots, including call graphs.

Back to our performance problem...

After finally seeing how the application works internally, we found the reason for our performance problem within hours: a synchronized method. During peak hours, many requests had to wait for its lock. This not only caused a huge slow-down, but also produced lots of objects that had to be collected by the garbage collector, causing long application pauses.
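The pattern looked something like the following minimal sketch (hypothetical code, not our actual application; the `SimpleDateFormat` example and the `ThreadLocal` fix are illustrative stand-ins): a single synchronized method serializes every request thread, while per-thread state removes the lock entirely.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class SynchronizedBottleneck {

    // Before: a shared, non-thread-safe formatter guarded by a class-wide lock.
    // Under peak load, every request thread queues up here.
    private static final SimpleDateFormat SHARED = new SimpleDateFormat("yyyy-MM-dd");

    public static synchronized String formatContended(Date d) {
        return SHARED.format(d);
    }

    // After: one formatter instance per thread; no lock, no waiting.
    private static final ThreadLocal<SimpleDateFormat> PER_THREAD =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    public static String formatUncontended(Date d) {
        return PER_THREAD.get().format(d);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L);
        // Both variants produce the same result; only the locking differs.
        System.out.println(formatContended(epoch).equals(formatUncontended(epoch))); // true
    }
}
```

The point is not the formatter itself but the lock scope: a hot, class-wide `synchronized` method turns every concurrent request into a queue.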

After fixing this single method, our application has been faster than ever...

Adventures in Application Performance Management: Part I

Those who follow my blog know that I'm a Splunk addict, because I really like to know what my applications and systems are doing.

Although Splunk is my favorite tool in my toolbox (and will be in the future... :-), there are some blind spots it can't cover.

We had struggled with some serious performance problems in one of our core applications during peak hours.

The application is Java-based and usually performs well when everything is OK. But during peak hours, the response time got worse and worse, with the side effect of long major garbage collections. Long stop-the-world pauses are not very user friendly.

We looked at the problem from the top (log analysis, monitoring) to the bottom (GC logs, JProfiler), never really finding the root cause. The fact that the problem did not occur all the time did not make it any easier...

As the situation got worse over time, and adding even more hardware was not really a solution, we looked for external help.

Luckily for us, the company we asked for help brought in a brand-new, recently released Application Performance Management tool called AppDynamics.

Installing AppDynamics was a no-brainer. It took five minutes to install the server, and only a few more to install the agents. Just a JAR file needs to be loaded into the JVM as an agent; the agent only needs to know where it can find the AppDynamics server. Communication is done over HTTP.

Now we were ready to find the root cause.

Tuesday, October 19, 2010

Countdown to ONE BILLION FILES!

I was reading Ric Wheeler's "One Billion Files: Scalability Limits in Linux Filesystems". As a ZFS user, I was wondering how many files we store on one of our mail storage systems in a single zpool.

My colleague was kind enough to start a find on the system. Four days later, we got:

# find /export -type f | wc -l 
 811874848

Interesting. We are already close to one billion files in production.

The next step was to look at the average file size. Currently, 12.5 TB are referenced and the compression ratio is 1.77x. This results in an average (uncompressed) file size of ~30 kB.
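As a rough sanity check of that number (treating the referenced 12.5 TB as the compressed size, which the compression ratio multiplies back up to the logical size; the class name is just for illustration):

```java
public class AvgFileSize {
    public static void main(String[] args) {
        long files = 811_874_848L;      // from the find above
        double referenced = 12.5e12;    // 12.5 TB referenced (compressed), in bytes
        double ratio = 1.77;            // ZFS compression ratio
        double avgBytes = referenced * ratio / files;
        System.out.printf("~%.0f kB per file%n", avgBytes / 1000.0);  // ~27 kB
    }
}
```

That comes out just under 30 kB, matching the figure above.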

How long will it take to reach one billion files? 

On average, 60 mails per second are delivered to the storage system (over NFSv4!). Therefore, to get the missing 190 million files, we only need to wait a little more than a month.
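The arithmetic behind that estimate (class name illustrative):

```java
public class BillionCountdown {
    public static void main(String[] args) {
        long target = 1_000_000_000L;
        long current = 811_874_848L;    // files counted by the find above
        double mailsPerSecond = 60.0;   // average delivery rate
        double days = (target - current) / mailsPerSecond / 86_400.0;
        System.out.printf("%.1f days until one billion files%n", days);  // 36.3 days
    }
}
```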