My Adventures in Performance Tuning
Let’s Get Started!
In this post, I tell the story of how I tuned one of our own pages that was performing extremely badly under certain conditions. I will walk you through how I identified the main issue, what I considered doing, what I changes I made, what the results were, and what I might do next. Here it goes.
The Application Dashboard page was loading slowly for one of our customers. I was the one who worked to improve the performance of this page which is rendered by a Rails controller action, called ApplicationController#show. Now, the 99% case is a customer with 100 or fewer agents reporting for an application at any one time. The customer I’m looking at had thousands of agents reporting. All of a sudden, code tuned for 10 to 100 agents performed terribly. What an interesting challenge!
What’s Our Baseline To Improve On?
In New Relic, I found one of the customer’s Transaction Traces from the server-side:
Wow! 458 seconds! That’s a lot of time spent server-side. That’s so slow that even the browser has given up on us long before we’re done rendering the page. I’ve got to fix that.
Where’s the Time Going?
I noticed that 72% of the time spent was in Timeslice#find, the routine
that collects our time-series data. Thankfully, New Relic lets me drill
down to see what was going on in our Rails controller. When I did, I
saw two significant Timeslice#find nodes in the Transaction Trace.
What’s the Code Doing?
I saw two very slow database queries here. I clicked on the ‘View
Database Query’ button for both of the queries and I got access to their
backtraces. Here’s one of the backtraces:
Why’s the Code Doing That?
But what were we using this data for, I wondered? What were we trying
to achieve? I dug into the code and it was clear that we were querying
for this data in order to modify how the page is rendered. If the
application we’re monitoring is a Solr service, we render the page
differently. The code discovers this by asking the following question:
Of all the agents that reported for the time window the user is looking
at, are any configured to instrument a Solr service? This is a very
expensive query, especially when dealing with thousands of agents. So,
what could I do about it?
What Options Do I Have?
I considered the options:
- I was surprised to discover that we were inadvertently querying for the same information twice. Both backtraces have the same RealAgent#ids_in_time_window call in common. I could have simply cached the result to remove the second query. The main problem with caching is that it creates bumps for the user. It’s no good if your application is very fast 99% of the time, and 1% of the time you cache-miss and the application performs poorly.
- The query is so slow because we were limiting the scope of our query to active agents in the user’s time window. While very accurate, it can be very expensive. I could have traded correctness for speed and ignored the time window completely. I noticed that the third largest culprit in the Transaction Trace, the SQL – SELECT component, was performing a related query. Once I eliminated the first and second largest culprits I needed to resolve the SQL – SELECT. The SQL – SELECT component answered the following question: Are any of the agents configured to instrument a Solr service? I wanted to eliminate having to query for this too, if I can.
- The result of the ultimate question, ‘Are you a Solr application?’ was actually fairly static. Again, I could have cached the result. Applications typically don’t oscillate between being a Solr application and a typical application. Right now we’re recomputing this information per request. Instead, we could incrementally pay this cost and precompute this as new data comes in. It’s a trade-off but it will buy us a big performance boost! This is the solution I pursued.
Make the Code Change.
I investigated, and the part of the system that collects incoming
metric data from agents already maintains tags based on the metrics
being reported, so we already knew whether a customer’s application is
using Solr or not. I changed the code to refer to this metadata instead,
easy! At this point, it was important not to gloss over the trade-off,
though. Doing this means once an application indirectly declares itself
as being a Solr application, it never goes back to being a non-Solr
application. I accepted this was the trade-off in exchange for the
performance gains.
Measure the Improvement.
The page is now performing an order of a magnitude faster. Compare our
first sample of 468 seconds to the new sample of 43 seconds!
Where Do I Go From Here?
Even though I’ve dealt with the issue that our first Transaction Trace
gave us, there’s more I can do. Here’s how I think I might proceed:
- The RealAgent#find_by_sql needs to be dealt with. The majority of the GC execution is directly attributable to this call. We’re using Rails 2, so we can’t make use of ActiveRecord’s pluck. So there’s opportunity here for improvement. Even better, I think we can eliminate the need to look these up from the database entirely.
- UIBaseHelper.applications_dropdown appears to have some expensive sorting logic that’s causing the CPU to burn. I should take a look.
- It appears the SQL – SELECT component is determining whether the application uses a database, much in the same way as the Solr check was being made. I suspect this can be computed ahead of time as well.
- On the Application Dashboard page we display Traced Errors and Transaction Traces in separate portions of the page, which explains why TracedError#find and TransactionTrace#find both show up here. One option I have is to defer loading and rendering these two portions of the page by requesting them via AJAX.
Let’s Review!
In my performance tuning, I avoided the temptation to speculate and
made informed decisions based on real production data. Here, I started
out with a performance issue under specific circumstances that didn’t
reflect the normal performance characteristics of our website. Then,
using New Relic, I narrowed down the problem to the server-side and
pinpointed the bottleneck. After careful consideration, I decided on a
trade-off that subtly changes the behavior of our site by taking
advantage of pre-existing metadata in the system. This trade-off
resulted in an order of magnitude performance improvement, and after I
deployed the change I was able to verify the performance gain in
production. I feel satisfied and armed with my deep application
visibility provided by New Relic, and I went on to identify additional
areas for performance tuning.










Comments
Howard Lewis Ship replied on Mon, 2012/07/09 - 12:08pm
Eric Genesky replied on Mon, 2012/07/09 - 4:02pm
in response to:
Howard Lewis Ship