So much of life is about timing. Just ask David Davis. He was arrested after getting into a scuffle whilst having his hair cut:
Bad timing, right?
But that’s not really the kind of timing I’m talking about. When you’re performance tuning an application an understanding of timing is crucial to success – it can reveal truth that would otherwise remain masked. In this post I want to cover three topics:
- The different types of timing data you can collect, and the best way to use them,
- Absolute versus relative timing measures, and
- The effect of profiling method (instrumentation versus sampling) on the timing data you collect.
Let’s start off with the first…
Regardless of your processor architecture, operating system, or technology platform most (good) performance profiling software will use the most accurate timer supported by your hardware and OS. On x86 and x64 processors this is the Time Stamp Counter, but most other architectures have an equivalent.
From this timer it’s possible to derive a couple of important metrics of your app’s performance:
- CPU time – that is, the amount of time the processor(s) spend executing code in threads that are part of your process ONLY – i.e., exclusive of I/O, network activity (e.g., web service or web API calls), database calls, child process execution, etc.
- Wall clock time – the actual amount of time elapsed executing a particular piece of code, such as a method, including I/O, network activity, etc.
Different products might use slightly different terminology, or offer subtly differing flavours of these two metrics, but the underlying principles are the same. For this post I’ll show the examples using ANTS Performance Profiler but you’ll find that everything I say is also applicable to other performance tools, such as DotTrace, the Visual Studio Profiling Tools, and JProfiler, so hopefully you’ll find it useful.
The really simple sequence diagram below illustrates the differences between CPU time and wall clock time for executing a method called SomeMethod(), which we’ll assume is in a .NET app, that queries a SQL Server database.
The time spent actually executing code in SomeMethod() is represented by regions A and C. This is the CPU time for the method. The time spent executing code in SomeMethod() plus retrieving data from SQL Server is represented by regions A, B, and C. This represents the wall clock time – the total time elapsed whilst executing SomeMethod(). Note that, for simplicity’s sake:
- I’ve excluded any calls SomeMethod() might make to other methods in your code, into the .NET framework class libraries, or any other .NET libraries. Were they included these would form part of the CPU time measurement because this is all code executing on the same thread within your process.
- I’ve excluded network latency from the diagram, which would form part of the wall clock time measurement.
Most good performance profilers will allow you to switch between CPU and wall clock time. All the profilers I mentioned above support this. Here’s what the options look like in ANTS Performance Profiler; other products are similar:
There’s also the issue of time in method vs. time with children. Again the terminology varies a little by product but the basics are:
- Time in method represents the time spent executing only code within the method being profiled. It does not include callees (or child methods), or any time spent sleeping, suspended, or out of process (network, database, etc.). It follows from this that the absolute value of time in method will be the same regardless of whether you’re looking at CPU time, or wall clock time.
- Time with children includes time spent executing all callees (or child methods). When viewing wall clock time it also includes time spent sleeping, suspended, and out of process (network, database, etc.).
OK, let’s take a look at an example. Here’s a method call with CPU time selected:
And here’s the same method call with wall clock time selected:
Note how in both cases Time (ms), which represents time in method, is the same at 0.497ms, but that with wall clock time selected the time with children is over 40 seconds as opposed to less than half a second. We’ll take a look at why that is in a minute. For now all you need to understand is that this is time spent out of process, and it’s the kind of problem that can easily be masked if you look at only CPU time.
All right, so how do you know whether to look at CPU time or wall clock time? And are there situations where you might need to use both?
Many tools will give you some form of real-time performance data as you use them to profile your apps. ANTS Performance Profiler has the timeline; other tools have a “telemetry” view, which shows performance metrics. The key is to use this, along with what you know about the app to gain clues as to where to look for trouble.
The two screengrabs above are from a real example on the ASP.NET MVC back-end support systems for a large B2B ecommerce site. They relate to the user clicking on an invoice link from the customer order page. As you’d expect this takes the user to a page containing the invoice information, but the page load was around 45 seconds, which is obviously far too long.
Here’s what the timeline for that looked like in ANTS Performance Profiler:
(Note that I’ve bookmarked such a long time period not because the profiler adds that much overhead, but because somebody asked me a question whilst I was collecting the data so there was a delay before a clicked Stop Live Bookmark!)
As you can see, there’s very little CPU activity associated with the worker process running the site; just one small spike over to the left.
This tells you straight away that the time isn’t being spent on lots of CPU intensive activity in the website code. Look at this:
We’re viewing CPU time and there’s nothing particularly horrendous in the call tree. Sure, there’s probably some room for optimisation, but absolutely nothing that would account for the observed 45 second page load.
Switch to wall clock time and the picture changes:
Hmm, looks like the problem might be those two SQL queries, particularly the top one! Maybe we should optimise those*.
Do you see how looking at the “wrong” timing metric masked the problem? In reality you’ll want to use both metrics to see what each can reveal and you’ll quickly get to know which works best in different scenarios as you do more performance tuning.
By the way: for those of you working with Java, JProfiler has absolutely great database support with multiple providers for different RDBMSs. I would highly recommend you check it out.
You may have noticed that throughout the above examples I’ve been looking at absolute measurements of time, in this case milliseconds. Ticks and seconds are often also available, but many tools often offer relative measurements – generally percentages – in some cases as the default unit.
I find relative values often work well when looking at CPU time but that, generally, absolute values are a better bet for wall clock time. The reason for this is pretty simple: wall clock time includes sleeping, waiting, suspension, etc., and so often your biggest “bottleneck” can appear to be a single thread that mostly sleeps, or waits for a lock (e.g., the Waiting for synchronization item in the above screenshots). This will often be something like the GC thread and the problem is, without looking at absolute values, you’ve no real idea how significant the amounts of time spent in other call stacks really are. Switching to milliseconds or (for really gross problems – the above would qualify) seconds can really help.
Let’s talk about instrumentation versus sampling profiling and the effect this has on timings.
Instrumentation is the more traditional of the two methods. It actually modifies the running code to insert extra instructions that collect timing values throughout the code. For example, instructions will be inserted at the start and end of methods and, depending upon the level of detail selected, at branch points in the code, or at points which mark the boundaries between lines in the original source. Smarter profilers need only instrument branch points to accurately calculate line level timings and will therefore impose less overhead in use.
Back in the day this modification would be carried out on the source code, and this method may still be used with C++ applications. The code is modified as part of the preprocessor step. Alternatively it can be modified after compilation but before linking.
Nowadays, with VM languages, such as those that run in the JVM or the .NET CLR, the instrumentation is generally done at runtime just before the code is JITed. This has a big advantage: you don’t need a special build of your app in order to diagnose performance problems, which can be a major headache with older systems such as Purify.
Sampling is available in more modern tools and is a much lower overhead, albeit less detailed, method of collecting performance data. The way it works is that the profiler periodically takes a snapshot of the stack trace of every thread running in the application. It’ll generally do this many times a second – often up to 1,000 times per second. It can then combine the results from the different samples to work out where most time is spent in the application.
Obviously this is only good for method level timings. Moreover methods that execute very quickly often won’t appear in the results at all, or will have somewhat skewed timings (generally on the high side) if they do. Timings for all methods are necessarily relative and any absolute timings are estimates based on the number of samples containing each stack trace relative to the overall length of the selected time period.
Furthermore most tools cannot integrate ancillary data with sampling. For example, ANTS Performance Profiler will not give information about database calls, or HTTP requests, in sampling mode since this data is collected using instrumentation, which is how it is able to tell you – for example – exactly where queries were executed.
Despite these disadvantages, because of its low overhead, and because it doesn’t require modification of app code, sampling can often be used on a live process without the need for a restart before and after profiling, so can often be a good option for apps in production.
The effect of all of this on timing measurements if you’ve opted for sampling rather than instrumentation profiling is that the choice of wall clock time or CPU time becomes irrelevant. This is because whilst your profiler knows the call stack for each thread in every sample, it probably won’t know whether or not the thread was running (i.e., it could have been sleeping, suspended, etc.) – figuring this out could introduce unacceptable overhead whilst collecting data. As a result you’ll always be looking at wall clock time with sampling, rather than have the choice as you do with instrumentation.
Hopefully you’re now equipped to better understand and use the different kinds of timing data your performance profiler will show you. Please do feel free to chime in with questions or comments below – feedback is always much appreciated and if you need help I’d love to hear from you.
*Optimising SQL is beyond the scope of this post but I will cover it, using a similar example, in the future. For now I want to focus on the different timing metrics and what they mean to help you understand how to get the best out of your performance profiler. That being said, your tool might give you a handy hint so it’s not even as if you need to do that much thinking for yourself (but you’ll still look whip sharp in front of your colleagues)…
Just don’t let them get a good look at your screen!