Category Archives: Profiling

Current talk list 2016: web and database performance

It’s that time of the year where, for me, talk proposals are submitted. I also tend to take it as an opportunity to refresh and rework talks.

This year I’ve submitted talks for DDD, DDD North, and NDC London (this one’s a bit of a long shot), and am keeping my eye out for other opportunities. I’ll also be giving talks at the Derbyshire .NET User Group, and DDD Nights in Cambridge in the autumn.

Voting for both DDD and DDD North is now open so, if you’re interested in any of the talks I’ve listed below, please do vote for them at the following links:

Here are my talks. If you’d like me to give any of them at a user group, meetup, or conference you run, please do get in touch.

Talk Title: How to speed up .NET and SQL Server web apps

Performance is a critical aspect of modern web applications. Recent developments in hardware, software, infrastructure, bandwidth, and connectivity have raised expectations about how the web should perform.

Increasingly this attitude is applied to internal line of business apps, and niche sites, as much as to large public-facing sites. Google even bases your search ranking in part on how well your site performs. Being slow is no longer an option.

Unfortunately, problems can occur at all layers and in all components of an application: database, back-end code, systems integrations, local and third party services, infrastructure, and even – increasingly – the client.

Complex apps often have problems in multiple areas. How do you go about tracking them down and fixing them? Where do you begin?

The answer is you deploy the right tools and techniques. The good news is that generally you can do this without changing your development process. Using a number of case studies I’m going to show you how to track down and fix performance issues. We’ll talk about the tools I used to find them, and the fixes that resulted.

That being said, prevention is better than cure, so I’ll also talk about how you can go about catching problems before they make it to production, and monitor to get earlier notification of trouble brewing.

By the end you should have a plethora of tools and techniques at your disposal that you can use in any performance analysis situation that might confront you.

Talk Title: Premature promotion produces poor performance: memory management in the CLR and JavaScript runtimes

The CLR, JVM, and well-known JavaScript runtimes provide automatic memory management with garbage collection. Developers are encouraged to write their code and forget about memory management entirely. But whilst ignorance is bliss, it can also lead to a host of problems further down the line.

With web applications becoming ever more interactive, and the meteoric rise in popularity of mobile browsers, the kind of performance and resource usage issues that once only concerned back-end developers have now become common currency on the client as well.

In this session we’ll look at how these runtimes manage memory and how you can get the best out of them. We’ll discuss the “classic” blunders that can trip you up, and how you can avoid them. We’ll also look at the tools that can help you if and when you do run into trouble, both on the client and the server.

You should come away from this session with a good understanding of managed memory, particularly as it relates to the CLR and JavaScript, and how you can write code that works with the runtimes rather than against them.

Talk Title: Optimizing client-side performance in interactive web applications

Web applications are becoming increasingly interactive. As a result, more code is shifting to the client, and JavaScript performance has become a key factor for many web applications, both on desktop and mobile. Just look at this still ongoing discussion kicked off by Jeff Atwood’s “The State of JavaScript on Android in 2015 is… poor” post:

Devices nowadays offer a wide variety of form factors and capabilities. On top of this, connectivity – whilst widely available across many markets – varies considerably in quality and speed. This presents a huge challenge to anyone who wants to offer a great user experience across the board, along with a need to carefully consider what actually constitutes “the board”.

In this session I’m going to show you how to optimize the client experience. We’ll take an in depth look at Chrome Dev Tools, and how the suite of debugging, data collection and diagnostic tools it provides can help you diagnose and fix performance issues on the desktop and Android mobile devices. We’ll also take a look at using Safari to analyse and debug web applications running on iOS.

Throughout I’ll use examples from to illustrate. is an HTML5, JavaScript, and CSS games site. Currently it hosts a version of Star Castle, called Star Citadel, but I’m also working on versions of Asteroids (Space Rawks!), and Space Invaders (yet to find an even close to decent name). It supports both desktop and mobile play. Whilst this site hosts games the topics I cover will be relevant for any web app featuring a high level of interactivity on the client.

Talk Title: Complex objects and microORMs: an introduction to the Dapper.SimpleLoad and Dapper.SimpleSave extensions for StackExchange’s Dapper microORM

Dapper ( is a popular microORM for the .NET framework that provides simple way to map database rows to objects. It’s a great alternative when speed is of the essence, and when you just don’t need the functionality offered by EF.

But what happens when you want to do something a bit more complicated? What happens if you want to join across multiple tables into a hierarchy composed of different types of object? Well,then you can use Dapper’s multi-mapping functionality… but that can quickly turn into an awful lot of code to maintain, especially if you make heavy use of Dapper.

Step in Dapper.SimpleLoad (, which handles the multi-mapping code for you and, if you want it to, the SQL generation as well.

So far so good, but what happens when you want to save your objects back to the database?

With Dapper it’s pretty easy to write an INSERT, UPDATE, or DELETE statement and pass in your object as the parameter source. But if you’ve got a complex object this, again, can quickly turn into a lot of code.

Step in Dapper.SimpleSave (, which you can use to save changes to complex objects without the need to worry about saving each object individually. And, again, you don’t need to write any SQL.

I’ll give you a good overview of both Dapper.SimpleLoad and Dapper.SimpleSave, with a liberal serving of examples. I’ll also explain their benefits and drawbacks, and when you might consider using them in preference to more heavyweight options, such as EF.

Timing is everything in the performance tuning game: learn to choose the right metrics to hunt down bottlenecks

So much of life is about timing. Just ask David Davis. He was arrested after getting into a scuffle whilst having his hair cut:

David Davis with half a haircut in his police mugshot.

Bad timing, right?

But that’s not really the kind of timing I’m talking about. When you’re performance tuning an application an understanding of timing is crucial to success – it can reveal truth that would otherwise remain masked. In this post I want to cover three topics:

  • The different types of timing data you can collect, and the best way to use them,
  • Absolute versus relative timing measures, and
  • The effect of profiling method (instrumentation versus sampling) on the timing data you collect.

Let’s start off with the first…

Regardless of your processor architecture, operating system, or technology platform most (good) performance profiling software will use the most accurate timer supported by your hardware and OS. On x86 and x64 processors this is the Time Stamp Counter, but most other architectures have an equivalent.

From this timer it’s possible to derive a couple of important metrics of your app’s performance:

  • CPU time – that is, the amount of time the processor(s) spend executing code in threads that are part of your process ONLY – i.e., exclusive of I/O, network activity (e.g., web service or web API calls), database calls, child process execution, etc.
  • Wall clock time – the actual amount of time elapsed executing a particular piece of code, such as a method, including I/O, network activity, etc.

Different products might use slightly different terminology, or offer subtly differing flavours of these two metrics, but the underlying principles are the same. For this post I’ll show the examples using ANTS Performance Profiler but you’ll find that everything I say is also applicable to other performance tools, such as DotTrace, the Visual Studio Profiling Tools, and JProfiler, so hopefully you’ll find it useful.

The really simple sequence diagram below illustrates the differences between CPU time and wall clock time for executing a method called SomeMethod(), which we’ll assume is in a .NET app, that queries a SQL Server database.

Sequence diagram illustrating the difference between wall clock and CPU time.

The time spent actually executing code in SomeMethod() is represented by regions A and C. This is the CPU time for the method. The time spent executing code in SomeMethod() plus retrieving data from SQL Server is represented by regions A, B, and C. This represents the wall clock time – the total time elapsed whilst executing SomeMethod(). Note that, for simplicity’s sake:

  • I’ve excluded any calls SomeMethod() might make to other methods in your code, into the .NET framework class libraries, or any other .NET libraries. Were they included these would form part of the CPU time measurement because this is all code executing on the same thread within your process.
  • I’ve excluded network latency from the diagram, which would form part of the wall clock time measurement.

Most good performance profilers will allow you to switch between CPU and wall clock time. All the profilers I mentioned above support this. Here’s what the options look like in ANTS Performance Profiler; other products are similar:

Timing options in Red Gate's ANTS Performance Profiler

There’s also the issue of time in method vs. time with children. Again the terminology varies a little by product but the basics are:

  • Time in method represents the time spent executing only code within the method being profiled. It does not include callees (or child methods), or any time spent sleeping, suspended, or out of process (network, database, etc.). It follows from this that the absolute value of time in method will be the same regardless of whether you’re looking at CPU time, or wall clock time.
  • Time with children includes time spent executing all callees (or child methods). When viewing wall clock time it also includes time spent sleeping, suspended, and out of process (network, database, etc.).

OK, let’s take a look at an example. Here’s a method call with CPU time selected:

CPU times for method

And here’s the same method call with wall clock time selected:

Wall clock times for method

Note how in both cases Time (ms), which represents time in method, is the same at 0.497ms, but that with wall clock time selected the time with children is over 40 seconds as opposed to less than half a second. We’ll take a look at why that is in a minute. For now all you need to understand is that this is time spent out of process, and it’s the kind of problem that can easily be masked if you look at only CPU time.

All right, so how do you know whether to look at CPU time or wall clock time? And are there situations where you might need to use both?

Many tools will give you some form of real-time performance data as you use them to profile your apps. ANTS Performance Profiler has the timeline; other tools have a “telemetry” view, which shows performance metrics. The key is to use this, along with what you know about the app to gain clues as to where to look for trouble.

The two screengrabs above are from a real example on the ASP.NET MVC back-end support systems for a large B2B ecommerce site. They relate to the user clicking on an invoice link from the customer order page. As you’d expect this takes the user to a page containing the invoice information, but the page load was around 45 seconds, which is obviously far too long.

Here’s what the timeline for that looked like in ANTS Performance Profiler:

ANTS Performance Profiler timeline for navigating from order to invoice page on internal support site.

(Note that I’ve bookmarked such a long time period not because the profiler adds that much overhead, but because somebody asked me a question whilst I was collecting the data so there was a delay before a clicked Stop Live Bookmark!)

As you can see, there’s very little CPU activity associated with the worker process running the site; just one small spike over to the left.

This tells you straight away that the time isn’t being spent on lots of CPU intensive activity in the website code. Look at this:

Call tree viewing CPU time - doesn't look like there's much amiss.

We’re viewing CPU time and there’s nothing particularly horrendous in the call tree. Sure, there’s probably some room for optimisation, but absolutely nothing that would account for the observed 45 second page load.

Switch to wall clock time and the picture changes:

Call graph looking at wall clock time - now we're getting somewhere!

Hmm, looks like the problem might be those two SQL queries, particularly the top one! Maybe we should optimise those*.

Do you see how looking at the “wrong” timing metric masked the problem? In reality you’ll want to use both metrics to see what each can reveal and you’ll quickly get to know which works best in different scenarios as you do more performance tuning.

By the way: for those of you working with Java, JProfiler has absolutely great database support with multiple providers for different RDBMSs. I would highly recommend you check it out.

You may have noticed that throughout the above examples I’ve been looking at absolute measurements of time, in this case milliseconds. Ticks and seconds are often also available, but many tools often offer relative measurements – generally percentages – in some cases as the default unit.

I find relative values often work well when looking at CPU time but that, generally, absolute values are a better bet for wall clock time. The reason for this is pretty simple: wall clock time includes sleeping, waiting, suspension, etc., and so often your biggest “bottleneck” can appear to be a single thread that mostly sleeps, or waits for a lock (e.g., the Waiting for synchronization item in the above screenshots). This will often be something like the GC thread and the problem is, without looking at absolute values, you’ve no real idea how significant the amounts of time spent in other call stacks really are. Switching to milliseconds or (for really gross problems – the above would qualify) seconds can really help.

Let’s talk about instrumentation versus sampling profiling and the effect this has on timings.

Instrumentation is the more traditional of the two methods. It actually modifies the running code to insert extra instructions that collect timing values throughout the code. For example, instructions will be inserted at the start and end of methods and, depending upon the level of detail selected, at branch points in the code, or at points which mark the boundaries between lines in the original source. Smarter profilers need only instrument branch points to accurately calculate line level timings and will therefore impose less overhead in use.

Back in the day this modification would be carried out on the source code, and this method may still be used with C++ applications. The code is modified as part of the preprocessor step. Alternatively it can be modified after compilation but before linking.

Nowadays, with VM languages, such as those that run in the JVM or the .NET CLR, the instrumentation is generally done at runtime just before the code is JITed. This has a big advantage: you don’t need a special build of your app in order to diagnose performance problems, which can be a major headache with older systems such as Purify.

Sampling is available in more modern tools and is a much lower overhead, albeit less detailed, method of collecting performance data. The way it works is that the profiler periodically takes a snapshot of the stack trace of every thread running in the application. It’ll generally do this many times a second – often up to 1,000 times per second. It can then combine the results from the different samples to work out where most time is spent in the application.

Obviously this is only good for method level timings. Moreover methods that execute very quickly often won’t appear in the results at all, or will have somewhat skewed timings (generally on the high side) if they do. Timings for all methods are necessarily relative and any absolute timings are estimates based on the number of samples containing each stack trace relative to the overall length of the selected time period.

Furthermore most tools cannot integrate ancillary data with sampling. For example, ANTS Performance Profiler will not give information about database calls, or HTTP requests, in sampling mode since this data is collected using instrumentation, which is how it is able to tell you – for example – exactly where queries were executed.

Despite these disadvantages, because of its low overhead, and because it doesn’t require modification of app code, sampling can often be used on a live process without the need for a restart before and after profiling, so can often be a good option for apps in production.

The effect of all of this on timing measurements if you’ve opted for sampling rather than instrumentation profiling is that the choice of wall clock time or CPU time becomes irrelevant. This is because whilst your profiler knows the call stack for each thread in every sample, it probably won’t know whether or not the thread was running (i.e., it could have been sleeping, suspended, etc.) – figuring this out could introduce unacceptable overhead whilst collecting data. As a result you’ll always be looking at wall clock time with sampling, rather than have the choice as you do with instrumentation.

Hopefully you’re now equipped to better understand and use the different kinds of timing data your performance profiler will show you. Please do feel free to chime in with questions or comments below – feedback is always much appreciated and if you need help I’d love to hear from you.

*Optimising SQL is beyond the scope of this post but I will cover it, using a similar example, in the future. For now I want to focus on the different timing metrics and what they mean to help you understand how to get the best out of your performance profiler. That being said, your tool might give you a handy hint so it’s not even as if you need to do that much thinking for yourself (but you’ll still look whip sharp in front of your colleagues)…

ANTS Performance Profiler hinting that the problem may be SQL-related.

Just don’t let them get a good look at your screen!