Category Archives: Performance

Current talk list 2016: web and database performance

It’s that time of the year when talk proposals are submitted. I tend to take it as an opportunity to refresh and rework my talks.

This year I’ve submitted talks for DDD, DDD North, and NDC London (this one’s a bit of a long shot), and am keeping my eye out for other opportunities. I’ll also be giving talks at the Derbyshire .NET User Group, and DDD Nights in Cambridge in the autumn.

Voting for both DDD and DDD North is now open so, if you’re interested in any of the talks I’ve listed below, please do vote for them at the following links:

Here are my talks. If you’d like me to give any of them at a user group, meetup, or conference you run, please do get in touch.

Talk Title: How to speed up .NET and SQL Server web apps

Performance is a critical aspect of modern web applications. Recent developments in hardware, software, infrastructure, bandwidth, and connectivity have raised expectations about how the web should perform.

Increasingly this attitude applies as much to internal line-of-business apps and niche sites as to large public-facing sites. Google even bases your search ranking in part on how well your site performs. Being slow is no longer an option.

Unfortunately, problems can occur at all layers and in all components of an application: database, back-end code, systems integrations, local and third party services, infrastructure, and even – increasingly – the client.

Complex apps often have problems in multiple areas. How do you go about tracking them down and fixing them? Where do you begin?

The answer is you deploy the right tools and techniques. The good news is that generally you can do this without changing your development process. Using a number of case studies I’m going to show you how to track down and fix performance issues. We’ll talk about the tools I used to find them, and the fixes that resulted.

That being said, prevention is better than cure, so I’ll also talk about how you can go about catching problems before they make it to production, and monitor to get earlier notification of trouble brewing.

By the end you should have a plethora of tools and techniques at your disposal that you can use in any performance analysis situation that might confront you.

Talk Title: Premature promotion produces poor performance: memory management in the CLR and JavaScript runtimes

The CLR, JVM, and well-known JavaScript runtimes provide automatic memory management with garbage collection. Developers are encouraged to write their code and forget about memory management entirely. But whilst ignorance is bliss, it can also lead to a host of problems further down the line.

With web applications becoming ever more interactive, and the meteoric rise in popularity of mobile browsers, the kind of performance and resource usage issues that once only concerned back-end developers have now become common currency on the client as well.

In this session we’ll look at how these runtimes manage memory and how you can get the best out of them. We’ll discuss the “classic” blunders that can trip you up, and how you can avoid them. We’ll also look at the tools that can help you if and when you do run into trouble, both on the client and the server.

You should come away from this session with a good understanding of managed memory, particularly as it relates to the CLR and JavaScript, and how you can write code that works with the runtimes rather than against them.

Talk Title: Optimizing client-side performance in interactive web applications

Web applications are becoming increasingly interactive. As a result, more code is shifting to the client, and JavaScript performance has become a key factor for many web applications, both on desktop and mobile. Just look at this still-ongoing discussion kicked off by Jeff Atwood’s “The State of JavaScript on Android in 2015 is… poor” post: https://meta.discourse.org/t/the-state-of-javascript-on-android-in-2015-is-poor/33889/240.

Devices nowadays offer a wide variety of form factors and capabilities. On top of this, connectivity – whilst widely available across many markets – varies considerably in quality and speed. This presents a huge challenge to anyone who wants to offer a great user experience across the board, along with a need to carefully consider what actually constitutes “the board”.

In this session I’m going to show you how to optimize the client experience. We’ll take an in depth look at Chrome Dev Tools, and how the suite of debugging, data collection and diagnostic tools it provides can help you diagnose and fix performance issues on the desktop and Android mobile devices. We’ll also take a look at using Safari to analyse and debug web applications running on iOS.

Throughout I’ll use examples from https://arcade.ly to illustrate. Arcade.ly is an HTML5, JavaScript, and CSS games site. Currently it hosts a version of Star Castle, called Star Citadel, but I’m also working on versions of Asteroids (Space Rawks!) and Space Invaders (which has yet to find an even halfway decent name). It supports both desktop and mobile play. Whilst the site hosts games, the topics I cover will be relevant to any web app featuring a high level of interactivity on the client.

Talk Title: Complex objects and microORMs: an introduction to the Dapper.SimpleLoad and Dapper.SimpleSave extensions for StackExchange’s Dapper microORM

Dapper (https://github.com/StackExchange/dapper-dot-net) is a popular microORM for the .NET framework that provides a simple way to map database rows to objects. It’s a great alternative when speed is of the essence, and when you just don’t need the functionality offered by EF.

But what happens when you want to do something a bit more complicated? What happens if you want to join across multiple tables into a hierarchy composed of different types of object? Well, then you can use Dapper’s multi-mapping functionality… but that can quickly turn into an awful lot of code to maintain, especially if you make heavy use of Dapper.

Step in Dapper.SimpleLoad (https://github.com/Paymentsense/Dapper.SimpleLoad), which handles the multi-mapping code for you and, if you want it to, the SQL generation as well.

So far so good, but what happens when you want to save your objects back to the database?

With Dapper it’s pretty easy to write an INSERT, UPDATE, or DELETE statement and pass in your object as the parameter source. But if you’ve got a complex object this, again, can quickly turn into a lot of code.

Step in Dapper.SimpleSave (https://github.com/Paymentsense/Dapper.SimpleSave), which you can use to save changes to complex objects without the need to worry about saving each object individually. And, again, you don’t need to write any SQL.

I’ll give you a good overview of both Dapper.SimpleLoad and Dapper.SimpleSave, with a liberal serving of examples. I’ll also explain their benefits and drawbacks, and when you might consider using them in preference to more heavyweight options, such as EF.

Is there more to life than increasing its speed? Web performance: how fast does your website need to be?

How fast does your website need to be?

Web performance is a hot button topic so that question is pretty much guaranteed to start an argument. Perhaps this is more because of the answer – which is, “it depends” – than the question. But it’s fair to say that if much of your business either arrives, or is transacted, online then the answer is pretty darned fast. (It’s also fair to say if the speed of your website is the only differentiator you have from your competitors, you may have bigger problems.)

In this post I want to cover the following:

  • The relationship between web performance and
    • Key business metrics such as retention, conversion rates, and revenue
    • Mobile computing
    • SEO
  • Ideal benchmark web performance
  • How to improve web performance

That’s obviously quite a lot of ground to cover, so let’s get cracking.

Web Performance & Key Business Metrics

It’s a couple of years old now but Tammy Everts’ excellent post on the web “performance poverty line” still rings true. You can find a more recent reworking here, although the graphs are the same.

I’m not going to rehash everything she said because there’s really no point, but is it honestly beyond the bounds of possibility that if she were to redraw the graphs for 2014 then the lines might fall something like this?

(Charts: landing page speed versus bounce rate; landing page speed versus pages-per-visit fall-off; landing page speed versus conversion-rate fall-off.)

No, I don’t think so either. Nobody’s become any more tolerant of slow websites in the last two years.

It’s worth pointing out that the performance poverty line is NOT an absolute line for all websites, in contrast to the way I’ve sometimes seen it presented. Tammy took data for 5 companies that were Strangeloop customers and suggests that you should collect your own data from your own site to find where your performance poverty line is. Nevertheless, I think the line at 8 seconds is a good ballpark figure.

What it means is that for page loads over 8 seconds, relatively small improvements in performance will make little or no difference to key business metrics because you’ve already lost people. For example, you’re unlikely to see any improvement in bounce rate, pages per visit, or conversion rate if you just improve your loading time from 10 seconds to 8 seconds. You need to halve your page load time, or better, to see any real improvement.

Companies like Amazon and Facebook take this very seriously, and have hard numbers for the negative effect poor performance can have on both revenue and engagement.

In 2006 Amazon announced that revenue increased by 1% for every 100ms they were able to shave off page load times: a claim that you can find on slide 10 of their 2009 Make Data Useful presentation. Strangeloop went on to create an infographic illustrating this for Amazon, along with several other major websites:

Illustration of performance findings across different websites from Strangeloop.

(Click to see a larger version. NB. They’re happy for people to reproduce this.)

To summarise:

  • Shopzilla saw a 12% revenue increase after improving average page load times from 6 seconds to 1.2 seconds.
  • Amazon saw 1% revenue increase for every 100ms improvement.
  • AOL found that visitors in the top ten percentile of site speed viewed 50% more pages than visitors in the bottom ten percentile.
  • Yahoo increased traffic by 9% for every 400ms improvement.
  • Mozilla estimated 60 million more Firefox downloads as a result of making page loads 2.2 seconds faster.

I also mentioned Facebook. They’re far from my favourite site, but back in 2010 at Velocity China they revealed that an extra 500ms on page load times led to a 3% drop-off in traffic, and 1000ms led to a 6% drop-off. One suspects that as page loads get slower still that nice linear relationship probably turns into a cliff drop.

And the evidence goes back even further. Remember how, in the late 90s, that search engine nobody had heard of – Google – managed to trounce all opposition? One of the major reasons for that (apart from better search results) was that the homepage was incredibly sparse, such that it loaded very quickly even over the slowest of dial-up connections. This was in stark contrast to the bloated, content-laden homepages (relatively speaking – remember, connections were slow) of sites such as AltaVista and Yahoo. Here’s AltaVista’s homepage on January 17th, 1999. Ironically they were doing a better job back in 1996.

I’m not seriously suggesting that in the case of your site you’ll definitely lose 1% of revenue for every extra 100ms on page load time. Amazon has an extraordinarily broad customer base, whereas in a niche you might not suffer as badly… alternatively, you might do even worse. If you collect performance metrics from your site you should be able to figure out the real impact for yourself.
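To make that concrete, here’s a back-of-envelope sketch. Every input below is hypothetical except Amazon’s reported 1%-per-100ms figure, and, as above, there’s no guarantee that relationship holds for your site:

```javascript
// Back-of-envelope estimate only: the revenue and improvement figures
// are made up; the 1% per 100ms lift is Amazon's reported number, which
// may not apply to your site at all.
const annualRevenue = 2000000;  // hypothetical: $2m/year
const liftPer100ms = 0.01;      // Amazon's reported figure
const improvementMs = 400;      // hypothetical: pages load 400ms faster

const estimatedAnnualLift =
  annualRevenue * liftPer100ms * (improvementMs / 100);

console.log(estimatedAnnualLift); // 80000, i.e. $80k/year if the figure held
```

Plug in your own metrics and you’ll at least have a defensible order-of-magnitude answer for the business case.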

What’s true is that you’ll lose out to faster competitors. You need to be amongst the best of them; ideally you want to beat them. (Unfortunately for any business involved in some kind of online retail activity, unless you’re particularly nichey, one of your competitors probably is Amazon. This is a colossal pain in the backside because their page load times are VERY fast.)

Anyway, to summarize: a faster website leads to higher conversion rates and more revenue. Win!

(Btw, I don’t rate AdSense as an income source but, if you do, a faster site should mean higher bids, which means more money for both you and Google.)

Web Performance & Mobile Computing

I’ve touched on this briefly in my aside above but mobile devices, unless they’re being used with WiFi, are notorious for suffering slow, choppy connections. In theory this gets better with 4G, and particularly with LTE-Advanced (see my previous post). In reality bandwidth caps and contention may make the additional speed and reduced latency of 4G a moot point, so don’t bank on better performance just because the headline figures suggest it’s available.

If you expect a lot of customers to access your site from a mobile device, you should make sure you test on these devices, and make any changes necessary to give users a good experience. DON’T test exclusively on the latest greatest hardware. I realise it’s tiresome but make sure you use the kind of low-end/mid-range smartphones that are common currency. There are still plenty of iPhone 3GSs and 4s, along with a gazillion veteran and scuffed Android devices doing good service.

Web Performance & SEO

SEO’s a bit of a tricky topic, because I (sort of) don’t believe in it. I’m not saying it doesn’t work but the problem is, if overdone, it can backfire quite badly. These days it seems barely a month goes by where I don’t read about another legitimate outfit who’ve been boned by a drop in traffic as Google update their search index filters. MetaFilter springs immediately to mind just because it’s been on HN the past few days, but there are others. (That particular story is sad because it’s had such a severe effect that they’ve had to let staff go, but I digress…)

The point is that nowadays the performance of your website does have an effect on its ranking in search results. The faster your site, the higher it will rank, and vice versa. A faster site is one SEO trick that Google won’t penalize you for, so take advantage of it!

Ideal Benchmark Web Performance

This is another slightly tricky area. Some people will give you a hard figure for this as though it’s holy writ, but I don’t necessarily think that’s helpful. Also, whilst it’s important that you get landing page performance right, you shouldn’t focus on that to the exclusion of your site as a whole. If you offer people a crappy experience once they’ve got past the landing page they’re still going to bail.

You need to benchmark against competitors, ideally over a variety of connection speeds, but at the very least check how you fare against them over a low latency connection to get a good idea of baseline performance. If you need to, set up a VM on Azure or EC2 and remote desktop into it, then check speeds from there. You don’t necessarily need to be the fastest site on the web, but you want to be amongst the fastest (or better if you can) as compared to your competitors.

You can use services such as Neustar for more systematic testing under load from a variety of locations. You can even use them on your competitors but I wouldn’t recommend it because they probably won’t be very happy with you, and may lawyer up.

If you really want some figures to aim at, Amazon’s numbers aren’t a bad target:

  • <200ms time to first byte,
  • <500ms to render above the fold content,
  • <2000ms for a complete page load

(NOTE: these measurements were taken on a connection with ~5ms latency. You won’t see this performance over, for example, a home broadband connection, or 3G. The effect of a slower connection compounds on slower sites though, often because of roundtripping. You should test your site over the kinds of connections your target audience will use, and on the kinds of devices they use, especially low-end laptops, cheap tablets, mobiles with no 4G connectivity, etc.)

They actually aren’t that hard to achieve. One situation in which you may find them more of a struggle is if you’re using a CMS: optimisation could require customisation, but you’ll often find plugins that can help you. WordPress, for example, offers plenty.

You want to improve the average page load, so make sure you load test under circumstances that emulate your anticipated usage patterns. This used to be a hassle but nowadays services such as the aforementioned Neustar make it pretty straightforward.

How To Improve Web Performance

There are two key areas for improvement:

  • Time to first byte (server-side optimization)
  • Client-side processing, loading and rendering

Taking latency out of the picture, time to first byte (TTFB) is a function of how much work you have to do on the server before you start returning page data. Lots of data retrieval or dynamic generation on the server side can have a devastating effect on time to first byte. Web servers are never faster than when serving static content so you want to get as close to this as possible, particularly for landing pages.

For example, if you need to present a lot of user-specific information, instead of executing half a dozen SQL queries to retrieve the data, consider storing a blob of JSON in a key-value store so you can quickly look it up and return it by user ID. You can even use caching and indexing software, such as Endeca, to help if you feel the complexity is warranted. Selective denormalisation of data can really improve performance. You can also defer work until after the page load by retrieving data asynchronously via AJAX or similar; this will improve the perceived performance of your site even if some page elements aren’t completely rendered immediately (you can often insert placeholder information to help as well).
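The denormalised-blob idea looks something like the following sketch – in JavaScript for brevity, with an in-memory Map standing in for a real key-value store such as Redis, and buildUserSummary() standing in for the half-dozen SQL queries:

```javascript
// Sketch only: kvStore stands in for Redis/memcached, and
// buildUserSummary() stands in for several expensive SQL queries.
const kvStore = new Map();

function buildUserSummary(userId) {
  // Imagine half a dozen SQL queries here; we fake the result.
  return { userId, recentOrders: 3, loyaltyPoints: 120 };
}

function getUserSummary(userId) {
  const key = `user-summary:${userId}`;
  let blob = kvStore.get(key);
  if (blob === undefined) {
    // Cache miss: do the expensive work once, store the JSON blob.
    blob = JSON.stringify(buildUserSummary(userId));
    kvStore.set(key, blob);
  }
  // Every subsequent request is a single key lookup plus a parse.
  return JSON.parse(blob);
}
```

The trade-off, of course, is that you now have to invalidate or rebuild the blob when the underlying data changes.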

Note that TTFB is a concept that applies to any request sent over HTTP, so it’s as applicable to any AJAX/web service requests made within your page as it is to the initial page load. Make sure you pay attention to both!

Client-side performance is about minimising the payload you deliver (image sizes, CSS and JS minification, etc.), and the number of requests. It’s also about minimising blocking, so move JavaScript loading to the end of the page. By default JavaScript loading blocks because your browser has to assume there’s code in there it might need to execute immediately. You want to make sure that nothing slows the rendering of the above the fold portions of your pages, and moving <script> tags further down the page is one very good way of doing so.

Services such as Google PageSpeed Insights and Yahoo! YSlow can help you do this by telling you exactly what you need to optimise. Just point them at the appropriate URL, or install the extension in the case of YSlow, and set them off.

They’ll often tell you to put static resources, like images, on a CDN but this can be a mixed blessing. You might realise a bit of extra speed, but you’ll also lose out on SEO juice if people post links to these resources because they’ll be linking to files on a CDN, not on your website. (Yeah, I know, I know: I’m supposed to be uncomfortable with SEO, but you do need to give it some consideration.)

All of this is time, effort, and money so, if you’re desperate or lazy (and even if you’re neither) you can cheat…

Google PageSpeed Service claims to be able to improve website performance by 20-60%. Whether you believe that or not you lose nothing by at least giving it a go, even if you’re actively working on other optimisations.

To test it out, visit webpagetest.org and hand over the URL of one of your landing pages. It’ll queue up your test and, when it’s finished, present you with results like this:

Basic results for Google PageSpeed Service test, including video comparison.

(Sorry Autotrader, I’m not picking on you: I’ve just been looking at motorbikes recently and noticed your site could be a bit faster.)

The video comparison is kind of cool. You can see that with www.autotrader.co.uk (which I tested from Dublin, Ireland), the above the fold content on the optimized page appears much more quickly. However, there’s nothing quite like hard numbers, so I like the filmstrip comparison, and this sequence really highlights the differences in above the fold performance:

Timeline showing start of above the fold rendering at 0.6 seconds for optimized page. Timeline showing start of above the fold rendering for unoptimized page. Completion of above the fold rendering for optimized page. Completion of above the fold rendering for unoptimized page.

(You can click through the thumbnails for a larger view.)

I’ve switched to a Thumbnail Interval of 0.1 seconds, which shows that above the fold content begins to render at 0.6 seconds for the optimized version, as opposed to 2.2 seconds for unoptimized. That’s a full 1.6 second improvement, which is massive. Unfortunately it still doesn’t complete until 4.7 seconds, which isn’t great, but still better than 5.4 seconds for the original.

The total load time is only about 10% better for the optimized version – 4.9 seconds vs. 5.5 seconds – but the improvement in above the fold performance is key, because that’s what defines the user’s experience.

So how does this work? Google basically proxies your site. It sits between your server and your users, optimizes your pages, and serves the optimized versions instead of the versions on your servers. It is smart, though, so it will retrieve dynamic content from your servers whenever it’s needed. The only hassle is that to use it for real you’ll obviously need to update your DNS configuration.

As I say, they claim a 20%-60% improvement, but for dynamic sites you should realistically expect to achieve something at the lower end of that range. Also, what it often can’t overcome is a very poor TTFB because it’s not as if it can make your server any faster. Things will probably be a little better but if you have big problems you’re going to have to do some work yourself (or you could get in touch and hire me to do it for you!).

One surprising outcome of using PageSpeed Service is that sometimes overall page load times can increase. That might sound bad but, as I’ve already said, it’s the user experience that really counts: if above the fold render performance improves you’re still onto a winner.

Another reason you may not see the speed gains you hope for is that non-cacheable resources cannot be proxied by PageSpeed Service. For some resources you won’t be able to do anything about this, but you should make sure any resources that can be set cacheable are.

Final point on PageSpeed Service: you’re probably wondering about cost. Companies like Akamai offer similar services for serious $$$$ but, for now, the good news is that PageSpeed Service is free. Google do plan to charge for it, but they’ve said you’ll get 30 days notice before you have to start forking over cash, and can cancel within that period.

Conclusion

Hopefully it’s clear by now that a focus on performance leads to improvements in key business metrics related to both engagement and revenue. You also understand the need to consider mobile computing, and the potential for improved search ranking through higher performance. Finally you should have a pretty good idea of exactly what you’re aiming for performance-wise, and how to get there, by focussing on specific areas of improvement on both server and client.

Timing is everything in the performance tuning game: learn to choose the right metrics to hunt down bottlenecks

So much of life is about timing. Just ask David Davis. He was arrested after getting into a scuffle whilst having his hair cut:

David Davis with half a haircut in his police mugshot.

Bad timing, right?

But that’s not really the kind of timing I’m talking about. When you’re performance tuning an application an understanding of timing is crucial to success – it can reveal truth that would otherwise remain masked. In this post I want to cover three topics:

  • The different types of timing data you can collect, and the best way to use them,
  • Absolute versus relative timing measures, and
  • The effect of profiling method (instrumentation versus sampling) on the timing data you collect.

Let’s start off with the first…

Regardless of your processor architecture, operating system, or technology platform most (good) performance profiling software will use the most accurate timer supported by your hardware and OS. On x86 and x64 processors this is the Time Stamp Counter, but most other architectures have an equivalent.

From this timer it’s possible to derive a couple of important metrics of your app’s performance:

  • CPU time – that is, the amount of time the processor(s) spend executing code in threads that are part of your process ONLY – i.e., exclusive of I/O, network activity (e.g., web service or web API calls), database calls, child process execution, etc.
  • Wall clock time – the actual amount of time elapsed executing a particular piece of code, such as a method, including I/O, network activity, etc.
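A quick Node.js sketch makes the distinction concrete (names and timings are illustrative; the profilers discussed here derive the same two metrics from the hardware timer):

```javascript
// Illustrative sketch of the two metrics. process.cpuUsage() counts only
// time the CPU spent executing this process's code, while Date.now()
// measures elapsed wall clock time, waits included.
const startCpu = process.cpuUsage();
const startWall = Date.now();

// Simulate an out-of-process wait (a database call, say) with a
// synchronous 200ms sleep: wall clock time passes, CPU time barely moves.
Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, 200);

// Simulate in-process work: a busy loop burns both CPU and wall time.
let acc = 0;
for (let i = 0; i < 1e7; i++) acc += i;

const cpu = process.cpuUsage(startCpu);
const cpuMs = (cpu.user + cpu.system) / 1000; // microseconds -> ms
const wallMs = Date.now() - startWall;
console.log({ cpuMs, wallMs }); // wallMs includes the 200ms wait; cpuMs doesn't
```

The gap between the two numbers is exactly the “time spent out of process” that we’ll see masking problems below.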

Different products might use slightly different terminology, or offer subtly differing flavours of these two metrics, but the underlying principles are the same. For this post I’ll show the examples using ANTS Performance Profiler but you’ll find that everything I say is also applicable to other performance tools, such as DotTrace, the Visual Studio Profiling Tools, and JProfiler, so hopefully you’ll find it useful.

The really simple sequence diagram below illustrates the differences between CPU time and wall clock time for executing a method called SomeMethod(), which we’ll assume is in a .NET app, that queries a SQL Server database.

Sequence diagram illustrating the difference between wall clock and CPU time.

The time spent actually executing code in SomeMethod() is represented by regions A and C. This is the CPU time for the method. The time spent executing code in SomeMethod() plus retrieving data from SQL Server is represented by regions A, B, and C. This represents the wall clock time – the total time elapsed whilst executing SomeMethod(). Note that, for simplicity’s sake:

  • I’ve excluded any calls SomeMethod() might make to other methods in your code, into the .NET framework class libraries, or any other .NET libraries. Were they included, these would form part of the CPU time measurement because this is all code executing on the same thread within your process.
  • I’ve excluded network latency from the diagram, which would form part of the wall clock time measurement.

Most good performance profilers will allow you to switch between CPU and wall clock time. All the profilers I mentioned above support this. Here’s what the options look like in ANTS Performance Profiler; other products are similar:

Timing options in Red Gate's ANTS Performance Profiler

There’s also the issue of time in method vs. time with children. Again the terminology varies a little by product but the basics are:

  • Time in method represents the time spent executing only code within the method being profiled. It does not include callees (or child methods), or any time spent sleeping, suspended, or out of process (network, database, etc.). It follows from this that the absolute value of time in method will be the same regardless of whether you’re looking at CPU time, or wall clock time.
  • Time with children includes time spent executing all callees (or child methods). When viewing wall clock time it also includes time spent sleeping, suspended, and out of process (network, database, etc.).
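The two metrics above can be sketched with a toy instrumentation layer – illustrative only, with Date.now() as a stand-in for the high-resolution timers a real profiler would use:

```javascript
// Toy instrumentation sketch: "time with children" is everything between
// entering and leaving a method; "time in method" subtracts the time
// credited to callees. All names here are illustrative.
const stats = new Map(); // name -> { total: ms, childTime: ms }
const callStack = [];

function instrument(name, fn) {
  return (...args) => {
    callStack.push(name);
    const start = Date.now();
    const result = fn(...args);
    const elapsed = Date.now() - start;
    callStack.pop();

    const s = stats.get(name) || { total: 0, childTime: 0 };
    s.total += elapsed;
    stats.set(name, s);

    // Credit this call's elapsed time to its caller as "child time".
    const caller = callStack[callStack.length - 1];
    if (caller) {
      const c = stats.get(caller) || { total: 0, childTime: 0 };
      c.childTime += elapsed;
      stats.set(caller, c);
    }
    return result;
  };
}

// sleep() stands in for real work: it burns wall clock time synchronously.
const sleep = ms =>
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);

const child = instrument('child', () => sleep(100));
const parent = instrument('parent', () => { sleep(50); child(); });
parent();

const p = stats.get('parent');
const timeWithChildren = p.total;           // ~150ms: includes the child call
const timeInMethod = p.total - p.childTime; // ~50ms: parent's own code only
console.log({ timeInMethod, timeWithChildren });
```

Real profilers do essentially this bookkeeping, just with far lower overhead and per-thread timers.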

OK, let’s take a look at an example. Here’s a method call with CPU time selected:

CPU times for method

And here’s the same method call with wall clock time selected:

Wall clock times for method

Note how in both cases Time (ms), which represents time in method, is the same at 0.497ms, but that with wall clock time selected the time with children is over 40 seconds as opposed to less than half a second. We’ll take a look at why that is in a minute. For now all you need to understand is that this is time spent out of process, and it’s the kind of problem that can easily be masked if you look at only CPU time.

All right, so how do you know whether to look at CPU time or wall clock time? And are there situations where you might need to use both?

Many tools will give you some form of real-time performance data as you use them to profile your apps. ANTS Performance Profiler has the timeline; other tools have a “telemetry” view, which shows performance metrics. The key is to use this, along with what you know about the app, to gain clues as to where to look for trouble.

The two screengrabs above are from a real example on the ASP.NET MVC back-end support systems for a large B2B ecommerce site. They relate to the user clicking on an invoice link from the customer order page. As you’d expect this takes the user to a page containing the invoice information, but the page load was around 45 seconds, which is obviously far too long.

Here’s what the timeline for that looked like in ANTS Performance Profiler:

ANTS Performance Profiler timeline for navigating from order to invoice page on internal support site.

(Note that I’ve bookmarked such a long time period not because the profiler adds that much overhead, but because somebody asked me a question whilst I was collecting the data, so there was a delay before I clicked Stop Live Bookmark!)

As you can see, there’s very little CPU activity associated with the worker process running the site; just one small spike over to the left.

This tells you straight away that the time isn’t being spent on lots of CPU intensive activity in the website code. Look at this:

Call tree viewing CPU time - doesn't look like there's much amiss.

We’re viewing CPU time and there’s nothing particularly horrendous in the call tree. Sure, there’s probably some room for optimisation, but absolutely nothing that would account for the observed 45 second page load.

Switch to wall clock time and the picture changes:

Call graph looking at wall clock time - now we're getting somewhere!

Hmm, looks like the problem might be those two SQL queries, particularly the top one! Maybe we should optimise those*.

Do you see how looking at the “wrong” timing metric masked the problem? In reality you’ll want to use both metrics to see what each can reveal and you’ll quickly get to know which works best in different scenarios as you do more performance tuning.

By the way: for those of you working with Java, JProfiler has absolutely great database support with multiple providers for different RDBMSs. I would highly recommend you check it out.

You may have noticed that throughout the above examples I’ve been looking at absolute measurements of time, in this case milliseconds. Ticks and seconds are often also available, but many tools often offer relative measurements – generally percentages – in some cases as the default unit.

I find relative values often work well when looking at CPU time but that, generally, absolute values are a better bet for wall clock time. The reason for this is pretty simple: wall clock time includes sleeping, waiting, suspension, etc., and so often your biggest “bottleneck” can appear to be a single thread that mostly sleeps, or waits for a lock (e.g., the Waiting for synchronization item in the above screenshots). This will often be something like the GC thread and the problem is, without looking at absolute values, you’ve no real idea how significant the amounts of time spent in other call stacks really are. Switching to milliseconds or (for really gross problems – the above would qualify) seconds can really help.

Let’s talk about instrumentation versus sampling profiling and the effect this has on timings.

Instrumentation is the more traditional of the two methods. It actually modifies the running code to insert extra instructions that collect timing values throughout the code. For example, instructions will be inserted at the start and end of methods and, depending upon the level of detail selected, at branch points in the code, or at points which mark the boundaries between lines in the original source. Smarter profilers need only instrument branch points to calculate accurate line-level timings, and therefore impose less overhead in use.

Back in the day this modification would be carried out on the source code, and this method may still be used with C++ applications. The code is modified as part of the preprocessor step. Alternatively it can be modified after compilation but before linking.

Nowadays, with VM languages, such as those that run in the JVM or the .NET CLR, the instrumentation is generally done at runtime just before the code is JITed. This has a big advantage: you don’t need a special build of your app in order to diagnose performance problems, which can be a major headache with older systems such as Purify.
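Conceptually, the inserted instructions amount to a timing wrapper around each method. Here’s a toy sketch in Python (the `instrument` decorator and `timings` table are my own illustrative names, not any real profiler’s API):

```python
import functools
import time

# Accumulated inclusive time per method, as an instrumenting profiler would record.
timings = {}

def instrument(func):
    """Wrap a function with timing instructions at entry and exit."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            timings[func.__name__] = timings.get(func.__name__, 0.0) + elapsed
    return wrapper

@instrument
def slow_query():
    time.sleep(0.05)  # stand-in for a database round trip

for _ in range(3):
    slow_query()

print(timings)
```

A real profiler does the equivalent at the bytecode/IL level for every method, which is why instrumentation carries a noticeable overhead compared to sampling.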

Sampling is available in more modern tools and is a much lower overhead, albeit less detailed, method of collecting performance data. The way it works is that the profiler periodically takes a snapshot of the stack trace of every thread running in the application. It’ll generally do this many times a second – often up to 1,000 times per second. It can then combine the results from the different samples to work out where most time is spent in the application.

Obviously this is only good for method-level timings. Moreover, methods that execute very quickly often won’t appear in the results at all, or will have somewhat skewed timings (generally on the high side) if they do. Timings for all methods are necessarily relative, and any absolute timings are estimates based on the number of samples containing each stack trace relative to the overall length of the selected time period.
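To make this concrete, here’s a toy sampling profiler in Python (it uses CPython’s sys._current_frames to snapshot the main thread’s stack; a real profiler samples every thread with far lower overhead, and the helper names here are mine):

```python
import collections
import sys
import threading
import time

samples = collections.Counter()
stop = threading.Event()

def sampler(interval=0.001):
    """Periodically snapshot the main thread's stack and count the top frame."""
    main_id = threading.main_thread().ident
    while not stop.is_set():
        frame = sys._current_frames().get(main_id)
        if frame is not None:
            samples[frame.f_code.co_name] += 1
        time.sleep(interval)

def busy_work(duration):
    # Spin to give the sampler something to catch.
    end = time.perf_counter() + duration
    while time.perf_counter() < end:
        pass

t = threading.Thread(target=sampler, daemon=True)
t.start()
start = time.perf_counter()
busy_work(0.3)
elapsed = time.perf_counter() - start
stop.set()
t.join()

total = sum(samples.values())
# Absolute timings are estimates: samples containing a frame / total samples × elapsed.
for name, count in samples.most_common(3):
    print(f"{name}: ~{count / total * elapsed:.3f}s ({count} samples)")
```

Note how the absolute figure is only an estimate derived from sample counts, and how a very fast method may simply never be caught in a sample at all.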

Furthermore most tools cannot integrate ancillary data with sampling. For example, ANTS Performance Profiler will not give information about database calls, or HTTP requests, in sampling mode since this data is collected using instrumentation, which is how it is able to tell you – for example – exactly where queries were executed.

Despite these disadvantages, because of its low overhead, and because it doesn’t require modification of app code, sampling can often be used on a live process without the need for a restart before and after profiling, which makes it a good option for apps in production.

The effect of all of this on timing measurements if you’ve opted for sampling rather than instrumentation profiling is that the choice of wall clock time or CPU time becomes irrelevant. This is because whilst your profiler knows the call stack for each thread in every sample, it probably won’t know whether or not the thread was running (i.e., it could have been sleeping, suspended, etc.) – figuring this out could introduce unacceptable overhead whilst collecting data. As a result you’ll always be looking at wall clock time with sampling, rather than have the choice as you do with instrumentation.

Hopefully you’re now equipped to better understand and use the different kinds of timing data your performance profiler will show you. Please do feel free to chime in with questions or comments below – feedback is always much appreciated and if you need help I’d love to hear from you.

*Optimising SQL is beyond the scope of this post but I will cover it, using a similar example, in the future. For now I want to focus on the different timing metrics and what they mean, to help you understand how to get the best out of your performance profiler. That being said, your tool might give you a handy hint, so it’s not even as if you need to do that much thinking for yourself (but you’ll still look sharp in front of your colleagues)…

ANTS Performance Profiler hinting that the problem may be SQL-related.

Just don’t let them get a good look at your screen!

Live Bookmarking in ANTS Performance Profiler: a killer feature to help you zero in on performance problems fast

Last week I sat with Simon, one of my client’s managers, as he showed me around their new customer support centre web app, highlighting slow-loading pages. Simon, along with a couple of others, has been playing guinea pig, using the new support centre in his day to day work.

The main rollout is in a few weeks but the performance problems have to be fixed first so support team members don’t spend a lot more time on calls, forcing customers to wait longer on hold before speaking to someone. Potentially bad for costs, customer satisfaction, and team morale!

Simon gave me a list of about a dozen trouble spots and I remoted into their production box to profile them all. I had to collect the results and get off as quickly as possible to avoid too much disruption; I could analyse them later on my own laptop. This gave me plenty of time to hunt down problems and suggest fixes.

I used Red Gate’s ANTS Performance Profiler throughout. One of the many helpful features it includes is bookmarking. You can mark any arbitrary time period in your performance session, give it a meaningful name (absolutely invaluable!), and use that as a shortcut to come back to it later.

For example, here I’ve selected the “Smart search” bookmark I created whilst profiling the support centre:

Timeline with bookmarked region selected.

The call tree shows me the call stacks that executed during the bookmarked time period. Towards the bottom you can see that SQL queries are using the vast majority of time in this particular stack trace:

Call tree showing call stacks within bookmarked region on timeline.

(Having identified SQL as the problem, I took these queries and analysed them in more detail using both their execution plans and SQL Server’s own SQL Profiler. I then suggested more efficient queries that could be used by NHibernate via repository methods.)

Also note we’re looking at wall clock time as opposed to CPU time. I won’t talk about the differences in detail here; what you need to understand is that wall clock time represents actual elapsed time. This matters because the queries execute in SQL Server, outside the IIS worker process running the site. Under CPU time measurements, which only include time spent in-process, they wouldn’t appear as significant contributors to overall execution time.

Back on point: bookmarking is great as far as it goes, but you have to click and drag on the timeline after the fact to create a bookmark yourself. In the midst of an involved profiling session this is a hassle and can be error prone: what if, by mistake, you don’t drag out the full region you need to analyse? All too easily done, and as a result you can miss something important in your subsequent analysis.

Enter Live Bookmarks.

Basically, whilst profiling, you hit a button to start the bookmark, do whatever you need to do in your app, then hit a button to save the bookmark. Then you repeat this process as many times as you need. No worries about missing anything.

Here’s how it goes in detail:

  1. Start up a profiling session in ANTS Performance Profiler.
  2. Whilst profiling, click Start Bookmark. (To the right of the timeline.)

Start a live bookmark.

  3. Perform some action in your app – in my case I was clicking links to navigate problem pages, populate data, etc.
  4. Click Stop Bookmark.

Stop (and save) a live bookmark.

  5. The bookmark is added to the list on the right hand side of the timeline. It’s generally a good idea to give your bookmark a meaningful name. To do this just click the pencil icon next to the bookmark and type in the new name.

Rename bookmark.

  6. Rinse and repeat as many times as you need.
  7. When you’ve finished, click Stop in the top-left corner of the timeline to stop profiling.

Stop profiling.

It’s a good idea to save your results for later using File > Save Profiler Results, just in case the worst happens, and of course you can analyse them offline whenever you have time.

And that’s it: nice and easy, and very helpful when it comes to in-depth performance analysis across a broad range of functionality within your application.

Any questions/comments, please do post below, or feel free to get in touch with me directly.