HTTP caching

Yesterday I configured Squid on my internal network; machines on the office LAN can use it if configured to use an HTTP proxy, while machines on the wifi LAN are forced to use it as a transparent proxy via port forwarding on the router (I'm slowly making the wifi LAN more and more like a cheap ISP's network - it's an open wifi, so I'm keen to force its users to be well-behaved).

The thing is, watching Squid's logs, I was horrified at just how few pages it felt it could cache. I'd always imagined, when developing Web apps, that anything fetched with GET could be cached for a while (and might even be prefetched). But when I actually dug a little deeper, I found that just about anything dynamically generated (including essentially static pages that just use a bit of PHP to include the same navigation in every page, with the currently selected option highlighted, for example) is generally not cacheable unless the script author has made a special effort.

You can check the cacheability of pages with this useful cacheability testing tool.

Blog software is terrible at this, for example, despite blogs generally having very cacheable pages.

Rather than explain the rules in detail here, I'll link to somebody who already has. In particular, read the section on writing cache-aware scripts.

As a quick fix for many Web apps, you can use Apache's mod_expires to add cache control headers to pages. This lets you tag pages with a maximum cache period, meaning that the page may be cached for the specified period after it is fetched; on this very blog, I have modified my .htaccess file to read:

  <IfModule mod_expires.c>
  ExpiresActive On
  ExpiresDefault "access plus 1 hour"
  ExpiresByType image/jpeg "access plus 1 week"
  ExpiresByType image/gif "access plus 1 week"
  ExpiresByType image/png "access plus 1 week"
  </IfModule>

That marks pages as being cacheable for up to an hour - meaning that some people might not see a change to the blog for up to an hour after it was made. But it's not a great solution on its own; as well as expiry-time based cache headers, HTTP also supports validation, whereby a page emits a Last-Modified and/or ETag header, both of which change when the underlying content changes. Clients can then check whether there's a new version available with a request containing an If-Modified-Since and/or If-None-Match header. This means that when a page is requested again, the browser or proxy can quickly check whether its cached copy is still up to date, and use it if it is.
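
A PHP script can handle such validation requests itself. The following is only a rough sketch of the idea, not anything from a real application; the mtime of the script file stands in for the real content's modification time, which an actual page would look up elsewhere:

  <?php
  // Emit Last-Modified and ETag, and answer conditional requests with
  // 304 Not Modified when nothing has changed.
  $last_modified = filemtime(__FILE__);  // stand-in for the content's real mtime
  $etag = '"' . md5($last_modified) . '"';

  header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $last_modified) . ' GMT');
  header('ETag: ' . $etag);

  $inm = isset($_SERVER['HTTP_IF_NONE_MATCH']) ? $_SERVER['HTTP_IF_NONE_MATCH'] : '';
  $ims = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ? strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) : false;

  if (($inm !== '' && $inm === $etag) ||
      ($ims !== false && $ims >= $last_modified)) {
      header('HTTP/1.1 304 Not Modified');
      exit;  // the cached copy is still good; send no body
  }

  // ...otherwise fall through and generate the page as usual.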

Ideally, you want to combine validation with expiry caching; Apache will, by default, generate validation headers for any static files (such as the jpegs, gifs, and pngs I explicitly mention in my .htaccess) but will not put out any expiry headers. By adding the expiry headers above, I specify that I'm definitely not going to be changing my images very often (sure, I may add new ones...), so clients can cache them for up to a week before even bothering to validate them.

Now, WordPress generates both a Last-Modified and an ETag on its RSS feeds; when I get a moment, I'm going to figure out why it doesn't do this for the HTML pages too, and see if I can fix it... Perhaps I could also write a WordPress plugin that lets the user set up expiry caching headers for their pages. Even working out the last modified date of a page takes a fair amount of work for WordPress (it has to consult the database to see when the most recent post or comment was, in particular), so if you can put up with people not seeing changes the very moment they're made, expiry headers can save a lot of server load.
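
As a very rough sketch of what such a plugin might look like (the hooks and conditional tags are WordPress's own; the plugin name, function name, and one-hour lifetime are made up for illustration, and a real plugin would make the lifetime configurable):

  <?php
  /*
  Plugin Name: Expiry Headers (illustrative sketch only)
  */

  // Emit Expires and Cache-Control headers on normal front-end page views.
  function expiry_headers() {
      if (is_feed()) {
          return;  // WordPress already emits validation headers on feeds
      }
      $max_age = 3600;  // one hour, as in the .htaccess example above
      header('Cache-Control: max-age=' . $max_age);
      header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $max_age) . ' GMT');
  }
  // template_redirect fires on the front end after the query, before any output
  add_action('template_redirect', 'expiry_headers');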

However, it strikes me that tools like PHP really need better support for HTTP efficiency. Automatic generation of ETag and Content-Length headers would be a good start; this could be done by spooling the output of the PHP page to memory or a file, generating a hash and counting the bytes as it goes, so that the ETag and Content-Length can be output before the body. However, when a validation request based on that ETag comes in, the PHP script would still need to be run to see whether the output has changed since then; if it hasn't, all you're saving is network bandwidth rather than the server load of running the script.
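
A rough sketch of that spooling idea in plain PHP, using output buffering (the included page.php is just a stand-in for whatever actually generates the page):

  <?php
  ob_start();                       // spool all output to memory

  include 'page.php';               // stand-in for the real dynamic page

  $body = ob_get_clean();           // grab the spooled output
  $etag = '"' . md5($body) . '"';   // hash it for the ETag

  header('ETag: ' . $etag);
  header('Content-Length: ' . strlen($body));

  // The script has already run by this point, so matching a validation
  // request here only saves bandwidth, not server load.
  if (isset($_SERVER['HTTP_IF_NONE_MATCH']) && $_SERVER['HTTP_IF_NONE_MATCH'] === $etag) {
      header('HTTP/1.1 304 Not Modified');
  } else {
      echo $body;
  }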

Generating a Last-Modified header automatically is a bit trickier.

One approach would be to keep a little database of URLs alongside the PHP script (perhaps a GDBM file), since the same PHP script can serve many different URLs via path info or query parameters, and to store, for each URL, the resulting ETag and the date/time at which it last changed. On every request the script can then compare the freshly generated ETag to what's in the database: if it's the same, use the Last-Modified time stored in the database; if not, update the database entry to the present time and use that for the Last-Modified. However, this still suffers from the problem that validation requests require the script to run, so a validation match merely saves network resources.

That's better than nothing, but as the next level of support, one could add a library function, to be called near the top of a page and given a last modification timestamp, an ETag, or both. If just given a timestamp, it would generate an ETag from the timestamp (by hashing it); if just given an ETag, it would generate a last-modified timestamp by comparing the ETag to one stored in a database. If the request was a validation request and it matched, the function would send back a response saying so and terminate script execution immediately, avoiding any wasted time; otherwise, it would just output the validation headers. The function might as well take an optional expiry duration parameter, too, and output Expires and Cache-Control headers when one is given.
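
To make the first of those ideas - the per-URL database of ETags and change times - a little more concrete, here's a rough sketch using PHP's dba extension with its gdbm handler. The file path, key format, and helper name last_modified_for are all invented for illustration, and it assumes the dba extension is built with gdbm support:

  <?php
  // Store "etag|timestamp" per URL; if the freshly computed ETag matches the
  // stored one, reuse the stored change time, otherwise record "now".
  function last_modified_for($url, $etag) {
      $now = time();
      $db = dba_open('/var/cache/myapp/etags.gdbm', 'c', 'gdbm');
      if ($db === false) {
          return $now;  // can't track it; fall back to "changed now"
      }
      if (dba_exists($url, $db)) {
          list($stored_etag, $stored_time) = explode('|', dba_fetch($url, $db), 2);
          if ($stored_etag === $etag) {
              dba_close($db);
              return (int) $stored_time;  // content unchanged since then
          }
      }
      dba_replace($url, $etag . '|' . $now, $db);  // new or changed content
      dba_close($db);
      return $now;
  }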

Now, cache-aware PHP scripts would be written to load as little infrastructure as possible to get access to their database, consult the database to find the last modification timestamp of the data required to generate this page, and then call the library function. Something like:

  <?php
  include "db.php";
  $data = get_data_record ($_GET["id"]);
  // provide a last modified timestamp, but not etag
  // suggest a maximum cache age of 3600 seconds (one hour)
  // for the Cache-Control header, add the must-revalidate flag.
  cache_control ($data['changed'],false,3600,'must-revalidate');
  include "complex-view-logic.php";
  include "user-plugins.php";
  ...
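
For completeness, here is a rough sketch of what such a cache_control() helper itself might look like. It is not an existing library function, and the ETag-only case (looking the timestamp up in a database, as described above) is left out to keep it short:

  <?php
  // Sketch of the proposed helper: given a last-modified timestamp and/or an
  // ETag, answer validation requests with 304 and stop; otherwise emit the
  // validation headers, plus expiry headers if a max age was supplied.
  function cache_control($last_modified = false, $etag = false, $max_age = false, $extra = '') {
      if ($etag === false && $last_modified !== false) {
          $etag = '"' . md5($last_modified) . '"';  // derive an ETag from the timestamp
      }

      $inm = isset($_SERVER['HTTP_IF_NONE_MATCH']) ? $_SERVER['HTTP_IF_NONE_MATCH'] : '';
      $ims = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ? strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) : false;
      if (($etag !== false && $inm === $etag) ||
          ($last_modified !== false && $ims !== false && $ims >= $last_modified)) {
          header('HTTP/1.1 304 Not Modified');
          exit;  // skip the expensive view logic entirely
      }

      if ($last_modified !== false) {
          header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $last_modified) . ' GMT');
      }
      if ($etag !== false) {
          header('ETag: ' . $etag);
      }
      if ($max_age !== false) {
          header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $max_age) . ' GMT');
          header('Cache-Control: max-age=' . $max_age . ($extra !== '' ? ', ' . $extra : ''));
      }
  }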

This is something to add to my proposed feature list for a web application framework... along with automatic extraction of last-modified timestamps when database objects are automatically loaded from parameters declared to be object IDs, which would totally automate cache control in the most common cases.

2 Comments

  • By Ben, Wed 24th Jan 2007 @ 12:21 pm

    Quick fix thingy... how about declaring that a script's output depends on the state of database tables x, y, and z. Add triggers to work out when the last changes were. Use for last modified and working out whether to rerun the script.

  • By @ndy, Thu 25th Jan 2007 @ 11:28 am

    Hmmm... If you return a cookie then you automatically become non cacheable (sp?). This means that you can't use one in the backend to work out when the user last requested the page. It's also really hard to track it via source IP & timestamp logging. In the stoictv.com database I keep an mtime for each item in the DB but, without knowing when your user last visited, you can't accurately use it for very much.

    My initial hunch was that you needed to know the user's "last seen ETag" in order to get the most efficient use of server resources. Now, if you do know it then, if the spec permits, you could send the new page all in a single response when their ETag is old; that saves you a little network traffic. If you don't then you just end up with a maximum of two requests.

    However, like Ben implies, if you have a table consisting of ETags, object IDs and mtimes, you can compare those to the actual object mtimes and return the stored ETag if you have a match. With clever use of indexing and/or triggers you can make that a really cheap operation. If the client is happy then they will go away and you won't hear from them again. You could even store your own, pre-rendered version of the page in your ETag table. Then, if you get a "cache miss" you can fob the client off and you can be pretty sure that it'll be coming back quite soon, so you can immediately start rendering the latest version of the page into your cache table. So, I guess, by having a "front end" that models their cache, you only have to render your pages once and then serve everything out of your cache table. The stoictv.com engine doesn't currently use the mtimes to do this but I might try it out soon...


Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales