HOWTO: Caching on the cheap

A project I’ve been working on lately is going to be featured in a Super Bowl commercial this weekend. It’s probably safe to assume that the unauthenticated landing pages will get the most traffic. These pages display some user-generated content (read: database calls). This is how we cached them on the cheap:

Step 1: Generate the cache files:

When you run this script, it loops over all of the PATHS on the given BASE_URL for each of the given PROTOCOLS and stores the fetched pages in a cache directory, one level up from where the script is located. The script assumes that every path defined in PATHS ends in a ‘/’, and it caches a copy for the full URL both with and without that ‘/’ (having two separate files per resource makes it easier to write our .htaccess rules next).

bin/gen_cache.sh

#!/bin/bash

PATHS="/ /mobile/ /fb/"
PROTOCOLS="http"
BASE_URL="://example.com"

#-- CD into the project directory
cd "$(dirname "$0")/.." || exit 1
if [ ! -d "cache" ]; then
  mkdir "cache"
fi
cd "cache"

for path in $PATHS; do
  for protocol in $PROTOCOLS; do
    base="${path//\//}"
    case "$protocol" in
      "http")
        cache_suffix="_off_cached.html"
        ;;
      "https")
        cache_suffix="_on_cached.html"
        ;;
    esac
    out_file="${base}${cache_suffix}"

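    #-- Fetch the resource into a temporary file so a half-written page is never served
    #   -f makes curl exit non-zero on HTTP errors (4xx/5xx); -sS hides the progress
    #   meter but still prints error messages
    curl -fsS "${protocol}${BASE_URL}${path}" > "${out_file}.tmp"
    if [ 0 -ne $? ]; then
      #-- If the page did not download correctly, do not cache it
      rm -f "${out_file}.tmp" "${out_file}"
    else
      #-- Only continue if the page actually loaded
      mv "${out_file}.tmp" "${out_file}"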

      #-- If the resource is a sub directory create the alternative name for it
      #   This is so we can support /my_resource AND /my_resource/
      if [ -n "${base}" ]; then
        if [ ! -d "${base}" ]; then
          mkdir "${base}"
        fi
        cp "${out_file}" "${base}/${cache_suffix}"
      fi
    fi
  done
done
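
To make the naming scheme easier to follow, here is a minimal sketch that only prints the file names the script above would write for a given path and protocol, without fetching anything. The function name cache_names is hypothetical and exists purely for illustration; the naming logic is copied from the script.

```shell
#!/bin/bash

# Hypothetical helper: print the cache file name(s) gen_cache.sh would write
# for one path and one protocol. Nothing is downloaded.
cache_names() {
  local path="$1" protocol="$2" base suffix
  base="${path//\//}"              # strip every '/' ("/mobile/" -> "mobile")
  case "$protocol" in
    http)  suffix="_off_cached.html" ;;
    https) suffix="_on_cached.html" ;;
    *)     return 1 ;;             # unknown protocol: print nothing
  esac
  echo "${base}${suffix}"          # used for a request without the trailing '/'
  if [ -n "$base" ]; then
    echo "${base}/${suffix}"       # used for a request with the trailing '/'
  fi
}

for p in / /mobile/ /fb/; do
  cache_names "$p" http
done
```

For the three paths in the script this prints _off_cached.html, mobile_off_cached.html, mobile/_off_cached.html, fb_off_cached.html and fb/_off_cached.html. Because every ‘/’ is stripped, a nested path such as /a/b/ would collapse to ab, so the scheme as written only fits single-level paths.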

Step 2: Serve the cache files (to everyone but our curl script above):

Now that we have our cache files generated, all we have to do is serve them. We do, however, want our cache-generating script to always reach the original source, so we only apply this rule if the requesting user agent is not curl. If that’s the case, we check that the cache file exists, and if it does, we serve it instead of the actual, uncached resource. mod_rewrite’s %{HTTPS} variable expands to “on” or “off”, which is why the script names its files with _on and _off.

.htaccess

#-- Serve cached files to non-curl user agents
RewriteCond %{HTTP_USER_AGENT} !^curl [NC]
RewriteCond %{DOCUMENT_ROOT}/cache/%{REQUEST_URI}_%{HTTPS}_cached.html -f
RewriteRule ^(.*)$ cache/$1_%{HTTPS}_cached.html [L]
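
The two conditions can be checked outside of Apache with a rough simulation. The sketch below is not how mod_rewrite is implemented; it only mirrors the user agent test and the file existence test from the rules above. The function name would_serve_cache and the temporary document root are assumptions made for the example.

```shell
#!/bin/bash

# Hypothetical helper: succeed (exit 0) if the rules above would serve a
# cached file for this request, fail (exit 1) if the live page would be served.
would_serve_cache() {
  local docroot="$1" user_agent="$2" request_uri="$3" https="$4"  # https: "on" or "off"

  # RewriteCond %{HTTP_USER_AGENT} !^curl [NC]
  case "$user_agent" in
    [Cc][Uu][Rr][Ll]*) return 1 ;;   # curl always gets the live page
  esac

  # RewriteCond %{DOCUMENT_ROOT}/cache/%{REQUEST_URI}_%{HTTPS}_cached.html -f
  [ -f "${docroot}/cache/${request_uri}_${https}_cached.html" ]
}

# Build a throwaway document root holding the two cache files for /mobile/
docroot="$(mktemp -d)"
mkdir -p "$docroot/cache/mobile"
touch "$docroot/cache/mobile_off_cached.html" "$docroot/cache/mobile/_off_cached.html"

for ua in "Mozilla/5.0" "curl/7.88.1"; do
  for uri in /mobile /mobile/ /other; do
    if would_serve_cache "$docroot" "$ua" "$uri" off; then
      echo "$ua $uri -> cached"
    else
      echo "$ua $uri -> live"
    fi
  done
done
rm -rf "$docroot"
```

Only the browser-like user agent asking for /mobile or /mobile/ should be reported as cached; curl, and any path without a cache file, falls through to the live page.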

Boom! Add our bin/gen_cache.sh file to our crontab to run every minute and we’ve got a super cheap cache!
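
A crontab entry for this could look roughly like the line the sketch below prints. The project path /var/www/example.com is a placeholder, and the cron_line helper is hypothetical; the five ‘*’ fields are what tell cron to run the job every minute.

```shell
#!/bin/bash

# Placeholder location; adjust to wherever the project is deployed.
PROJECT_DIR="/var/www/example.com"

# Hypothetical helper: build the crontab line for a given project directory.
# Five '*' fields (minute, hour, day of month, month, day of week) = every minute.
cron_line() {
  echo "* * * * * $1/bin/gen_cache.sh >/dev/null 2>&1"
}

cron_line "$PROJECT_DIR"

# To install it, append the line to the current user's crontab:
#   ( crontab -l 2>/dev/null; cron_line "$PROJECT_DIR" ) | crontab -
```

The install command is left commented out so that running the sketch only prints the line and does not touch an existing crontab.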

Note: This is only for unauthenticated pages, and there are much better ways to cache resources, like Varnish, but for something that you can configure in minutes and that only requires curl, mod_rewrite, and cron, this is a decent start.