Cartographer: Site Maps with No Strings Attached

2011-03-28 20:00:00 -0400


Site maps are one of the few things you can do to a website that don’t veer towards the “snake oil” side of SEO. Generally with these things, having direct support from the search engine vendors is a straight arrow. In the process of making our sites more accessible to customers, we noticed a void of these wonderful little files.

The sitemap specification is so simple it hurts. You have a list of URLs. They have a few properties. This isn’t particularly surprising; Google, the inventor of the specification, is pretty skilled at applying KISS philosophy when it comes to web technology.
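To make "a list of URLs with a few properties" concrete, here is a minimal sketch in plain Ruby that renders such a list as sitemap XML. It is not Cartographer's code or API; the `sitemap_xml` and `xml_escape` names and the hash-per-URL layout are assumptions made for illustration. The element names and namespace come from the sitemaps.org protocol.

```ruby
# Namespace required by the sitemaps.org protocol on the <urlset> element.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

XML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;",
  '"' => "&quot;", "'" => "&apos;"
}.freeze

# Entity-escape a value so it is safe inside an XML element.
def xml_escape(value)
  value.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Hypothetical helper: each entry is a Hash with a required :loc and
# optional :lastmod, :changefreq and :priority keys.
def sitemap_xml(entries)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>',
           %(<urlset xmlns="#{SITEMAP_NS}">)]
  entries.each do |entry|
    lines << "  <url>"
    %i[loc lastmod changefreq priority].each do |key|
      value = entry[key]
      next if value.nil? # optional properties are simply left out
      lines << "    <#{key}>#{xml_escape(value)}</#{key}>"
    end
    lines << "  </url>"
  end
  lines << "</urlset>"
  lines.join("\n") + "\n"
end
```

Calling `sitemap_xml([{ loc: "http://example.com/", changefreq: "weekly" }])` returns a complete document with one `<url>` entry. Everything beyond `loc` is optional, which is most of why the format stays so small.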

The number of sitemap tools on rubygems.org didn’t surprise me either. After all, this is a very useful tool for people involved in the web, and like it or not, a great deal of the energy thrown at Ruby is for web development.

That said, the Ruby community loves its static site generators: they really like them. Needless to say, I was genuinely surprised when I came to the realization that not a single one of these sitemap systems supported them.

Enter Cartographer, which is a standalone system for generating sitemaps. In particular, it has three main functions:

  • Import an existing sitemap into an internal structure to represent it
  • Find, with optional filtering, all the items in a given path and add them to the sitemap.
  • Generate the sitemap XML.

It brings along with it a few bonuses:

  • Not attached to Rails, Rack, or anything other than Nokogiri, HAML, and Ruby.
  • Easy to manipulate, no need to feel locked into a framework.
  • Designed for static trees.

And a few disadvantages:

  • Does not know about your routing, rewrite rules, or any other URL manipulation.
  • Is not an “end to end”, pluggable tool. You need to invest some effort.

Cartographer makes heavy use of the ‘Find’ library that comes with Ruby, and as such leaks its functionality via the add_tree call. For example, we use staticmatic and do the sitemap generation in an at_exit hook (so it’s the last thing that runs, without getting into the nitty-gritty of Ruby internals).

This Find leak is critical to the functionality of Cartographer and yields what I believe to be very effective and succinct methods for dynamically adjusting what will be automatically included in your sitemap. For example, from the STRIP Password Manager’s staticmatic generator:

You can see in the add_tree block we case over the path with several regexes applied. Returning nil will call Find.prune which says, “Please do not look down this tree further”. Cartographer will also refuse to include that tree into the sitemap.

So, here, we actually prune all non-html assets, and a couple of files which unfortunately make it into our repos from time to time. We don’t want to index our Search Engine instructions either. Also, for index.html files we prefer the parent path with a trailing slash, e.g., we rewrite /demo/index.html to /demo/. If none of these criteria match, we simply return what we got and it is added to the URL list verbatim.
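The same filtering idea can be sketched with nothing but Ruby's Find library. This is not Cartographer's add_tree implementation and not the STRIP site's generator code; `collect_paths`, `STATIC_SITE_FILTER`, and every pattern in the filter are hypothetical stand-ins chosen to mirror the rules described above (nil prunes, index.html collapses to its parent directory, anything else passes through verbatim).

```ruby
require "find"

# Hypothetical helper: walk +root+ and return the sorted list of paths to
# publish. The block receives each entry's path relative to +root+ (with a
# leading slash) and returns the path to publish, or nil to prune it.
def collect_paths(root)
  root = root.chomp("/")
  paths = []
  Find.find(root) do |file|
    next if file == root
    published = yield(file.delete_prefix(root))
    # Find.prune skips this entry and, for a directory, everything below it.
    Find.prune if published.nil?
    # Directories are walked but only regular files become sitemap entries.
    paths << published if File.file?(file)
  end
  paths.sort
end

# Assumed filter rules; the patterns here are illustrative only.
STATIC_SITE_FILTER = lambda do |path|
  case path
  when %r{/\.git\z}, %r{\A/robots\.txt\z}
    nil # stray repository data and search engine instructions
  when /\.(?:css|js|png|jpe?g|gif)\z/
    nil # non-html assets
  when %r{\A(.*/)index\.html\z}
    Regexp.last_match(1) # /demo/index.html becomes /demo/
  else
    path # everything else is kept verbatim
  end
end
```

A call such as `collect_paths("site", &STATIC_SITE_FILTER)` walks a directory named `site` (also an assumption) and returns only the page paths. Keeping the rules in a single case statement means adding or dropping a pattern is a one-line change.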

The result looks something like this (trimmed for brevity):

Documentation for Cartographer is here, and the code is on github. Please feel free to fork and add suggestions, patches, or issues!

