Site maps are one of the few things you can do to a website that don’t veer towards the “snake oil” side of SEO. Generally with these things, having direct support from the Search Engine vendors is a straight arrow. In the process of making our sites more accessible to customers, we noticed a void of these wonderful little files.
The sitemap specification is so simple it hurts. You have a list of URLs. They have a few properties. This isn’t particularly surprising; Google, the inventor of the specification, is pretty skilled at applying KISS philosophy when it comes to web technology.
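To make "a list of URLs with a few properties" concrete, here is a minimal sketch that renders such a list as sitemap XML using only plain Ruby. The element names (loc, lastmod, changefreq, priority) and the namespace come from the sitemap protocol; render_sitemap, xml_escape, and the example.com URLs are hypothetical names chosen for illustration and have nothing to do with Cartographer's API.

```ruby
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;", "'" => "&apos;"
}.freeze

# Escape the five characters that are special in XML text.
def xml_escape(text)
  text.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Render entries as sitemap XML. Each entry is a hash with a required :loc
# and optional :lastmod, :changefreq, and :priority keys.
def render_sitemap(entries)
  urls = entries.map do |entry|
    fields = %i[loc lastmod changefreq priority].filter_map do |key|
      value = entry[key]
      "    <#{key}>#{xml_escape(value)}</#{key}>" if value
    end
    "  <url>\n#{fields.join("\n")}\n  </url>"
  end

  <<~XML
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="#{SITEMAP_NS}">
    #{urls.join("\n")}
    </urlset>
  XML
end

puts render_sitemap([
  { loc: "https://example.com/", changefreq: "weekly", priority: 0.8 },
  { loc: "https://example.com/demo/" }
])
```

Only loc is required for each URL; the other properties are optional hints to the crawler, so the sketch simply skips any that are absent.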
The number of sitemap tools on rubygems.org didn’t surprise me either. After all, this is a very useful tool for people involved in the web, and like it or not, a great deal of the energy thrown at ruby is for web development.
That said, the ruby community loves its static site generators: they really like them. Needless to say, I was genuinely surprised when I came to the realization that not a single one of these sitemap systems supported them.
Enter Cartographer, which is a standalone system for generating sitemaps. In particular, it has three main functions:
It brings along with it a few bonuses:
And a few disadvantages:
Cartographer makes heavy use of the ‘Find’ library that comes with Ruby, and as such leaks its functionality via the add_tree call. For example, we use staticmatic and do the sitemap generation in an at_exit hook (so it’s the last thing that runs, without getting into the nitty-gritty of ruby internals.)
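As a rough illustration of the at_exit idea, the sketch below uses only Ruby's standard library rather than Cartographer's own calls. The html_paths helper and the site output directory are assumptions made for the example; a real build would hand the collected paths to the sitemap generator instead of printing them.

```ruby
require "find"

# Hypothetical helper: list every .html file under root as a
# site-relative path (for example "/demo/index.html").
def html_paths(root)
  paths = []
  Find.find(root) do |path|
    next unless File.file?(path) && File.extname(path) == ".html"
    paths << path.delete_prefix(root)
  end
  paths.sort
end

# Assumed output directory of the static site build.
SITE_ROOT = File.expand_path("site")

# at_exit handlers run when the interpreter shuts down, in reverse order
# of registration, so a hook registered before the generator is loaded
# runs after any hooks the generator registers itself.
at_exit do
  next unless Dir.exist?(SITE_ROOT) # nothing was built, nothing to map
  puts html_paths(SITE_ROOT)
end
```

Registering the hook at the top of the build script is what makes it "the last thing that runs": the pages have already been written by the time the tree is walked.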
This Find leak is critical to the functionality of Cartographer and yields what I believe to be very effective and succinct methods for dynamically adjusting what will be automatically included in your sitemap. For example, from the STRIP Password Manager’s staticmatic generator:
You can see in the add_tree block we case over the path with several regexes applied. Returning nil will call Find.prune which says, “Please do not look down this tree further”. Cartographer will also refuse to include that tree into the sitemap.
So, here, we actually prune all non-html assets, and a couple of files which unfortunately make it into our repos from time to time. We don’t want to index our Search Engine instructions either. Also, for index.html files we prefer the parent path with a trailing slash, e.g., /demo/index.html we rewrite to /demo/. If none of these criteria match, we simply return what we got and it is added to the URL list verbatim.
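The prune-or-rewrite behaviour described here can be sketched with Find alone. Everything named below is an assumption for illustration: collect_paths is a hypothetical stand-in for an add_tree-style walker (it is not Cartographer's implementation), the regexes in site_filter are guesses at the kinds of rules mentioned, and skipping directories in the output is a choice made for this sketch.

```ruby
require "find"

# Hypothetical walker: yields each site-relative path under root to the
# block. A nil result prunes that path with Find.prune; any other result
# is collected as the URL path (files only, in this sketch).
def collect_paths(root)
  urls = []
  Find.find(root) do |full_path|
    next if full_path == root
    result = yield(full_path.delete_prefix(root))
    Find.prune if result.nil? # skips this entry and anything beneath it
    urls << result if File.file?(full_path)
  end
  urls.sort
end

# Assumed filter rules, modelled on the prose above.
def site_filter(path)
  case path
  when /\.(css|js|png|jpe?g|gif)\z/ then nil  # non-HTML assets
  when /\.DS_Store\z/ then nil                # stray files in the repo
  when %r{\A/robots\.txt\z} then nil          # search engine instructions
  when %r{\A(.*/)index\.html\z} then $1       # /demo/index.html -> /demo/
  else path                                   # everything else, verbatim
  end
end

# Usage, assuming the built site lives in ./site:
if Dir.exist?("site")
  puts collect_paths(File.expand_path("site")) { |path| site_filter(path) }
end
```

Because Find.prune unwinds out of the block, the line that appends to urls is never reached for a pruned path, which is how a single nil return both stops the descent and keeps the entry out of the list.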
The result looks something like this (trimmed for brevity):