Site maps are one of the few things you can do to a website that don’t veer towards the “snake oil” side of SEO. Generally with these things, having direct support from the Search Engine vendors is a straight arrow. In the process of making our sites more accessible to customers, we noticed a void of these wonderful little files.
The sitemap specification is so simple it hurts. You have a list of URLs. They have a few properties. This isn’t particularly surprising; Google, the inventor of the specification, is pretty skilled at applying KISS philosophy when it comes to web technology.
The number of sitemap tools on rubygems.org didn’t surprise me either. After all, this is a very useful tool for people involved in the web, and like it or not, a great deal of the energy thrown at ruby is for web development.
That said, the ruby community loves its static site generators: they really like them. Needless to say, I was genuinely surprised when I came to the realization that not a single one of these sitemap systems supported them.
Enter Cartographer, which is a standalone system for generating sitemaps. In particular, it has three main functions:
- Import an existing sitemap into an internal structure to represent it
- Find, with optional filtering, all the items in a given path and add them to the sitemap.
- Generate the sitemap XML.
It brings along with it a few bonuses:
- Not attached to rails, rack, or anything other than Nokogiri, HAML, and Ruby.
- Easy to manipulate, no need to feel locked into a framework.
- Designed for static trees.
And a few disadvantages:
- Does not know about your routing, rewrite rules, or any other URL manipulation.
- Is not an “end to end”, pluggable tool. You need to invest some effort.
Cartographer makes heavy use of the ‘Find’ library that comes with Ruby, and as such leaks its functionality via the `add_tree` call. For example, we use staticmatic and do the sitemap generation in an `at_exit` hook (so it’s the last thing that runs, without getting into the nitty-gritty of ruby internals).
This Find leak is critical to the functionality of Cartographer and yields what I believe to be very effective and succinct methods for dynamically adjusting what will be automatically included in your sitemap. For example, from the STRIP Password Manager’s staticmatic generator:
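A sketch of what such a block looks like, written here as a plain lambda so it runs standalone; in Cartographer it would be the block passed to `add_tree`, and the exact rules below are assumptions reconstructed from the description that follows:

```ruby
# Hypothetical path filter: return nil to prune a tree, a string to add
# that URL to the sitemap. (The regexes are illustrative, not the real ones.)
filter = lambda do |path|
  case path
  when %r{/\.}, /~$/           # dotfiles and editor droppings: prune
    nil
  when %r{robots\.txt$}        # search-engine instructions: prune
    nil
  when %r{(.*/)index\.html$}   # prefer the parent path with trailing slash
    $1
  when /\.html$/               # ordinary pages pass through verbatim
    path
  else                         # all other (non-html) assets: prune
    nil
  end
end

p filter.call("/demo/index.html")  # => "/demo/"
p filter.call("/about.html")       # => "/about.html"
p filter.call("/logo.png")         # => nil
```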
You can see in the `add_tree` block we `case` over the path with several regexes applied. Returning `nil` will call `Find.prune`, which says, “Please do not look down this tree further.” Cartographer will also refuse to include that tree in the sitemap.
So, here, we actually prune all non-html assets, and a couple of files which unfortunately make it into our repos from time to time. We don’t want to index our Search Engine instructions either. Also, for `index.html` files we prefer the parent path with a trailing slash, e.g., `/demo/index.html` is rewritten to `/demo/`. If none of these criteria match, we simply return what we got, and it is added to the URL list verbatim.
The result looks something like this (trimmed for brevity):
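Per the sitemap protocol, the output has roughly this shape (URLs here are invented for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
  </url>
  <url>
    <loc>http://example.com/demo/</loc>
  </url>
</urlset>
```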
Documentation for Cartographer is here, and the code is on github. Please feel free to fork and add suggestions, patches, or issues!
Strip 1.5.0 has been released and is available now from the iTunes Store as a free update. We think this is the best release of STRIP yet. It contains myriad small fixes to enhance the user experience and correct application behavior and a few more impressive updates:
- Support for staying unlocked on multi-tasking devices
- Support for landscape-orientation for easier data entry
- Updated graphics for Retina displays
- Updated SQLCipher to version 1.1.8
The various other updates:
- In-app upgrade to unlimited (no more STRIP Lite)
- Current customers who bought STRIP before 1.5 are grandfathered
- Field behavior changes fixed to take effect immediately
- Database info screen now includes Ditto replica info
- Password Generator sets controls to last-use settings
- Last-use settings are stored encrypted in SQLCipher
- Fixed display of lower descenders in field values
- Field labels are no longer lower-cased on display
- Fixed default icon display for newly imported entries and categories
As you can see from the second listing above, we’ve made a change to how we allow people to try STRIP for free before making a purchase. Nothing has really changed for our existing customers, who are grandfathered to ensure the app is not limited in any way.
Strip Sync update required
This version of Strip, 1.5.0, is incompatible with earlier releases of Strip Sync. As of today, an update is available for both Strip Sync for Mac OS X and Strip Sync for Windows. If you already have Strip Sync installed, simply fire up the program and it should offer to update and relaunch. If not, or if you’d like to install the software directly yourself, you can download the updated version below:
As always, if you have any questions or run into any issues, please get in touch.
Databases and EBS: What you need to know.
Just a few things that you should know about EBS.
EBS is slow
All your data travels over a network before it reaches a disk, or before data from the disk reaches your instance. This means that writes and reads can be slow or intermittent at times.
Further compounding the issue, your SAN is shared with hundreds (thousands?) of other users! While these machines are some high-powered “big iron”, it still means you’re going to have I/O contention and a number of other issues.
Even further, your disk access is metered! This means all those operations are tickling tiny little counters. This isn’t a lot in reality, but it all adds up!
On the bright side, EC2 instances have a lot of RAM. Let’s play to the field!
PostgreSQL Configuration and Use
This might sound a little preachy and redundant, but here goes:
No database should go without being properly indexed, from head to toe, with everything you query upon and the vast majority of the combinations you use in your queries. Yes, write performance will suffer, but we’re about to render that much less troublesome by sending the writes to RAM as frequently as possible.
Less time spent searching tables = less disk access = greater performance.
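For instance, if your application routinely filters one table on the same pair of columns, that combination wants its own composite index. Table and column names here are purely illustrative:

```sql
-- Hypothetical schema: index the exact columns your WHERE clauses use.
CREATE INDEX idx_orders_customer_created
    ON orders (customer_id, created_at);

-- Queries of this shape can now avoid a sequential scan:
-- SELECT * FROM orders WHERE customer_id = 42 AND created_at > '2011-01-01';
```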
Build queries to be sent over the network
If you’re doing anything with an ORM, you’re probably guilty of this at least once or twice: building your queries to be sent to the app to be handled later. You know those kooky DBA types that say “do everything in the database”? Well, they’re on to something here.
Well, lo and behold, you do something like this:
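Something like this hypothetical sketch; the table is simulated with a plain array so it runs standalone, and all names are made up (in ActiveRecord terms, think `User.all` followed by Ruby-side filtering):

```ruby
# Stand-in for a database table; in reality these would be rows fetched
# from PostgreSQL over the network.
users = [
  { name: "alice", admin: true  },
  { name: "bob",   admin: false },
  { name: "carol", admin: true  },
]

# The anti-pattern: every row crosses the network, then the application
# does the filtering and counting itself.
admins = users.select { |u| u[:admin] }
puts admins.size  # three rows shipped over the wire to learn the answer is 2
```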
When something like this:
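Something along these lines, again simulated standalone; in real ActiveRecord this would be roughly `User.where(admin: true).count`, an assumption about your model names:

```ruby
# The database-side version: the store filters and counts, and only the
# answer crosses the network. PostgreSQL would execute roughly
#   SELECT COUNT(*) FROM users WHERE admin = TRUE;
def count_admins(table)
  # Imagine this block running inside PostgreSQL, not in your app.
  table.count { |row| row[:admin] }
end

users = [
  { name: "alice", admin: true  },
  { name: "bob",   admin: false },
  { name: "carol", admin: true  },
]
puts count_admins(users)  # prints 2: one integer over the wire, not three rows
```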
Would not only have saved you a lot of computational cycles, but also cut quite a bit of network traffic, and it continues to pay off as your tables grow in size. This happens a lot in the rails community, unfortunately.
(Yes, I’m aware this example is a bit contrived. You could easily prepare that query with find() or ARel’s composition methods.)
The skinny: the less you do in the database, the more you’re spending on network resources and time to deliver your result. The database is probably working N times as hard, too, to deliver your responses.
Even if it takes the “pretty” out of your code, do it in the database.
Shared Buffer Cache
Shared Buffer Cache is the meat and potatoes of PostgreSQL tuning. Increasing this value will greatly decrease the frequency at which your data is flushed to disk. An EC2 Large Instance will happily accommodate a 4GB PostgreSQL installation, which would be more than enough for lots of reasonably trafficked applications.
Why is this important? Spending less time writing to disk, or writing to disk less frequently, can mean a lot for your application’s performance!
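On an instance like that, the relevant postgresql.conf knobs might look something like this; the values are illustrative, not a recommendation, so tune against your own workload:

```
# postgresql.conf fragment: keep more of the working set in RAM
shared_buffers = 4GB            # the shared buffer cache discussed above
effective_cache_size = 6GB      # hint to the planner about OS cache, too
```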
Database backups on the cloud
We have a few options for backing things up. As usual with redundancy, the best option is to… be redundant. (See what I did there?) We use a strategy that gives us the best of both worlds.
You’ve already seen our snapshot script:
Which iterates over your volumes and maintains the last 5 backups. Here is a detailed account of the script’s function.
We use the script, amongst other things, to back up our database partitions, which are composed of the database master, the transaction log, and the backups of the WAL.
Write Ahead Logging and Continuous Archiving for Point in Time Recovery is a pretty sticky topic, and you would do yourself well to read that whole document.
Instead of repeating it here verbatim, I’ll tell you what our backup script does:
This script manages the archiving of three tarballs:
- base.tar.bz2, the base database system
- full-wal.tar.bz2, the whole WAL for the last day.
- pit-wal.tar.bz2, the point in time portion of the WAL.
The major difference between ‘full-wal’ and ‘pit-wal’ is that at the time the first backup is taken (the night of the backup), the data may not be fully committed to disk. Therefore, we write as much as we can to the ‘pit-wal’ file for the purposes of crashes that day. The ‘full-wal’, as you might suspect, is the fully written representation and is actually written out a day after the backup occurred.
In a recovery scenario, both of these tarballs would be merged with the existing WAL files in order: ‘pit-wal’ first, then ‘full-wal’ would be unpacked.
The WAL directory itself has some data hidden in the filenames; let’s check that out:
```
2011-02-01 09:05 000000030000000300000026
2011-02-01 09:05 000000030000000300000026.000076B8.backup
2011-02-01 10:12 000000030000000300000027
2011-02-01 11:30 000000030000000300000028
2011-02-01 12:57 000000030000000300000029
2011-02-01 14:10 00000003000000030000002A
2011-02-01 14:58 00000003000000030000002B
2011-02-01 15:30 00000003000000030000002C
```
The filenames themselves hold a few important pieces of information:
- The first 8 characters of the filename are the recovery version. As we’re good little children and test our backups, this is at version 3.
- The last 8 characters of the filename are ordered, you can see this by comparing the times and the filenames themselves.
- If there is an extension, that is a demarcation point where pg_start_backup()/pg_stop_backup() was invoked. This is what we use to create the ‘full-wal’ tarball.
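The naming scheme above can be pulled apart mechanically. A small Ruby sketch (the helper name is my own, not part of any tool mentioned here):

```ruby
# Split a WAL segment filename into the pieces described above: 8 hex
# digits of recovery version (timeline), 16 more identifying the ordered
# segment, and an optional .backup extension marking where
# pg_start_backup()/pg_stop_backup() was invoked.
def parse_wal_name(name)
  base, *ext = name.split(".")
  {
    timeline:      base[0, 8].to_i(16),
    segment:       base[8, 16],
    backup_marker: ext.last == "backup",
  }
end

p parse_wal_name("000000030000000300000026.000076B8.backup")
# {:timeline=>3, :segment=>"0000000300000026", :backup_marker=>true}
```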
As for the backup structure? Well, here’s a sneak peek:
```
2011-01-28 09:05 2011-01-27.09:00:01/
2011-01-29 09:05 2011-01-28.09:00:01/
2011-01-30 09:05 2011-01-29.09:00:01/
2011-01-31 09:05 2011-01-30.09:00:01/
2011-02-01 09:05 2011-01-31.09:00:01/
2011-02-01 09:07 2011-02-01.09:00:01/
```
The `$yesterday` calls just generate these filenames. At the end of the script, we see this idiom:
```shell
ls -1d * | sort -rn | tail -n +15 | xargs rm -vr
```
Which is a way of saying, “keep the newest dirs and delete the rest”. This keeps our filesystem size low, and we `rsync` these files nightly.
The `sed` usage here is a little tricky, but not anything incomprehensible. Basically,

```shell
breakpoint=`ls *.backup | sort -r | head -n1 | sed -e 's/\..*$//'`
```
Finds the latest backup file. Now,
```shell
arline=`ls | sort | sed -ne "/^$breakpoint$/ ="`
archive=`ls | sort | head -n $arline`
```
Uses that as a demarcation point to determine the archive files. Those files are archived and removed, and make up ‘full-wal’. The files left over make up ‘pit-wal’.
EBS — or the Amazon Elastic Block Store — is the way you get persistence on most EC2 instances. Let's talk about what EBS is good for, what it's not good for, and why it matters to the EC2 consumer.
EBS is basically a volume system with dynamic attachment. You go into the EC2 system, select an EBS "volume", and attach it to an instance. There are additional ways, such as EBS rooting, to use EBS volumes.
EBS is implemented at Amazon via a Storage Area Network (SAN) that is dynamically attached to your instances. Each EBS "volume" you attach is a portion of the disks that make up the SAN; a portion that can and will be allocated sparsely.
This has performance drawbacks. EBS can be very slow and unresponsive at points (there are no availability guarantees on EC2 for any of its products), so it's important that your EBS-related task can handle intermittent outages, even very small ones. For the most part, things that need read performance or will block on writes will suffer. There are ways you can mitigate this, such as "striping" volumes, but in practice this is very troublesome.
What's the difference between EBS root and the instance store?
EBS rooting is where the device that your root-level filesystem lives on is an EBS volume. This is different because instance stores are ephemeral, and will disappear after the machine is stopped. Therefore, it is wise to use EBS-rooted machines for machines you want to last.
What's the difference between an EBS volume and a traditional physical disk?
The major issue (other than the provisioning, of course) is that EBS volumes will not necessarily be available at boot time, so you must mitigate that for any non-root volumes.
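One common mitigation, sketched as an illustrative /etc/fstab entry (the device name and mount point are assumptions; `nofail` tells mount not to abort the boot if the device is absent):

```
# /etc/fstab: non-root EBS volume that may show up late
/dev/xvdf   /data   ext4   defaults,nofail   0   2
```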
How does Zetetic use EBS?
We use it for two roles:
- Database Servers (with a high ram setting)
- Support machines (monitoring, repositories, wiki, ticket tracker, etc)
We'll talk about the database management in the next article; our support machines are very simple in execution but require a lot of configuration, so automating them is a bit of a bear.
Why not run everything as EBS root?
You could do that, but EBS is billed on a per-transaction (writes and reads) basis, and then there is the performance issue. EBS-rooted volumes additionally have a reboot penalty which (at the time of writing) is a dollar. That can get expensive quickly! It's probably best to stick to EBS rooting where configuration management is hard and leave the rest to instance stores.
Hopefully this article has been a decent overview of EBS; next time we will cover PostgreSQL management in the cloud!