Coding for Failure

2008-07-24 20:00:00 -0400


We all love mash-ups, right? Especially us developers, builders of fine web tools. When we build useful web applications I think we all tend to want to provide integration hooks to other services because our users will get more functionality (in many cases they get more bang for their buck, so to speak) and because it’s kinda cool! Nothing wrong with that, gives you something to get in touch with your users about, sometimes gets you a bit of a press, too.

But mash-ups aren’t all fun and games, they require some careful planning and hard work, even if your current system is well designed with low-coupling and a good MVC model. I saw this post by Hampton over at Unspace and got to thinking that I ought to do a little musing on coding for failure and discuss some of the techniques we’ve used in our services.

When you run a reminder service like PingMe, where your users trust you to deliver their messages without fail and on time, you have to step up your game when it comes to implementing a robust system. When you then integrate your app with an external service like Twitter to provide your users with a useful and cheap SMS/text messaging interface, you have to consider the reliability of that external service and code for failure.

Now, on some level there’s only so much failure you can prevent. Mail systems and domains can go dark, e-mail to sms gateways can blink out, there’s not much you can do about it beyond picking a good MTA and spending a solid amount of time configuring it properly. (We highly recommend Exim, which is the most flexible one out there with great documentation and a strong user/development community.)

The great thing about serious business mail servers like Exim is that they have been very good at handling failure, retrying, and eventually giving up for a very long time, and negotiate this process with other mail servers over a long-established protocols. So if we send a message to your_phone_number@vtext.com (Verizon Wireless’s email-to-sms gateway), and the vtext.com MTA is temporarily unavailable, Exim will try again. And again. And again. And then give up. And our PingMe messaging dispatchers never have to worry about this. The E-mail and SMS handlers simply turn the messages over to Exim on time and wash their hands of the matter.

While most of PingMe’s outbound messages are delivered via e-mail, a large portion go out over Twitter. Without beating a dead horse, and while acknowledging that their reliability has improved quite a bit, Twitter is not like our local MTA, it’s just not as reliable and as a remote HTTP service, not nearly as fast. On the other hand, once in a while our MTA might be down (perhaps I bork the config file and it doesn’t come back from a restart). More importantly, there is no mechanism in place for handling failure. When you send a message to the Twitter API, it either works out or you get a failure. And if you don’t handle that failure, you fail, too!

We handled this by implementing a retry system for our dispatchers. We caused a number of exceptions to bubble up in our test environment, everything from inability to connect to twitter to no network at all, and began catching the exceptions and wrapping them as DeliveryExceptions. If Twitter (or our MTA) is down, the message instance is delayed by a few minutes and marked for retry. We’ll retry numerous times before giving up (there comes a point at which a time-based message loses its relevance…).

Just a little peaking into our messaging code:

rescue DeliveryException => e
@log.error "Caught delivery exception, marking event for retry."
retry_event(event)
...
def retry_event(event)
event.status = Event::STATUS_RETRY
event.retry_count += 1 # up the retry count
event.retry_at = event.dt_when + (5.minutes * event.retry_count)
...
def lock_a_block(type_name)
before = (Time.now.utc).to_s(:db)

ActiveRecord::Base.connection.execute(
<<-END_OF_SQL
UPDATE events SET dispatcher = '#{@name}'
WHERE id IN (
SELECT e.id FROM
(( events e INNER JOIN targets t ON e.target_id = t.id )
INNER JOIN pings p ON e.ping_id = p.id)
INNER JOIN target_types tt ON t.target_type_id = tt.id
WHERE
tt.const = '#{type_name}'
AND
(
(e.dt_when < '#{before}' AND e.status = '#{Event::STATUS_PENDING}')
OR
(e.retry_at < '#{before}' AND e.status = '#{Event::STATUS_RETRY}')
)
...

The code actually gets quite a bit more complicated than that, and I don’t really want to go fully dissecting the polymorphic message handlers we’ve written, but it shows you how handling failure isn’t really an outlier problem, it becomes core to your system. It’s just as important as returning those nice model validation errors that Rails makes so convenient for you.

Another technique we use in PingMe is pipeline prevention. Well, that’s what I call it. But basically you can’t have one Twitter-bound ping holding up every other outbound ping at 5pm EST! We spent a lot of time implementing a system that allows for many concurrent dispatcher daemons, and all Twitter-bound pings go through only two of them, preventing the others from being affected by the high latency when connecting to Twitter. We ended up using the mutex pattern with Postgres:

  def acquire_mutex
ActiveRecord::Base.connection.execute(
<<-END_OF_SQL
LOCK mutex IN ACCESS EXCLUSIVE MODE;
END_OF_SQL
)
end

In our time-tracking app Tempo, we allow users to send time entries and start timers by sending messages to our Twitter account (twitter.com/keeptempo), and we have a daemon checking the API for new direct messages every couple of minutes.

Two things have to happen for that to work over direct messaging – both accounts have to be “following” each other. So the user follows us on Twitter, then enters their Twitter ID on their Tempo profile. Tempo does a quick check to make sure you’re following ‘keeptempo’, and then attempts to follow you. Either of those connections to the Twitter API can and often do fail.

So what do we do? We put together a rake task that generates a list of twitter ids on our user’s profiles that we aren’t following, and sends a follow request for each of them. We run it as a periodically and it catches quite a few. Not perfect, but just about the best we can do. It’s better than letting users walk away thinking that it doesn’t work at all! In that case you just look bad, and it’s not even your fault!

But it is your fault, actually, because you have to code for failure, or you look pretty bad when the exceptions bubble up to the surface, literally. Or, worse, you present the user with inaccurate information based on an exception state you didn’t plan for, which can really put you in a bad light.

I stay positive, but I code for failure ;-)


@rubyfringe - when programmers don't like it

2008-07-19 20:00:00 -0400


Hampton Catlin’s on, he’s talking about replacing Javascript, basically. Well, there’s more to it than that. But he says, “I find when programmers think an idea is a really bad idea and they can’t give you a good ****ing reason for it, then it’s a good idea!”

Guilty as charged.


@rubyfringe - Giles for the win

2008-07-19 20:00:00 -0400


Giles is up there doing the Archeopteryx Ruby MIDI generator talk. Best quote so far: “most people don’t use the freedom they have.”

I can’t describe to you how rapid, funny, inspiring and informative his talk is. Maybe I can: we’re all starving for dim sum lunch and are willingly sticking it out to the end of his talk. It’s that awesome.


@rubyfringe - Tiger in the cage

2008-07-18 20:00:00 -0400


Zed Shaw is up next. He is pacing. He told me he’s especially looking forward to angering the musicians in the audience. And he has a harmonica. Lord help us.


@rubyfringe - Owning what you do

2008-07-18 20:00:00 -0400


A major theme, concept, or understanding here at “Ruby Fringe”http://www.rubyfringe.com amongst folks seems to be owning what you do. Not that it’s expressed this way – but I think a lot of what we’re discussing here comes down to owning your work, owning how you spend your time, owning your mistakes and growing from there. From Fail Camp to discussion on sales and business relationships to networking and how you spend your time, to the slogan on the back of the conference tags, “Ask me about my startup!”, this is about owning what you do and consciously running with it.

Quite frankly, it’s exhilarating. This is not a normal programming conference.