Coding for Failure

2008-07-24 20:00:00 -0400


[Image: Keyboard Fail]

We all love mash-ups, right? Especially us developers, builders of fine web tools. When we build useful web applications, I think we all tend to want to provide integration hooks to other services, because our users get more functionality (in many cases, more bang for their buck, so to speak) and because it's kinda cool! Nothing wrong with that: it gives you something to get in touch with your users about, and sometimes it gets you a bit of press, too.

But mash-ups aren't all fun and games; they require some careful planning and hard work, even if your current system is well designed with low coupling and a good MVC model. I saw this post by Hampton over at Unspace and got to thinking that I ought to do a little musing on coding for failure and discuss some of the techniques we've used in our services.

When you run a reminder service like PingMe, where your users trust you to deliver their messages without fail and on time, you have to step up your game when it comes to implementing a robust system. When you then integrate your app with an external service like Twitter to provide your users with a useful and cheap SMS/text messaging interface, you have to consider the reliability of that external service and code for failure.

Now, on some level there's only so much failure you can prevent. Mail systems and domains can go dark, and e-mail-to-SMS gateways can blink out; there's not much you can do about that beyond picking a good MTA and spending a solid amount of time configuring it properly. (We highly recommend Exim, which is the most flexible MTA out there, with great documentation and a strong user/development community.)

The great thing about serious business mail servers like Exim is that they have been very good at handling failure, retrying, and eventually giving up for a very long time now, and they negotiate that process with other mail servers over long-established protocols. So if we send a message to your_phone_number@vtext.com (Verizon Wireless's email-to-SMS gateway) and the vtext.com MTA is temporarily unavailable, Exim will try again. And again. And again. And then give up. Our PingMe messaging dispatchers never have to worry about this: the e-mail and SMS handlers simply turn the messages over to Exim on time and wash their hands of the matter.
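
In practice the hand-off can be as simple as formatting a message and passing it to the local MTA. Here's a minimal sketch of that idea; the method name, sender address, and Net::SMTP usage are illustrative rather than our actual dispatcher code, and only the vtext.com gateway address comes from the example above.

# Minimal sketch: build the message and hand it to the local MTA, which owns
# retries and delivery from here on. The method name and sender address are
# illustrative, not PingMe's production dispatcher.
require 'net/smtp'

def hand_off_to_mta(phone_number, body)
  to   = "#{phone_number}@vtext.com"   # carrier's email-to-SMS gateway
  from = "pings@example.com"           # illustrative sender address

  message = "From: PingMe <#{from}>\n" \
            "To: <#{to}>\n" \
            "Subject: Ping!\n\n" \
            "#{body}\n"

  # Once Exim (listening on localhost) accepts the message, retries, backoff,
  # and eventually giving up are its problem, not ours.
  Net::SMTP.start('localhost', 25) do |smtp|
    smtp.send_message(message, from, to)
  end
end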

While most of PingMe's outbound messages are delivered via e-mail, a large portion go out over Twitter. Without beating a dead horse, and while acknowledging that their reliability has improved quite a bit: Twitter is not like our local MTA. It's just not as reliable, and as a remote HTTP service it's not nearly as fast. Then again, once in a while our MTA might be down too (perhaps I bork the config file and it doesn't come back from a restart). More importantly, there's no built-in mechanism for handling failure here. When you send a message to the Twitter API, it either works or you get a failure. And if you don't handle that failure, you fail, too!

We handled this by implementing a retry system for our dispatchers. We caused a number of exceptions to bubble up in our test environment (everything from an inability to connect to Twitter to no network at all) and began catching those exceptions and wrapping them as DeliveryExceptions. If Twitter (or our MTA) is down, the message instance is delayed by a few minutes and marked for retry. We'll retry numerous times before giving up (there comes a point at which a time-based message loses its relevance…).

Here's a little peek into our messaging code:

rescue DeliveryException => e
  # Twitter (or the MTA) is unreachable -- push the event back for a retry.
  @log.error "Caught delivery exception, marking event for retry."
  retry_event(event)
  ...

def retry_event(event)
  event.status = Event::STATUS_RETRY
  event.retry_count += 1 # up the retry count
  # each attempt pushes the retry time a little further out
  event.retry_at = event.dt_when + (5.minutes * event.retry_count)
  ...

def lock_a_block(type_name)
  before = Time.now.utc.to_s(:db)

  # Claim every due (or retry-due) event of this target type for this dispatcher.
  ActiveRecord::Base.connection.execute(
    <<-END_OF_SQL
      UPDATE events SET dispatcher = '#{@name}'
      WHERE id IN (
        SELECT e.id FROM
          (( events e INNER JOIN targets t ON e.target_id = t.id )
             INNER JOIN pings p ON e.ping_id = p.id)
             INNER JOIN target_types tt ON t.target_type_id = tt.id
        WHERE
          tt.const = '#{type_name}'
          AND
          (
            (e.dt_when < '#{before}' AND e.status = '#{Event::STATUS_PENDING}')
            OR
            (e.retry_at < '#{before}' AND e.status = '#{Event::STATUS_RETRY}')
          )
  ...
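
That snippet elides where the DeliveryExceptions come from in the first place. Roughly, the low-level network errors get caught at the point of the outbound call and re-raised as the one type the dispatcher knows how to retry. Here's a sketch of that idea; the error list and the send_via_twitter method are illustrative, not our actual handler code:

require 'socket'
require 'timeout'

# Sketch: wrap low-level failures into the one exception type the dispatcher
# retries on. The error list and send_via_twitter are illustrative; the real
# handlers are part of a larger polymorphic family.
class DeliveryException < StandardError; end

NETWORK_ERRORS = [Errno::ECONNREFUSED, Errno::ETIMEDOUT,
                  SocketError, Timeout::Error]

def send_via_twitter(event)
  # ... call the Twitter API here ...
rescue *NETWORK_ERRORS => e
  # Anything that smells like "the remote side is down or slow" becomes a
  # DeliveryException, so the dispatcher marks the event for retry.
  raise DeliveryException, "Twitter delivery failed: #{e.message}"
end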

The code actually gets quite a bit more complicated than that, and I don't really want to fully dissect the polymorphic message handlers we've written, but it shows you how handling failure isn't really an outlier problem; it becomes core to your system. It's just as important as returning those nice model validation errors that Rails makes so convenient for you.

Another technique we use in PingMe is pipeline prevention. Well, that’s what I call it. But basically you can’t have one Twitter-bound ping holding up every other outbound ping at 5pm EST! We spent a lot of time implementing a system that allows for many concurrent dispatcher daemons, and all Twitter-bound pings go through only two of them, preventing the others from being affected by the high latency when connecting to Twitter. We ended up using the mutex pattern with Postgres:

def acquire_mutex
  ActiveRecord::Base.connection.execute(
    <<-END_OF_SQL
      LOCK mutex IN ACCESS EXCLUSIVE MODE;
    END_OF_SQL
  )
end
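
The reason this works is that Postgres holds an ACCESS EXCLUSIVE lock until the surrounding transaction ends, so each dispatcher grabs the mutex, claims its block of events, and releases the lock by committing. Something like the sketch below, where the loop, the sleep interval, and the deliver_locked_events helper are illustrative rather than our actual daemon code:

# Sketch of how a dispatcher uses the mutex. The LOCK above is held until the
# transaction commits, so wrapping the claim in a transaction serializes the
# "grab a block of events" step across dispatcher daemons. The loop, interval,
# and deliver_locked_events helper are illustrative.
def dispatch_loop(type_name)
  loop do
    ActiveRecord::Base.transaction do
      acquire_mutex           # blocks until no other dispatcher holds the lock
      lock_a_block(type_name) # tag a batch of due events with our @name
    end                       # commit releases the mutex right away

    deliver_locked_events     # the slow part (Twitter, SMTP) happens outside
                              # the lock, so other dispatchers aren't stuck
                              # behind our latency
    sleep 30
  end
end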

In our time-tracking app Tempo, we allow users to send time entries and start timers by sending messages to our Twitter account (twitter.com/keeptempo), and we have a daemon checking the API for new direct messages every couple of minutes.

For that to work over direct messaging, two things have to happen: the user has to be following us, and we have to be following them. So the user follows us on Twitter, then enters their Twitter ID on their Tempo profile. Tempo does a quick check to make sure they're following 'keeptempo', and then attempts to follow them back. Either of those connections to the Twitter API can, and often does, fail.
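
In code, the profile-save path looks roughly like this. The twitter object and its follows?/follow methods are placeholders for whatever API wrapper you use, not a specific library's interface:

# Sketch of the mutual-follow handshake when a user saves their Twitter ID.
# The `twitter` client and its follows?/follow methods are placeholders,
# not a particular library's API.
def link_twitter_account(user, twitter_id)
  unless twitter.follows?(twitter_id, 'keeptempo')
    # They haven't followed us yet, so their direct messages won't reach us.
    user.errors.add(:twitter_id, "isn't following keeptempo yet")
    return false
  end

  begin
    twitter.follow(twitter_id) # follow them back so our DM replies go through
  rescue StandardError => e
    # This request can and often does fail; the periodic catch-up task
    # described below will pick this account up and try again.
    @log.warn "couldn't follow #{twitter_id}: #{e.message}"
  end

  user.update_attribute(:twitter_id, twitter_id)
end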

So what do we do? We put together a rake task that generates a list of the Twitter IDs on our users' profiles that we aren't following yet and sends a follow request for each of them. We run it periodically and it catches quite a few. Not perfect, but just about the best we can do, and it's better than letting users walk away thinking the feature doesn't work at all! In that case you just look bad, and it's not even your fault!
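
For the curious, the task is roughly this shape. It's a sketch: the twitter client, its following/follow methods, and the task name are placeholders, not the actual Tempo code:

# Sketch of the catch-up task: find Twitter IDs on user profiles that we
# aren't following yet and send a follow request for each. The `twitter`
# client and its following/follow methods are placeholders.
namespace :twitter do
  desc "Follow any Tempo users we aren't following yet"
  task :follow_stragglers => :environment do
    already_following = twitter.following # screen names we already follow

    User.find(:all, :conditions => "twitter_id IS NOT NULL").each do |user|
      next if already_following.include?(user.twitter_id)
      begin
        twitter.follow(user.twitter_id)
        puts "followed #{user.twitter_id}"
      rescue StandardError => e
        # One bad account shouldn't sink the whole run.
        puts "skipping #{user.twitter_id}: #{e.message}"
      end
    end
  end
end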

But it is your fault, actually, because you have to code for failure, or you look pretty bad when the exceptions bubble up to the surface, literally. Or, worse, you present the user with inaccurate information based on an exception state you didn’t plan for, which can really put you in a bad light.

I stay positive, but I code for failure ;-)

