Blog | Zetetic

Recently I met Alexis Rondeau, one of the two clever fellows who created the site Semapedia.org. The site allows you to create 2D barcodes, called Semapedia Tags, that link to information on Wikipedia. The idea is that you print a tag that links to information about a place or thing, then you stick that tag on the place or thing. Anyone with a 2D scanner program on their phone can then look up the information at the site when they see the tag. Pretty cool! (I’m still trying to get the program installed correctly on my Treo.)

Anyway, I saw this on their blog and I had to share:

so you are on a bus stop and there is a barcode to download the bus schedule. Great Idea, not. The poster with the barcode takes a whole side of the bus stop, why not just print the time table, how often does that change? Why pay for anything like that. If the service behind the barcode would tell you exactly in realtime where the bus currently is located or tell you if any of your friends are on that bus, then we have something a printed time table cannot provide and is clearly more attractive. Haven’t seen any of the other ideas, but for starters, detect needs, find out what current medias don’t provide and so on.

Indeed! Always go for the simplest solution.

On why not videomail

Warren Ellis wonders aloud why “videomail, in these broadband days of ours, has never made a bigger dent. Why I don’t get videomail in my inbox along with email.”

I’d say it’s because most of us see a camera pointed at us and we feel the need to act. It’s rather difficult for a lot of people to “be natural” when they are being filmed.

When most folks seem to be looking for ever faster, ever more seamless (and often literally asynchronous) communication, the last thing they want to do is “have to act” for a two minute videomail.

Earthlink Spam Blocker

Listen, we need to talk. It’s about your Earthlink spam blocker. The one that does this:

I apologize for this automatic reply to your email.

To control spam, I now allow incoming messages only from senders I have approved beforehand.

If you would like to be added to my list of approved senders, please fill out the short request form (see link below). Once I approve you, I will receive your original message in my inbox. You do not need to resend your message. I apologize for this one-time inconvenience.

Click the link below to fill out the request:

I realize that this has its uses and I’m sure it cuts down on the amount of spam you get.

But, there are better ways. Much, much better ways. And if you’ve been wondering why you never get your activation e-mails for things that you sign up for, that’s pretty much why. You wouldn’t believe how many of these responses we get every day for PingMe and Tempo. We always try to do you the favor and click the links when we have time, but lately there are so many (now requiring us to fill out a form), that we can’t really keep up.

Do everyone a favor, turn that thing off. Cut down on the amount of mail flying around the internet.

Coding for Failure


We all love mash-ups, right? Especially us developers, builders of fine web tools. When we build useful web applications I think we all tend to want to provide integration hooks to other services because our users will get more functionality (in many cases they get more bang for their buck, so to speak) and because it’s kinda cool! Nothing wrong with that: it gives you something to get in touch with your users about, and sometimes gets you a bit of press, too.

But mash-ups aren’t all fun and games. They require some careful planning and hard work, even if your current system is well designed with low coupling and a good MVC model. I saw this post by Hampton over at Unspace and got to thinking that I ought to do a little musing on coding for failure and discuss some of the techniques we’ve used in our services.

When you run a reminder service like PingMe, where your users trust you to deliver their messages without fail and on time, you have to step up your game when it comes to implementing a robust system. When you then integrate your app with an external service like Twitter to provide your users with a useful and cheap SMS/text messaging interface, you have to consider the reliability of that external service and code for failure.

Now, on some level there’s only so much failure you can prevent. Mail systems and domains can go dark and e-mail-to-SMS gateways can blink out, and there’s not much you can do about it beyond picking a good MTA and spending a solid amount of time configuring it properly. (We highly recommend Exim, which is the most flexible one out there, with great documentation and a strong user/development community.)

The great thing about serious business mail servers like Exim is that they have been very good at handling failure, retrying, and eventually giving up for a very long time, and they negotiate this process with other mail servers over long-established protocols. So if we send a message to your_phone_number@vtext.com (Verizon Wireless’s e-mail-to-SMS gateway), and the vtext.com MTA is temporarily unavailable, Exim will try again. And again. And again. And then give up. And our PingMe messaging dispatchers never have to worry about this. The e-mail and SMS handlers simply turn the messages over to Exim on time and wash their hands of the matter.

While most of PingMe’s outbound messages are delivered via e-mail, a large portion go out over Twitter. Without beating a dead horse, and while acknowledging that their reliability has improved quite a bit, Twitter is not like our local MTA: it’s just not as reliable and, as a remote HTTP service, not nearly as fast. (To be fair, once in a while our MTA might be down too, perhaps because I bork the config file and it doesn’t come back from a restart.) More importantly, there is no mechanism in place for handling failure the way the MTA does for mail. When you send a message to the Twitter API, it either works or you get a failure. And if you don’t handle that failure, you fail, too!

We handled this by implementing a retry system for our dispatchers. We caused a number of exceptions to bubble up in our test environment, everything from an inability to connect to Twitter to no network at all, and began catching the exceptions and wrapping them as DeliveryExceptions. If Twitter (or our MTA) is down, the message instance is delayed by a few minutes and marked for retry. We’ll retry numerous times before giving up (there comes a point at which a time-based message loses its relevance…).

Just a little peek into our messaging code:

rescue DeliveryException => e
  @log.error "Caught delivery exception, marking event for retry."
  retry_event(event)
...
def retry_event(event)
  event.status = Event::STATUS_RETRY
  event.retry_count += 1 # up the retry count
  # each retry lands 5 more minutes past the original due time
  event.retry_at = event.dt_when + (5.minutes * event.retry_count)
...
  def lock_a_block(type_name)
    before = (Time.now.utc).to_s(:db)
    
    ActiveRecord::Base.connection.execute(
    <<-END_OF_SQL
      UPDATE events SET dispatcher = '#{@name}'
      WHERE id IN (
        SELECT e.id FROM 
          (( events e INNER JOIN targets t ON e.target_id = t.id )
          INNER JOIN pings p ON e.ping_id = p.id)
          INNER JOIN target_types tt ON t.target_type_id = tt.id
        WHERE 
          tt.const = '#{type_name}'
          AND 
          (
            (e.dt_when < '#{before}' AND e.status = '#{Event::STATUS_PENDING}')
            OR
            (e.retry_at < '#{before}' AND e.status = '#{Event::STATUS_RETRY}')
          )
          ...

The code actually gets quite a bit more complicated than that, and I don’t really want to go fully dissecting the polymorphic message handlers we’ve written, but it shows you how handling failure isn’t really an outlier problem; it becomes core to your system. It’s just as important as returning those nice model validation errors that Rails makes so convenient for you.
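To make the wrap-and-retry idea concrete without the Rails and database plumbing, here is a minimal, self-contained sketch. It is not PingMe’s actual code: the `Event` struct, the `deliver` and `dispatch` helpers, the `:failed` status, and the `MAX_RETRIES` cutoff are all assumptions made for illustration. Only the `DeliveryException` name and the five-minute step come from the excerpt above.

```ruby
require "socket"
require "timeout"

# Wrapper exception, named after the one mentioned in the post.
class DeliveryException < StandardError; end

# Hypothetical stand-in for the ActiveRecord event row.
Event = Struct.new(:dt_when, :status, :retry_count, :retry_at)

MAX_RETRIES    = 5       # assumed cutoff; the post only says "numerous times"
RETRY_INTERVAL = 5 * 60  # five minutes in seconds, as in the excerpt above

# Low-level errors treated as "the remote side is unavailable".
NETWORK_ERRORS = [SocketError, Timeout::Error, Errno::ECONNREFUSED,
                  Errno::ECONNRESET, Errno::EHOSTUNREACH].freeze

# Run the block and re-raise any network-level error as DeliveryException.
def deliver
  yield
rescue *NETWORK_ERRORS => e
  raise DeliveryException, "#{e.class}: #{e.message}"
end

# Push the event further out on each attempt; give up past MAX_RETRIES.
def retry_event(event)
  event.retry_count += 1
  if event.retry_count > MAX_RETRIES
    event.status = :failed
  else
    event.status   = :retry
    event.retry_at = event.dt_when + (RETRY_INTERVAL * event.retry_count)
  end
  event
end

# Attempt delivery once; on a wrapped failure, schedule a retry instead.
def dispatch(event, &sender)
  deliver(&sender)
  event.status = :sent
  event
rescue DeliveryException
  retry_event(event)
end
```

Calling `dispatch(event) { raise Errno::ECONNREFUSED }` leaves the event marked `:retry` with `retry_at` five minutes past `dt_when`; a second failure moves it to ten minutes, and so on until the assumed cutoff is reached.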

Another technique we use in PingMe is pipeline prevention. Well, that’s what I call it. But basically you can’t have one Twitter-bound ping holding up every other outbound ping at 5pm EST! We spent a lot of time implementing a system that allows for many concurrent dispatcher daemons, and all Twitter-bound pings go through only two of them, preventing the others from being affected by the high latency when connecting to Twitter. We ended up using the mutex pattern with Postgres:

  def acquire_mutex
    ActiveRecord::Base.connection.execute(
    <<-END_OF_SQL
      LOCK mutex IN ACCESS EXCLUSIVE MODE;
    END_OF_SQL
    )
  end
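The isolation idea itself can be shown in-process. The sketch below swaps our daemons and the Postgres table lock for plain Ruby threads and queues, so it is an analogy rather than what PingMe runs: each target type gets its own queue and its own fixed set of workers, and a slow Twitter delivery can only ever tie up a Twitter worker. The `run_dispatchers` helper, the event hashes, and the `POOL_SIZES` numbers (apart from the two Twitter dispatchers) are assumptions.

```ruby
# Assumed pool sizes; the post only states that Twitter gets two dispatchers.
POOL_SIZES = { twitter: 2, email: 4 }.freeze

# Routes each event to the queue for its type, then drains every queue with
# its own workers. The block performs the actual delivery.
# Returns [event_id, worker_name] pairs in completion order.
def run_dispatchers(events, pool_sizes = POOL_SIZES)
  queues = pool_sizes.keys.to_h { |type| [type, Queue.new] }
  events.each { |event| queues.fetch(event[:type]) << event }
  queues.each_value(&:close) # a closed queue pops nil once it is drained

  delivered = Queue.new      # thread-safe collector for results
  workers = pool_sizes.flat_map do |type, size|
    Array.new(size) do |i|
      Thread.new do
        while (event = queues[type].pop)
          yield event
          delivered << [event[:id], "#{type}-#{i}"]
        end
      end
    end
  end
  workers.each(&:join)
  Array.new(delivered.size) { delivered.pop }
end
```

With a block such as `{ |e| sleep(2) if e[:type] == :twitter }`, the e-mail workers finish their queue without ever waiting on a Twitter call. As for the real thing: in Postgres a `LOCK` is only held until the enclosing transaction ends, so the `acquire_mutex` call above has to run inside the same transaction as the work it protects.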

In our time-tracking app Tempo, we allow users to send time entries and start timers by sending messages to our Twitter account (twitter.com/keeptempo), and we have a daemon checking the API for new direct messages every couple of minutes.

Two things have to happen for that to work over direct messaging – both accounts have to be “following” each other. So the user follows us on Twitter, then enters their Twitter ID on their Tempo profile. Tempo does a quick check to make sure they’re following ‘keeptempo’, and then attempts to follow them. Either of those connections to the Twitter API can, and often does, fail.

So what do we do? We put together a rake task that generates a list of Twitter IDs on our users’ profiles that we aren’t following, and sends a follow request for each of them. We run it periodically and it catches quite a few. Not perfect, but just about the best we can do. It’s better than letting users walk away thinking that it doesn’t work at all! In that case you just look bad, and it’s not even your fault!
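The core of a task like that is a small reconciliation loop. Here is a rough sketch, not the actual Tempo rake task: `follow_missing`, `FollowError`, and the `client` object with its `follow` method are hypothetical names standing in for whatever Twitter wrapper you use.

```ruby
# Hypothetical error a Twitter client wrapper might raise when the API is down.
class FollowError < StandardError; end

# profile_ids:   Twitter IDs users entered on their profiles (may contain nil)
# following_ids: IDs our account already follows
# client:        any object that responds to follow(id)
# Returns [followed, failed]; failed IDs are simply picked up on the next run.
def follow_missing(profile_ids, following_ids, client)
  missing = profile_ids.compact.uniq - following_ids
  followed = []
  failed = []
  missing.each do |id|
    begin
      client.follow(id)
      followed << id
    rescue FollowError
      failed << id # don't abort the batch because one request failed
    end
  end
  [followed, failed]
end
```

Because the task recomputes the missing list from scratch each time, it needs no bookkeeping of its own: anything that failed this run is still missing next run.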

But it is your fault, actually, because you have to code for failure, or you look pretty bad when the exceptions bubble up to the surface, literally. Or, worse, you present the user with inaccurate information based on an exception state you didn’t plan for, which can really put you in a bad light.

I stay positive, but I code for failure ;-)

@rubyfringe - when programmers don't like it

Hampton Catlin’s on, he’s talking about replacing Javascript, basically. Well, there’s more to it than that. But he says, “I find when programmers think an idea is a really bad idea and they can’t give you a good ****ing reason for it, then it’s a good idea!”

Guilty as charged.

2D Barcodes and Semapedia.org
