Over the last few months we’ve put a lot of work into PingMe’s scheduling system, because it was necessary if we wanted to expand the service and make it more reliable. Before I get into what we are doing differently, I’ll take a moment to describe the previous situation and our setup.
PingMe has a number of daemons: independent processes that are always running, scheduling pings, sending them out, and processing messages that you send to the service. These daemons are implemented in Ruby using Ruby on Rails. This allows them to be tightly coupled with the PingMe web application: the daemon processes and the webapp operate on the same model, which helps us keep the code pretty clean.
The ones of most concern are the dispatchers, the daemons whose job it was to check for new pings to deliver and then reschedule them for the next delivery (if necessary). Getting the concurrency right was rather tricky and involved some real nerding out in Postgres (our database engine of choice). Basically, the dispatchers had to do what’s called mutex locking in order to guarantee that different dispatchers would never try to send out the same message. The locking code is a neat trick, btw, and it’s still in use because it has served us well:
LOCK mutex IN ACCESS EXCLUSIVE MODE;
Different database engines have different facilities for this sort of thing, but basically doing this within a transaction caused the other dispatchers to wait until the lock was released at the end of that transaction. What were they waiting for? A chance to grab a block of pings to dispatch.
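To make the idea concrete, here is a minimal in-memory sketch of the same pattern. It is an analogy, not our dispatcher code: a Ruby Mutex stands in for the Postgres table lock, an array stands in for the table of pending pings, and the names (PENDING, BLOCK_SIZE, claim_block, run_dispatchers) are made up for illustration.

```ruby
# In-memory analogy only: the Mutex plays the role of the table lock,
# and the Array plays the role of the pending pings in the database.
PENDING = (1..20).to_a   # hypothetical ids of pings waiting for delivery
LOCK = Mutex.new
BLOCK_SIZE = 5           # hypothetical number of pings claimed per grab

# Claim the next block of pings while holding the lock, so no two
# dispatchers can be handed the same ping.
def claim_block
  LOCK.synchronize { PENDING.shift(BLOCK_SIZE) }
end

# Start `count` dispatcher threads and return the list of ping ids
# each one claimed.
def run_dispatchers(count)
  threads = Array.new(count) do
    Thread.new do
      claimed = []
      loop do
        batch = claim_block
        break if batch.empty?
        claimed.concat(batch)  # a real dispatcher would deliver here
      end
      claimed
    end
  end
  threads.map(&:value)
end

CLAIMS = run_dispatchers(3)
```

Whichever dispatcher gets the lock first takes its block and lets go; the others wait their turn, and every ping ends up claimed by exactly one dispatcher.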
Now, the scheduling and rescheduling of pings was honestly not very clean to begin with. We had callbacks on the Ping model that would create the actual instances of an outbound message for delivery (we called these Events), and then the dispatchers would need to block those callbacks in certain situations to cause a reschedule. It worked, and I don’t want to get into the details of it, but it had one particular problem:
Events are an instance of a Ping associated with a Target for delivery. The one dispatcher we were running would do its selection of events to deliver based on target types. Once we created the new Twitter target type and added a new dispatcher that only handled that one target type (this was all in our dev environment), the new daemon would conflict with the other dispatcher. Whichever daemon picked up a ping first for its target type and then marked the ping as done was basically preventing the other daemon from processing the ping for its other targets.
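The conflict is easier to see in a stripped-down model. This is a hypothetical sketch, not our actual Ping model: it assumes a ping carries a single done flag even though it has targets of more than one type, and the names (Ping, dispatch, simulate_conflict) are invented for the example.

```ruby
# Hypothetical, simplified model: one "done" flag per ping, even though
# the ping has targets of several types.
Ping = Struct.new(:id, :target_types, :done)

# Each dispatcher delivers only the target types it handles, but marks
# the whole ping as done once it has delivered its share.
def dispatch(ping, handled_types)
  return [] if ping.done
  delivered = ping.target_types & handled_types
  ping.done = true unless delivered.empty?
  delivered
end

# Run the Twitter dispatcher first, then the email dispatcher, against
# the same ping and report what each one managed to deliver.
def simulate_conflict
  ping = Ping.new(1, [:email, :twitter], false)
  twitter = dispatch(ping, [:twitter])  # gets there first
  email = dispatch(ping, [:email])      # finds the ping already done
  { twitter: twitter, email: email }
end
```

In this sketch the Twitter dispatcher delivers its target and the email dispatcher delivers nothing, because the ping was already marked done by the time it looked.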
The solution was to implement a new daemon, which we called Scheduler, and to move all the rescheduling code into this one serial process. Once we stripped all that out of the Ping model and the dispatcher code, we had a much leaner and faster system. We can now run as many dispatchers as our memory allows and configure each of them to handle various target types.
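Roughly, the split looks like the sketch below. It is a simplified illustration under some assumptions, not the real Scheduler: it assumes the scheduler creates one Event per target and that each Event tracks its own delivery state, and the names (Event, schedule, dispatch, demo) are hypothetical.

```ruby
# Hypothetical, simplified model of the scheduler/dispatcher split.
Event = Struct.new(:ping_id, :target_type, :delivered)

# The scheduler is the only place that turns pings into events, one
# event per target, so dispatchers share no per-ping "done" flag.
def schedule(pings)
  pings.flat_map do |ping_id, target_types|
    target_types.map { |type| Event.new(ping_id, type, false) }
  end
end

# A dispatcher touches only undelivered events for the target types it
# is configured to handle.
def dispatch(events, handled_types)
  mine = events.select { |e| !e.delivered && handled_types.include?(e.target_type) }
  mine.each { |e| e.delivered = true }  # a real dispatcher would send here
  mine.map { |e| [e.ping_id, e.target_type] }
end

# Schedule two pings, then let a Twitter dispatcher and an email
# dispatcher each take their own events.
def demo
  events = schedule({ 1 => [:email, :twitter], 2 => [:email] })
  twitter = dispatch(events, [:twitter])
  email = dispatch(events, [:email])
  { twitter: twitter, email: email, pending: events.reject(&:delivered).size }
end
```

Because each dispatcher only marks its own events, the order in which the dispatchers run no longer matters, and adding a dispatcher for a new target type does not affect the existing ones.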