It’s 2016, why does Presto still take 24 hours to update my balance?
For those not in Ontario, Presto is our province-wide transit farecard system, which supports online top-ups, but with the caveat of requiring 24 hours for your balance to be updated. For those in Ontario, you’re probably already aware of this limitation, and there’s a good chance that you’ve heard some variation of the above quote. Indeed, it is 2016, so why does it take 24 hours for an online top-up? Well, the answer boils down to one of the underpinning principles of distributed systems - the CAP Theorem.
First, I would like to clarify something. It is commonly believed that Presto always takes 24 hours to update your balance, but this is actually the maximum time it can take. In a lot of cases, especially if you are going to a train or subway station, the balance is actually updated much sooner. Metrolinx just cannot rely on all the readers in the system to have received an update any faster, so they cite the worst-case scenario.
The CAP in CAP Theorem is an initialism for Consistency, Availability, and Partition Tolerance. In order to understand this theorem, we must first define these three terms:
Consistency is the ability of a system to always provide an up-to-date value. In a completely consistent farecard system, you would be able to add balance to your account online and have it instantly reflected at all terminals.
Availability refers to the system’s ability to always serve requests. So, in a completely available farecard system, assuming the terminal itself has not failed, it will always be able to make a decision as to whether to allow you to board or not.
Partition Tolerance is often inaccurately thought to mean that the algorithms at work can elegantly handle network failure (aka network partitions), but it actually refers to the ability of the system to be partitioned in the first place. So, a farecard system with multiple terminals must be partition tolerant. Until someone invents a networking technology that cannot possibly experience failure, a non-partition tolerant system would have to consist of a single terminal, which, for a farecard system, is quite useless indeed.
Now that we’ve defined the “CAP”, let’s move on to the theorem. There are mathematical proofs that define exactly what the theorem entails, but for our purposes here, a more readable format suffices:
Consistency, Availability, Partition Tolerance - pick two
In other words, you have only three options for systems: consistent/partition tolerant, available/partition tolerant, and consistent/available.
It’s important to keep in mind that this only refers to complete partition tolerance, complete availability, and complete consistency - the theorem is not saying that it’s impossible to make an AP system more consistent, just that it will never be completely consistent without giving up complete availability or partition tolerance.
How this applies to Presto
As was mentioned earlier, in order to be useful, we must have partition tolerance in a practical farecard system. That means we can either choose to have a consistent/partition tolerant (CP) system or an available/partition tolerant (AP) system.
CP: Consistent/Partition tolerant
A consistent/partition tolerant system maintains its consistency by refusing to serve requests when a network partition occurs that impairs its ability to reach a system-wide consensus. While this does raise the stakes on network reliability, it can provide some compelling advantages as long as you make sure that the rate of network failures is quite low in practice.
For a system like Presto, this would mean that you’d have to have an extremely reliable cellular or satellite data connection throughout the province. While this certainly isn’t impossible, it does present a quite significant additional complication.
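To make the CP behaviour concrete, here is a minimal Python sketch. Everything in it (CentralLedger, CPTerminal, PartitionError, the 300-cent fare) is a hypothetical name or value I made up for illustration, not anything from Presto itself. The point is just that the terminal asks the authoritative ledger on every tap, and refuses to decide when it can't reach it.

```python
class PartitionError(Exception):
    """Raised when the terminal cannot reach the central ledger."""


class CentralLedger:
    """Hypothetical authoritative store of card balances (in cents)."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.reachable = True  # set to False to simulate a network partition

    def top_up(self, card_id, amount):
        self.balances[card_id] = self.balances.get(card_id, 0) + amount

    def charge(self, card_id, fare):
        if not self.reachable:
            raise PartitionError("cannot reach central ledger")
        balance = self.balances.get(card_id, 0)
        if balance < fare:
            return False
        self.balances[card_id] = balance - fare
        return True


class CPTerminal:
    """Terminal that refuses to decide unless it can consult the ledger."""

    def __init__(self, ledger, fare=300):
        self.ledger = ledger
        self.fare = fare

    def tap(self, card_id):
        # Every tap goes to the authoritative ledger, so the answer is
        # never stale; during a partition this raises instead of guessing.
        return self.ledger.charge(card_id, self.fare)


ledger = CentralLedger({"card-1": 0})
terminal = CPTerminal(ledger)

ledger.top_up("card-1", 1000)      # online top-up
print(terminal.tap("card-1"))      # True: the top-up is visible immediately

ledger.reachable = False           # the bus drives out of coverage
try:
    terminal.tap("card-1")
except PartitionError as exc:
    print("tap refused:", exc)
```

The top-up shows up instantly, but the moment the connection drops, nobody boards at all.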
AP: Available/Partition tolerant
An available/partition tolerant system forgoes consistency in favour of operating on stale data during network partitions. This keeps the system running through a partition, but means that updates may not be applied immediately. The unfortunate downside is that people may be allowed onto a bus when their card should be locked out, or prevented from boarding when they should be allowed on. As you may have guessed, this is how Presto currently operates.
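Here is a rough Python sketch of the AP approach. It is my own illustration rather than Presto's actual design: APTerminal, sync, and the 300-cent fare are all assumptions. The terminal always answers from a local snapshot of balances, and only exchanges data with the central ledger (modelled here as a plain dict) when sync is called.

```python
class APTerminal:
    """Terminal that always answers from its local, possibly stale, copy."""

    def __init__(self, fare=300):
        self.fare = fare
        self.local_balances = {}   # last snapshot received from the ledger
        self.pending_charges = []  # fares collected since the last sync

    def sync(self, ledger_balances):
        # Runs only when a connection exists (e.g. in the garage):
        # upload fares collected offline, then take a fresh snapshot.
        for card_id, fare in self.pending_charges:
            ledger_balances[card_id] = ledger_balances.get(card_id, 0) - fare
        self.pending_charges.clear()
        self.local_balances = dict(ledger_balances)

    def tap(self, card_id):
        # Decide using local data only, so a tap never needs the network.
        balance = self.local_balances.get(card_id, 0)
        if balance < self.fare:
            return False
        self.local_balances[card_id] = balance - self.fare
        self.pending_charges.append((card_id, self.fare))
        return True


ledger_balances = {"card-1": 0}     # stand-in for the central ledger
terminal = APTerminal()
terminal.sync(ledger_balances)      # the bus leaves the garage

ledger_balances["card-1"] += 1000   # online top-up lands on the ledger
print(terminal.tap("card-1"))       # False: terminal still sees the old balance

terminal.sync(ledger_balances)      # end-of-day sync
print(terminal.tap("card-1"))       # True
```

The terminal never refuses to answer, but until the next sync its answer can be wrong in either direction. With several terminals charging the same card between syncs, the ledger balance in this sketch could even go negative, which is exactly the kind of inconsistency an AP system has to tolerate.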
Making the choice
Choosing between AP and CP systems is an engineering and business decision with no clear winner. Both of these approaches have tradeoffs, so it’s a matter of deciding which set of tradeoffs is more palatable. In this case, the engineers and managers at Metrolinx decided that making sure the system does not require a hyper-reliable data connection at every terminal in order to operate was more important than issuing online top-ups instantaneously, which, in my opinion, was probably the right choice for a province-wide multi-agency farecard.
That said, Presto could operate faster. Currently, Presto’s “standard” operating procedure seems to be to issue updates to bus-mounted terminals only when they are in the garage at the end of the day. Thus, it’s almost always operating with a partition, hence the 24-hour delay. However, there is no theoretical limitation that prevents the terminals from also having a data connection that lets them receive top-ups whenever they’re within range of a cell tower. As to why they don’t do this, nobody outside Metrolinx can really say. Perhaps it’s simply too expensive relative to the number of people it would serve. Maybe they actually do this already, but the buses update so slowly in the field that it’s not apparent. The implementation details aren’t public, and neither are the usage numbers that would show whether adding the connection makes sense, so we’re unfortunately stuck with educated guesswork.