We had a 3.5-hour outage of Zoho Mail, Zoho Support & Zoho Books, between 8:45 AM and 12:15 PM PST on December 14. First of all, I want to apologize to our customers and our users. We let you down and we are sorry. We know how important it is to have access to the vital business information you entrust to Zoho; our entire company runs on Zoho applications, so we understand this in an even more intimate way. With a view to preventing such incidents and improving our response to them when they do happen, we reviewed the root cause of this outage and our response to it. This post provides a summary.
The outage arose from a simple configuration gap between our software applications and the network. One part of the health-check mechanism built into our software (ironically, the very part that is designed to prevent an outage from impacting customers) made an unneeded reverse DNS request to resolve an IP address to a host name. The network stack did not adequately provide for this reverse DNS lookup; it worked until it stopped working. In effect, we had a single point of failure that our network ops team was unaware of, because the software had this implicit dependency.
When the reverse DNS failed, the health-check mechanism (incorrectly) concluded that the software applications were failing, and proceeded to restart the “failing” application servers one by one, as it is programmed to do. Of course, even after application servers restart, the health-check would still fail, because the failure was not due to the software itself.
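The failure mode described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not Zoho's actual code; the function names, the supervision loop, and the restart hook are all assumptions made for the example:

```python
import socket

def health_check(ip_address):
    """Probe an application server. The reverse DNS lookup below is the
    implicit dependency: if reverse DNS is down, this check reports the
    application as failing even when the application itself is healthy."""
    try:
        # Unneeded step: resolve the IP back to a hostname (e.g. for logging).
        hostname, _, _ = socket.gethostbyaddr(ip_address)
    except OSError:
        # A reverse-DNS failure is misreported as an application failure.
        return False
    # ... the actual application-level probe would go here ...
    return True

def supervise(servers, restart, check=health_check):
    """Restart any server whose health check fails, one by one. If the
    check itself is what is broken, every server gets restarted in turn,
    and the restarts cannot fix anything."""
    for ip in servers:
        if not check(ip):
            restart(ip)  # a restart cannot repair a DNS-level failure
```

The key point the sketch makes is that the restart loop has no way to distinguish "the application is down" from "the check's own dependency is down", which is exactly why restarting the "failing" servers changed nothing.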
Since the failure was happening in a disparate set of applications that share no resources (no physical servers, no file systems or databases in common) other than being part of a sub-network, the initial suspicion focused on the switches serving that sub-network. In reality there was a shared dependency, but it was not immediately identified. This wasted precious time. In the end, the reverse DNS problem was identified, and the fix itself took just a few minutes.
Here are the lessons we learned on December 14:
1. Subtle information gaps can arise between teams that work together – in this case the network ops and the software framework teams. The software had a dependency on a network component that the ops team did not appreciate, which created an unintended single point of failure. The failure mode was always there, and on December 14 it came to the surface.
Action: We will make the configuration assumptions made by every piece of software much more explicit and disseminate them internally. Our monitoring tools will also be strengthened to check the actual configuration against the assumptions made by the software.
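One lightweight way to make such assumptions explicit is to declare them alongside the software and verify them continuously, so monitoring can alert before a health check misfires. The sketch below illustrates the idea under assumptions of our own (the probe names and the dictionary shape are invented for the example; this is not the tooling described above):

```python
import socket

# Each service declares the environment facts it silently relies on.
# The reverse-DNS probe mirrors the implicit dependency in this incident.
ASSUMPTIONS = {
    "reverse_dns_available": lambda: bool(socket.gethostbyaddr("127.0.0.1")),
    "forward_dns_available": lambda: bool(socket.gethostbyname("localhost")),
}

def verify_assumptions(assumptions):
    """Return the declared assumptions that no longer hold, so an alert
    can fire on the real root cause rather than on its symptoms."""
    broken = []
    for name, probe in assumptions.items():
        try:
            if not probe():
                broken.append(name)
        except OSError:
            broken.append(name)
    return broken
```

Run on a schedule, `verify_assumptions(ASSUMPTIONS)` would have flagged `reverse_dns_available` directly, instead of leaving five engineers to infer it from restarting application servers.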
2. After the outage began, precious time went into testing various hypotheses, even though the root cause, as it turned out, was quite simple. This is the most stressful period, and some of that was inevitable, which is why prevention is so vital. We had 5 people checking various aspects of the system, but they were not aware of this software dependency. Had they known, the fix would have taken a few minutes; instead, the outage lasted 3.5 hours.
Action: We are reviewing our incident response procedures to bring people with the relevant knowledge to the spot more quickly. We will also provide more training to our operations team members, so they can diagnose and troubleshoot a broader set of problems. Our monitoring tools will also be strengthened to provide more diagnostic information.
3. This is a more fundamental, mathematical problem in any feedback loop: adaptive, fail-safe mechanisms can have unforeseen or unintended behavior and ultimately cause failures themselves. A failure is declared and action is taken; if the diagnosis of failure is wrong (often very subtly wrong), the action taken is inappropriate, and those actions then feed back into the fail-safe mechanism. We have humans in the loop for precisely this reason, but in this case there was a single point of failure that the ops team on the spot was not aware of, so they could not stop the cascade.
Action: We are reviewing our fail-safe mechanisms to identify such cascades and to better involve the human in the loop.
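One common safeguard against this kind of cascade is a circuit breaker on the remediation itself: if several restarts in a row fail to clear the alarm, stop acting and page a human, on the theory that the diagnosis, not the servers, is what is broken. A minimal sketch, with thresholds and names that are our own illustrative assumptions:

```python
def remediate(servers, check, restart, page_human, max_failed_restarts=3):
    """Restart unhealthy servers, but stop and escalate to a human if
    restarts keep failing to help, instead of cascading through the fleet."""
    failed_restarts = 0
    for ip in servers:
        if check(ip):
            continue
        restart(ip)
        if check(ip):
            failed_restarts = 0  # the restart actually helped
        else:
            failed_restarts += 1
            if failed_restarts >= max_failed_restarts:
                page_human(f"{failed_restarts} consecutive restarts did not "
                           "clear the alarm; suspect the check, not the servers")
                return False  # halt the cascade pending human review
    return True
```

The design choice here is that the automation treats its own repeated ineffectiveness as evidence against its diagnosis, which is exactly the signal that was missing on December 14.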
To summarize, we believe the failure was preventable, and in any event the outage should have been resolved much sooner. Once again, please accept our apologies. We have resolved to improve our tools and internal processes so we can do better in the future.