Root Cause Analysis of our December 14 Outage

We had a 3.5-hour outage of Zoho Mail, Zoho Support, and Zoho Books between 8:45 AM and 12:15 PM PST on December 14. First of all, I want to apologize to our customers and our users. We let you down and we are sorry. We know how important it is to have access to the vital business information you entrust to Zoho; our entire company runs on Zoho applications, so we understand this in an even more intimate way. With a view to preventing such incidents and improving our response to them when they do happen, we reviewed the root cause of this outage and our response to it. This post provides a summary.

The outage arose from a simple configuration gap between our software applications and the network. One part of the health-check mechanism built into our software (ironically, the very part that is designed to prevent an outage impacting customers) made an unneeded reverse DNS request to resolve an IP address to a name. The network stack did not adequately provide for this reverse DNS look-up; it worked until it stopped working. In effect, we had a single point of failure that our network ops team was unaware of, because the software had this implicit dependency.

When the reverse DNS lookup failed, the health-check mechanism (incorrectly) concluded that the software applications were failing, and proceeded to restart the “failing” application servers one by one, as it is programmed to do. Of course, even after the application servers restarted, the health check still failed, because the failure was not due to the software itself.
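The failure mode described above can be illustrated with a minimal Python sketch. It is an illustration only, not the actual health-check code: `naive_check`, `direct_check`, `probe`, and `reverse_lookup` are hypothetical names. It shows how an unneeded reverse DNS lookup inside a check can turn a resolver failure into a false "application down" verdict, even though the application itself would answer the probe.

```python
import socket
from typing import Callable

Probe = Callable[[str], bool]          # returns True if the app answers at this IP
ReverseLookup = Callable[[str], str]   # resolves an IP address to a host name


def system_reverse_lookup(ip: str) -> str:
    """Resolve an IP address to a name using the system resolver."""
    return socket.gethostbyaddr(ip)[0]


def naive_check(ip: str, probe: Probe,
                reverse_lookup: ReverseLookup = system_reverse_lookup) -> bool:
    """Hypothetical health check with a hidden dependency on reverse DNS."""
    try:
        reverse_lookup(ip)   # the name is never used; the lookup is unneeded
    except OSError:
        return False         # a resolver failure is misreported as an app failure
    return probe(ip)


def direct_check(ip: str, probe: Probe) -> bool:
    """Hypothetical health check that talks to the application by IP only."""
    return probe(ip)
```

With a probe that succeeds and a resolver that raises `socket.herror`, `naive_check` reports the server as unhealthy, while `direct_check` reports it as healthy.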

Since the failure was happening in a disparate set of applications that share no resources (no physical servers, no file systems or databases in common) other than being part of a sub-network, the initial suspicion was focused on the switches serving that sub-network. In reality there was a shared dependency but that was not immediately identified. This wasted precious time. In the end, the reverse DNS problem was identified, and the fix itself took just a few minutes.

Here are the lessons we learned on December 14:

1. Subtle information gaps can arise between teams that work together – in this case the network ops and the software framework teams. The software had a dependency on a network component that the ops team did not appreciate, which created an unintended single point of failure. The failure mode was always there, and on December 14 it came to the surface.

Action: We will make the configuration assumptions made by every piece of software much more explicit and disseminate them internally. Our monitoring tools will also be strengthened to check the actual configuration against the assumptions made by the software.
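One way to make such assumptions explicit is to record each one as a small, named check that monitoring can run on a schedule. The sketch below is illustrative only: `Assumption`, `audit`, and `reverse_dns_resolves` are hypothetical names, not a description of our internal tooling.

```python
import socket
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Assumption:
    owner: str                  # team whose software relies on this
    description: str            # human-readable statement of the assumption
    check: Callable[[], bool]   # returns True if the assumption currently holds


def reverse_dns_resolves(ip: str) -> bool:
    """True if the system resolver can map this IP address to a name."""
    try:
        socket.gethostbyaddr(ip)
        return True
    except OSError:
        return False


def audit(assumptions: List[Assumption]) -> List[Assumption]:
    """Run every check and return the assumptions that do not hold."""
    failed = []
    for assumption in assumptions:
        try:
            holds = assumption.check()
        except Exception:
            holds = False       # a check that crashes counts as a violation
        if not holds:
            failed.append(assumption)
    return failed
```

An entry such as `Assumption("framework", "reverse DNS resolves for app server IPs", lambda: reverse_dns_resolves(ip))` would then put the dependency in front of the ops team before it fails, rather than after.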

2. After the outage began, precious time was spent testing various hypotheses, though the root cause, as it turned out, was quite simple. This is the most stressful period, and some of that was inevitable, which is why prevention is so vital. We had 5 people checking out various aspects of the system, but they were not aware of this software dependency. If they had known, it would have taken a few minutes to fix it; instead, the outage lasted 3.5 hours.

Action: We are reviewing our incident response procedures to bring in people with relevant knowledge more quickly. We will also provide more training to our operations team members, so they can diagnose and troubleshoot a broader set of problems. Our monitoring tools will also be strengthened to provide more diagnostic information.

3. This is a more fundamental, mathematical problem in any feedback loop: adaptive, fail-safe mechanisms can have unforeseen or unintended behavior and ultimately cause failures themselves. Basically, a failure is declared and action is taken; if the diagnosis of the failure is wrong (often very subtly wrong), and the action taken is therefore not appropriate, those actions can then feed back into the fail-safe mechanism. We have humans in the loop for this precise reason, but in this case there was a single point of failure that the ops team on the spot was not aware of, so they could not stop the cascade.

Action: We are reviewing our fail-safe mechanisms to identify such cascades and to involve the humans in the loop more effectively.
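One common guard against this kind of cascade is a restart budget: automation may restart servers only a limited number of times within a time window, after which it stops and hands the decision to a person. The following is a minimal sketch under that assumption; `RestartGuard` and its parameters are hypothetical and do not describe our actual mechanism.

```python
from collections import deque
from typing import Deque


class RestartGuard:
    """Allow at most `max_restarts` automatic restarts per sliding time window."""

    def __init__(self, max_restarts: int, window_seconds: float) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._times: Deque[float] = deque()   # timestamps of allowed restarts

    def allow_restart(self, now: float) -> bool:
        """Return True if automation may restart; False means escalate to a human."""
        # Forget restarts that have fallen out of the sliding window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        if len(self._times) >= self.max_restarts:
            return False                      # budget spent: stop the cascade
        self._times.append(now)
        return True
```

The caller passes the current time (for example, `time.monotonic()`), which keeps the guard easy to test. When `allow_restart` returns False, the health check would page an operator instead of restarting yet another server, on the reasoning that repeated restarts that do not clear the failure point to a cause outside the application.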

To summarize, we believe the failure was preventable, and in any event, the outage should have been resolved a lot sooner. Once again, please accept our apologies. We have resolved to improve our tools and internal processes, so we can do better in the future.

Sridhar Vembu

Comments

8 Replies to Root Cause Analysis of our December 14 Outage

Bruce Gunter says:
January 22, 2012 at 6:31 AM
It wasn't just CRM that went down, it was mail as well. It effectively killed our business, AGAIN, for the second time in about a month's time. I hate to give up Linux and Zoho for Microsoft and Dynamics, but simply can't afford to support Zoho while they get their stuff sorted. BTW, I had asked for a refund of last month's fees; it fell on deaf ears with ZERO response. In no way compensates for actual losses we incurred; I just was hoping they could acknowledge the financial implications these issues cause. I also assume the Zoho moderators will scrub this post as well; it seems the ones I attempted last go-around never made public view. Corporate censorship at its best...
Kert Thielen says:
January 21, 2012 at 3:51 AM
Zoho CRM has been down for at least 8 hours today. I thought you were making changes so this would not happen. I heard through the rumor mill that your colo had a power outage. I thought you had multiple colos in different areas of the US and world????
Vijay says:
December 26, 2011 at 10:09 PM
Not many will inform the reasons behind the outage so transparently. Customers will appreciate this. Also, I think following can also be implemented, if not being done now.
1. Every change in the system should be reviewed by every team involved. Eg. A dev. team change should be approved by testing team, ops team, n/w team and production support team.
2. If a conference call is opened at the time of outage - bringing every team in the call, analysis could be done much more quicker.
3. Every stake holder of a system should have a copy of architecture diagram of the system. The architecture diagram should give detailed info about the server details, protocols implementation and how the components interact with each other.
Thanks for sharing Zoho's experience in the outage. It is a learning for me too. Thanks
sallya says:
December 20, 2011 at 1:25 PM
Thank you.
Nathan says:
December 19, 2011 at 9:57 PM
Good post - thank you. Everyone will have outages, it's how they're handled that set companies apart.
David Sánchez says:
December 19, 2011 at 7:18 PM
Thanks for the info and for your transparency.
Best,
David
Robert says:
December 19, 2011 at 6:43 PM
Thanks for your transparency!
lt says:
December 19, 2011 at 7:57 AM
I am glad that you posted this information online. It always seemed to me that you have too much DNS dependency in your WebNMS.
