Our Friday Outage and Actions We Are Taking

On Friday, January 20th, we experienced a widespread outage that affected all Zoho services. The outage started around 8:13 am Pacific Time. Zoho services began coming back online for customer use at 3:49 pm, and all services were fully restored by 6:22 pm PST. We absolutely realize how important our services are to the businesses and users who rely on us; we let you down on Friday. Please accept our humblest apologies.

The cause of the outage was an abrupt power failure in our state-of-the-art colocation facility (owned and operated by Equinix) in the Silicon Valley area of California. Equinix provides us with physically secure space, highly redundant power, and cooling. We get our internet connectivity from separate service providers. We own, maintain, and operate the servers, the network equipment, and the software. The problem was not just that the power failure happened; the problem was that it happened abruptly, with no warning whatsoever, and all our equipment went down at once. Data centers, certainly this one, have triple and even quadruple redundancy in their power systems precisely to prevent such an abrupt power outage. The intent is that any power failure comes with sufficient warning so that equipment, databases most importantly, can be shut down gracefully. In fact, the main function such data centers perform is to provide extreme redundancy in power systems, cooling for the equipment, and physical security. There was absolutely no warning prior to this incident, which is what we have asked our vendor to explain, and we hope they will be transparent with us. I do want to say that Equinix has served us well; they are a leader in this field, and we have never suffered an abrupt power outage like this in 5+ years. But they do owe us, and the other customers in that data center, an explanation for what happened on Friday. They restored power quickly, but the damage was done because of the abruptness of the outage.

As of today, while we have a substantial level of redundancy in the system, we still rely on our data center provider to prevent an abrupt power outage (it happened once, so it could happen again), and we are scrambling to prevent another power outage from becoming a service outage of the duration we saw on Friday. These are literally the first steps we are taking: installing our own separate UPS systems (in addition to all the UPS systems, generators, and conditioned utility power that our vendor provides in the data center), and deploying database servers with built-in batteries so they can be shut down gracefully in an event like this.
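
To make the graceful-shutdown idea concrete, here is a minimal sketch of the kind of watchdog involved. This is an illustration only, not our production tooling: the UPS name, the grace period, and the use of Network UPS Tools (NUT) and systemd are all assumptions for the example.

```python
#!/usr/bin/env python3
"""Illustrative sketch: poll a UPS and shut databases down gracefully
when utility power is lost. Assumes NUT is installed and a UPS named
"rack-ups" is configured -- assumptions for the example, not our setup."""
import subprocess
import time

UPS_NAME = "rack-ups@localhost"   # hypothetical UPS identifier
POLL_SECONDS = 5
GRACE_SECONDS = 60                # tolerate this long on battery first

def on_battery() -> bool:
    # `upsc <ups> ups.status` prints e.g. "OL" (online) or "OB DISCHRG"
    # (on battery, discharging).
    out = subprocess.run(["upsc", UPS_NAME, "ups.status"],
                         capture_output=True, text=True, check=True)
    return "OB" in out.stdout.split()

def graceful_db_shutdown() -> None:
    # Stop the database service so it can flush buffers and close cleanly.
    subprocess.run(["systemctl", "stop", "postgresql"], check=True)

def main() -> None:
    battery_since = None
    while True:
        if on_battery():
            battery_since = battery_since or time.monotonic()
            if time.monotonic() - battery_since >= GRACE_SECONDS:
                graceful_db_shutdown()
                break
        else:
            battery_since = None   # power restored; reset the timer
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```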

Now let me acknowledge that it took us far too long to recover. Let me first explain why, and then explain what we are going to do about it in the future. In a nutshell, when every database cluster and every server went down at once, the sheer amount of testing and recovery work overwhelmed our human-in-the-loop recovery system. There was never any issue with the safety of the data itself.
We have a massively distributed system, and the design intent of such a system is that not everything fails at once, so parts of the system can and do fail without impacting overall service availability. The problem was that when the entire system went down, it required manual recovery. We had about 20 people working to restore services, but there are well over 100 clusters, of which about 40% had errors – basically, the redundant database servers within a cluster were out of sync with each other. The inconsistency across replicated instances is usually very slight – perhaps a few bytes off in a 100 GB instance – but the only thing that matters is that there is inconsistency, however slight. This is recoverable without any data loss (except for data entered at the exact moment the power went down). The recovery process is necessary to ensure that there is no data corruption and all data is consistent across the replicated instances. In most instances this was fast, but in some instances recovery took time, and the number of such slow-to-recover instances delayed the overall recovery. In fact, the first few clusters we tested were fine, and the recovery-time estimate we based on them proved too optimistic once later instances turned up problems. There were simply too many slow clusters for 20 people to recover in parallel. In effect, the human system was overwhelmed by the scale of the problem. That's why it took us so long to bring all services back up.
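
To show what "more automation" could look like here, below is a minimal sketch of fanning the consistency checks and resyncs out across worker threads, so recovery scales with machines rather than head count. The cluster list, failure rate, and check/repair bodies are simulated placeholders, not our actual recovery tooling.

```python
"""Illustrative sketch: check all clusters for replica consistency in
parallel, then resync only the inconsistent ones, also in parallel.
Everything below is simulated for the example."""
import random
import time
from concurrent.futures import ThreadPoolExecutor

CLUSTERS = [f"cluster-{i:03d}" for i in range(120)]   # "well over 100"

def check_cluster(name: str) -> bool:
    # Placeholder: the real check would compare checksums across the
    # cluster's replicas. Here roughly 40% come back inconsistent.
    time.sleep(0.01)                 # stand-in for I/O-bound verification
    return random.random() > 0.40    # True = replicas agree

def repair_cluster(name: str) -> None:
    # Placeholder: resync lagging replicas from a known-consistent copy.
    time.sleep(0.05)                 # stand-in for the slow resync path

def recover_all(clusters: list[str], workers: int = 32) -> list[str]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        status = dict(zip(clusters, pool.map(check_cluster, clusters)))
        bad = [name for name, ok in status.items() if not ok]
        list(pool.map(repair_cluster, bad))   # repairs run in parallel too
    return bad

if __name__ == "__main__":
    resynced = recover_all(CLUSTERS)
    print(f"resynced {len(resynced)} of {len(CLUSTERS)} clusters")
```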

We do have all data mirrored in a data center in the New York region (also owned and operated by Equinix), and that data center was not affected by the power outage. All the data was present in that secondary data center, so there was never any possibility of data loss, even if all our primary servers had been wiped out completely. But as of today we do not have sufficient capacity to run all Zoho services from that secondary data center. We keep 3 copies of your data in the primary data center, and usually 1, sometimes 2, copies in the secondary. That means we do not currently have: a) sufficient data redundancy in the secondary data center by itself to run all the services – i.e., assuming the primary data center is totally dead, or b) sufficient computing capacity in the secondary to process all the traffic by itself. Our secondary data center protects customer data, but it could not serve all the traffic. We intend to address this issue ASAP, starting with some of our services first.
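
As a toy model of that gap, suppose a data center can take over only if it holds enough copies of the data and enough compute to carry full traffic. The replica counts below mirror this post; the compute figures and the 3-copy floor are made-up numbers for illustration, not real measurements.

```python
"""Toy model of the failover gap: the secondary falls short on both
data redundancy and compute. Numbers are illustrative, not measured."""
from dataclasses import dataclass

@dataclass
class DataCenter:
    name: str
    data_copies: int      # replicas of each dataset held at this site
    compute_share: float  # fraction of total traffic it can serve alone

MIN_COPIES_TO_SERVE = 3   # assumed redundancy floor for running live

primary = DataCenter("Silicon Valley", data_copies=3, compute_share=1.0)
secondary = DataCenter("New York", data_copies=1, compute_share=0.4)

def can_fail_over_to(dc: DataCenter) -> bool:
    # Full failover needs both the redundancy floor and full compute.
    return dc.data_copies >= MIN_COPIES_TO_SERVE and dc.compute_share >= 1.0

for dc in (primary, secondary):
    print(f"{dc.name}: can serve everything alone -> {can_fail_over_to(dc)}")
```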

Our first focus is on preventing an outage like this from happening again; the second is faster recovery when disaster strikes. We have been working on this second problem for a while already, and we will accelerate that work. Additional steps we are taking include: a) offering better offline data access, so customers never have to go without their mission-critical business information; b) offering read-only access to data on the web quickly, so access is preserved while we work to recover the writable instance; and c) more automation, so recovery from a large-scale incident needs less manual intervention.
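
To illustrate item (b), here is a minimal sketch of a read-only switch at the web tier: a WSGI middleware that lets reads through and turns writes away while recovery is underway. The flag-file path and the demo app are assumptions for the example, not a description of our services.

```python
"""Illustrative sketch: a read-only mode for a web service. While a
flag file exists, reads succeed and writes get a 503 with Retry-After.
The flag path and demo app are hypothetical."""
import os

READ_ONLY_FLAG = "/var/run/service.readonly"   # hypothetical flag file
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

class ReadOnlyMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        in_read_only = os.path.exists(READ_ONLY_FLAG)
        if in_read_only and environ["REQUEST_METHOD"] not in SAFE_METHODS:
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain"),
                            ("Retry-After", "3600")])
            return [b"Service is temporarily read-only during recovery.\n"]
        return self.app(environ, start_response)

def demo_app(environ, start_response):
    # Stand-in for the real application behind the middleware.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"your data, readable even during recovery\n"]

application = ReadOnlyMiddleware(demo_app)
```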

During this entire episode, our first priority was to make sure customer data remained safe. No customer data was lost, but because incoming mail server queues overflowed (the mail store went down), some mail bounced back. We are working to prevent this from happening again by adding a separate mail store instance.
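
The idea, sketched minimally below, is that when the primary mail store is down or its queue is full, incoming messages are spooled to a separate store for later delivery rather than bounced. The store classes and capacities are simulated placeholders, not our actual mail architecture.

```python
"""Illustrative sketch: spill incoming mail to a separate store instead
of bouncing when the primary is down or full. All simulated."""
from collections import deque

class MailStore:
    def __init__(self, name: str, capacity: int, up: bool = True):
        self.name, self.capacity, self.up = name, capacity, up
        self.queue: deque[str] = deque()

    def accept(self, msg: str) -> bool:
        # Accept only if the store is up and its queue has room.
        if self.up and len(self.queue) < self.capacity:
            self.queue.append(msg)
            return True
        return False

def deliver(msg: str, primary: MailStore, fallback: MailStore) -> str:
    if primary.accept(msg):
        return "queued on primary"
    if fallback.accept(msg):
        return "spooled on fallback"   # held for later, not bounced
    return "bounced"                   # last resort only

primary = MailStore("mailstore-1", capacity=2, up=False)   # simulating Friday
fallback = MailStore("mailstore-spare", capacity=10_000)

print(deliver("incoming message", primary, fallback))   # spooled on fallback
```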

We will keep you regularly updated on the progress we are making on each of these priorities. The hardware work will move the fastest (buying and installing new systems), and the software work will move the slowest (implementing better automation for faster recovery takes time), but we will keep you posted on all the progress we make. That is our promise.

This was, by far, the biggest outage Zoho has ever faced. We absolutely understand that many businesses rely on Zoho to go about their daily work, and we understand how disappointed and frustrated many customers were by this outage. We, too, are extremely upset about this incident.

In the coming days we will be refunding a week's worth of your subscription to each and every customer, whether you complained or not. We know money will not give you back the time you lost, or compensate you for the hassle and trouble, but we hope you will accept it with our deepest apologies. While the amount may not mean much to any single customer, at an aggregate level it does affect us, and that penalty will serve as a reminder to ourselves never to let this happen again. That is ultimately the best assurance I can give.

Sridhar 

Comments

  1. Mike Mosman makes the most pertinent point in this explanation. Yes, a redundancy review would be in order. Mike's reply was:
    The two phrases “highly redundant power” and “all our equipment went down at once” are mutually exclusive. It indicates the power system(s) delivering power to the servers was/were subject to a single-point-of-failure (SPoF). A highly redundant power supply would be SPoF free. A redundancy review is in order here.

  2. Hi, is the $99 subscription fee for housing my documents on www.zoho.com? What if I don't need to set up projects and tasks on Zoho Projects, but just want to house documents under Documents? Is there a fee for that? Thanks, Karen Fox

  3. Hi there,
    I have a suggestion. Actually, you can never guarantee a data center never goes down completely; to me, assuming otherwise is just wrong. And since you already have a second data center, maybe Zoho should consider how to achieve the goal that when the primary data center is dead, the second data center takes over and handles all the traffic. I understand the second data center is not ready for this task as of now, but maybe Zoho should consider preparing it for the future. To take an extreme example: no matter how much redundancy you put in your primary data center, if an attack like 9/11 happens to it, then the only hope is the second data center. My company has multiple data centers, and our DR plan simply assumes that one of the data centers is COMPLETELY dead, as Zoho just suffered last Friday. So think about it. You need data center diversity to achieve higher availability. I like Zoho products, so I hope it will get better. Thank you.

  4. I found Zoho more reliable than our 5-man IT team. I have lost more data in-house using Oracle, due to dumb mistakes, than I ever have using Zoho. I have never lost one record, let alone had a corrupted record. Keep up the good work. And I agree with one commenter that you should consider a different company for backups. There are many things that can happen within one company, intentionally or accidentally.

  5. I am appreciative of your taking responsibility for the outage and your efforts to remedy the exposure in the future. In my background was the responsibility of providing redundant processing to a major bank, albeit some years ago. I would suggest that you ask your redundancy design team if they would like to fly in an airplane they designed; it should make the point. While your outages are not life and death, the very question will lead you to evaluate circumstances that may be overlooked. The real damage is not to your customers and their loss of the use of their data and systems, but to your company's credibility, which will be much harder to repair than the redundancy issues. Your products are of reasonable function and quality and I wish you all success in a positive future.

  6. Justin, in this case, using the same provider poses no risk, as Equinix is not a network provider and each data center operates independently. What is more disturbing is that the UPS systems within the data center obviously didn't work. Equinix, you got some 'splainin' to do!

  7. Thanks for the reply. I received a mail with the blog entry only today (2/9/12), apparently due to my call yesterday to your help line (my laptop cache was causing it to appear that your servers were down again yesterday). During the outage I wrote an email which bounced. Like the others, I appreciate the detailed explanation, but from my perspective it came a bit late. If I hadn't called yesterday, I suppose I would have never known what happened. I was beginning to wonder if I should switch providers. Anyway, all is forgiven.

  8. Thanks for a good explanation. I hope you'll relay Equinix's explanation at some point. From our viewpoint, read-only access to our CRM data would have considerably lessened the disruption to our business, so I am glad you are planning to implement that. Thanks for your good service. I continue to recommend Zoho.

  9. Hi Sridhar. It's great that you guys are looking into improving the system overall. However, we remain concerned that Zoho has a really poorly-driven backup management strategy (on the client-facing side), and tries to monetize a feature that is the most basic of sysadmin 101. After the outage, we suggest you revisit this less-than-ideal model. Customers don't mind paying for advanced features, etc., but backups? There are certain things in brand and product development that require simple common sense, not formulas to make money on basic infrastructure. We are sure that if you review how much money you are making on backups you will find the amount ludicrous. Perhaps this is a legacy vision from the early days, but to be a prominent, enterprise-ready cloud player, Zoho needs to move forward now. The industry has lots of snapshot solutions for databases, even at the field level. Zoho needs to provide this kind of ingrained usability as a standard feature of paid editions. Clients should be able to self-serve and restore modules, records, even fields. We think that you guys are doing really important work, but there are certain things that require a dynamic approach to get solutions faster and more in touch with your client base. Some of the basic requests from users have been around for years; just read the forums. In general we think that Zoho is a great company with great products and a good vision. It just needs minor refinements. Thank you for the refund, a small amount of great value.

  10. Zoho Projects is excellent. Sounds like it will continue to get better. We did not lose any data though the outage was inconvenient. Thank you for this serious response on a serious issue.

  11. Yes, the explanation is transparent now. At the time – when it was critical – it was not. I found Twitter on my own to keep updated. I have yet to receive word directly from ZOHO about any of this; I have made the effort to go out and find the info on blogs like this. That's inappropriate. Also inappropriate is to think that one week's complimentary service makes up for the extensive downtime and lack of productivity in our business during the outage. As the gentleman from Australia noted, "the cost of wages is the key issue and the one week's refund is very negligible." All the best moving forward.

  12. It is so seldom that companies give a detailed explanation such as the one we have been graced with by Zoho. I am no IT expert, but their explanation was thorough yet simple enough for me to understand. An absolutely big 'thank you' to all of the Zoho team for taking the human approach as opposed to the corporate approach (which would have released a vague one-liner rather than a detailed explanation).

  13. Thank you for the detailed explanation and for taking responsibility for correcting the situation. I also thank you for the return of the week's charge; however, I agree with the comment above that customers should be charged a lesser fee for a backup copy of their data given this recent outage. I also very much like your ideas to have offline services and read-only access during these times. Lastly, please keep us customers updated on the mail situation – we were actually thinking about switching our mail over to Zoho, and now I'm a little wary.

  14. Hi. At least you were honest enough to admit and be open about the faults, rather than giving some vague spin (making up stories) as other companies would have done. I respect Zoho for this.
    1) Agree on the power redundancy review. (I'm guessing the backup batteries were not maintained as regularly as they should have been, and were possibly underspecified; once AC power tripped and the backup system (UPS/DC-to-AC inverter) kicked in, the DC supply may not have had enough stamina and shut down as well for this reason.)
    A periodic simulated fault is a good way to test your system out in real life under a fault condition.
    2) Just as the iPhone can download all data to access CRM records, please make a Windows app so that users can store data offline as well. Even if all components of Zoho are not back online, at least simple client contact details, email addresses, and basic info can be accessed if an offline app is available, so our staff don't suffer downtime for an extended period. In Australia, the cost of wages is the key issue and the one week's refund is very negligible.
    best regards,
    Jason

  15. It is a terrible idea to have both your primary and backup systems serviced by the same vendor. The goal should be to minimize risk; aggregating services does not accomplish that.

  16. First, I would like to thank the team for letting us know about the progress on Twitter. Thanks also for bringing back the services with no data loss; this is critical to demonstrate that our data is safe in your hands. For an SME with previous painful experience with in-house servers, this is vital. I would now like to offer some suggestions:
    1) During contingency procedures, a key element is communication. I am not sure all your customers have a Twitter account to follow progress on an event like this, which is why you should communicate on other media (like your website). I am not sure you need to go so far as to send a text message to administrators' cell phones, but it would be a big plus.
    2) At your second site (New York), it would make sense to have some services running as well in case of failure. For example, we use Zoho Mail and it is absolutely necessary to have it working; we could not notify our customers (who are not using Zoho Mail) about the Zoho failure without using the phone. We always learn from failures and incidents. I hope this will help you improve your resiliency procedures.

  17. Though I never like things to go wrong, Zoho has been more reliable than my own network ever was. I am grateful for the explanation and even more confident in Zoho than I was before. THANKS!!

  18. The two phrases "highly redundant power" and "all our equipment went down at once" are mutually exclusive. It indicates the power system(s) delivering power to the servers was/were subject to a single-point-of-failure (SPoF). A highly redundant power supply would be SPoF free. A redundancy review is in order here.

  19. Also I would try not to send out emails notifying customers of an increase in pricing DURING future outages (like the one I received on Friday) - probably not the best timing...

  20. Agree wholeheartedly with Joe. I think a better option would be offline access for some services like CRM, but barring that, having some no-cost (or at least lower-cost) way to take regular backups is essential. This is needed not just to guard against failures at Zoho's end, but also against failures in infrastructure at the customer's end. For instance, just a couple of weeks before this outage our internet circuit went down and it took over 24 hours (almost 2 full working days) for us to get it back. Obviously this was no fault of Zoho's, but the two events back to back are causing us to rethink the viability of a 100% cloud-based model.

  21. Thank you for the detailed explanation. We're playing catch-up this morning, so I don't have time to go into how the outage affected our business, but I do have one major suggestion for Zoho: please do not charge your customers the $10 fee associated with making a backup copy of our own database. Our company used Salesforce in the past, and we had the ability to schedule weekly backups of our CRM data free of charge. With your offering, we'd have to pay an additional $40/month to accomplish the same thing. Had we had a backup of our database, we could at least have used the raw data to look up the important records we needed last Friday.

  22. I'm sure there are lots of lessons to be learned. I would like to say a word of thanks to your team; the stress must have been unimaginable. I suspect they have aged a few years in a few hours. All my customers have their data and systems back in action. Ironically, the recovery process has confirmed that the cloud is a very safe place to store data.
    Daragh
