On Friday, January 20th, we experienced a widespread outage that affected all Zoho services. The outage started around 8:13 am PST. Zoho services started coming back online for customer use at 3:49 pm, and all services were fully restored at 6:22 pm PST. We absolutely realize how important our services are for the businesses and users who rely on us; we let you down on Friday. Please accept our humblest apologies.
The cause of the outage was an abrupt power failure in our state-of-the-art colocated data center facility (owned and operated by Equinix) in the Silicon Valley area of California. Equinix provides us with physically secure space and highly redundant power and cooling; we get our internet connectivity from separate service providers, and we own, maintain, and operate the servers, the network equipment, and the software. The problem was not just that the power failure happened; the problem was that it happened abruptly, with no warning whatsoever, and all our equipment went down at once. Data centers, certainly this one, have triple and even quadruple redundancy in their power systems precisely to prevent such an abrupt power outage. The intent is that any power failure comes with sufficient warning so that equipment, databases most importantly, can be shut down gracefully. In fact, the main function such data centers perform is to provide extreme redundancy in power systems, cooling for the equipment, and physical security. There was absolutely no warning prior to this incident, which is what we have asked our vendor to explain, and we hope they will be transparent with us. I do want to say that Equinix has served us well; they are a leader in this field, and we have never suffered an abrupt power outage like this in 5+ years. But they do owe us and the other customers in that data center an explanation for what happened on Friday. They restored power quickly, but the damage was done because of the abruptness of the outage.
As of today, while we have a substantial level of redundancy in the system, we still rely on our data center provider to prevent an abrupt power outage (it happened once, so it could happen again), and we are scrambling to keep another power outage from becoming a service outage of Friday's duration. The first steps we are taking include installing our own separate UPS systems (in addition to all the UPS systems, generators, and conditioned utility power that our vendor provides in the data center) and using database servers with built-in batteries, so they can be shut down gracefully in an event like this.
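To make the graceful-shutdown idea concrete, here is a minimal sketch of the kind of hook we have in mind. It assumes a UPS monitoring daemon (such as NUT's upsmon) is configured to run a script the moment the rack switches to battery power; the paths, the pg_ctl command, and the timeout are illustrative assumptions, not a description of our production setup.

```python
#!/usr/bin/env python3
"""Hypothetical on-battery hook: shut databases down cleanly before power dies.

Assumes a UPS monitoring daemon (e.g. NUT's upsmon, via its NOTIFYCMD) invokes
this script when the rack switches to battery power. All paths and commands
below are illustrative, not our actual production configuration.
"""
import subprocess
import sys

# Illustrative: instances to stop, replicas before the primary.
DATA_DIRS = [
    "/var/lib/db/replica2",
    "/var/lib/db/replica1",
    "/var/lib/db/primary",
]

def graceful_stop(data_dir: str, timeout_s: int = 120) -> bool:
    """Ask the database for a clean shutdown: flush buffers, checkpoint, exit.

    'pg_ctl stop -m fast' is PostgreSQL's clean-shutdown mode; a different
    engine would use its own equivalent. Returns True on success.
    """
    try:
        subprocess.run(
            ["pg_ctl", "stop", "-D", data_dir, "-m", "fast", "-t", str(timeout_s)],
            check=True,
            timeout=timeout_s + 10,
        )
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False

if __name__ == "__main__":
    failures = [d for d in DATA_DIRS if not graceful_stop(d)]
    # A non-zero exit tells the UPS daemon that some instances did not stop
    # cleanly -- those are the ones to check first when power returns.
    sys.exit(1 if failures else 0)
```

The point of a hook like this is simply to buy a clean checkpoint in the minutes of battery runtime, so that replicas come back consistent instead of needing the lengthy reconciliation described below.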
Now let me acknowledge that it took us way too long to recover. Let me first explain why, and then explain what we are going to do about it going forward. In a nutshell, when every database cluster and every server went down at once, the sheer amount of testing and recovery work overwhelmed our human-in-the-loop recovery system. There was never any issue with the safety of the data itself.
We have a massively distributed system, and the design intent of such a system is that everything does not fail at once: parts of the system can and do fail without impacting overall service availability. The problem was that when the entire system went down, it required manual recovery. We had about 20 people working to restore services, but there are well over 100 clusters, and about 40% of them had errors – basically, the redundant database servers within a cluster were out of sync with each other. The inconsistency across replicated instances is usually very slight – perhaps a few bytes off in a 100 GB instance – but the only thing that matters is that there is inconsistency, however slight. This is recoverable without any data loss (except for data entered at the exact moment the power went down), and the recovery process is necessary to ensure that there is no corruption and that all data is consistent across the replicated instances. In most instances this was fast, but in some instances it took time, and the number of slow-to-recover instances delayed the overall recovery. In fact, the first few clusters we tested were fine, and the recovery-time estimate we based on them proved too optimistic once we got to the instances that had problems. There were simply too many such clusters for 20 people to recover in parallel. In effect, the human system was overwhelmed by the scale of the problem. That is why it took us so long to bring all services back up.
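To give a sense of what "a few bytes off in a 100 GB instance" means in practice, here is a rough sketch of the chunk-checksum comparison that locates where two replicas diverge. The chunk size, file-level framing, and hash choice are assumptions for illustration; real tooling works at the database level, but the idea is the same: hash fixed-size ranges on each replica and compare, so only the divergent ranges need repair rather than the whole instance.

```python
import hashlib

CHUNK = 64 * 1024 * 1024  # 64 MiB ranges; size is an assumption for illustration

def chunk_digests(path):
    """Yield (offset, sha256-hex) for each fixed-size range of a data file."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            yield offset, hashlib.sha256(block).hexdigest()
            offset += len(block)

def divergent_ranges(replica_a, replica_b):
    """Return the offsets where two replicas of the same file disagree.

    Even a few flipped bytes in a 100 GB file surface as just one or two
    mismatched chunks, so repair can be confined to those ranges. (This
    sketch assumes both replicas are the same length; a length mismatch
    would itself be a divergence to repair.)
    """
    bad = []
    for (off_a, dig_a), (_, dig_b) in zip(
        chunk_digests(replica_a), chunk_digests(replica_b)
    ):
        if dig_a != dig_b:
            bad.append(off_a)
    return bad
```

The check itself is cheap; what took time on Friday was running it, and the follow-up repair, across dozens of large clusters with only 20 people driving the process.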
We do have all data mirrored in a data center in the New York region (also owned and operated by Equinix), and that data center was not affected by the power outage. All the data was present in that secondary data center, so there was never any possibility of data loss, even if all our primary servers had been wiped out completely. But as of today we do not have sufficient capacity to run all Zoho services from that secondary data center. We keep 3 copies of your data in the primary data center, and usually 1, sometimes 2, copies in the secondary. That means we do not currently have a) sufficient data redundancy in the secondary data center by itself to run all the services (i.e., assuming the primary data center is totally dead), or b) sufficient computing capacity in the secondary to process all the traffic by itself. Our secondary data center protects customer data, but it could not serve all the traffic. We intend to address this ASAP, starting with some of our services first.
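As a rough illustration of the gap we are describing, here is a sketch that checks whether a given copy layout could survive losing an entire site. The service names and the min_copies threshold are assumptions; the counts mirror the 3-copies-primary, 1-to-2-copies-secondary layout above, and the sketch deliberately ignores the compute-capacity half of the problem.

```python
# Hypothetical sketch: which services could still run if one site went dark?
PLACEMENT = {
    "service_a": {"SV": 3, "NY": 2},  # 3 copies in Silicon Valley, 2 in New York
    "service_b": {"SV": 3, "NY": 1},  # only 1 secondary copy
}

def survivable(placement, lost_site, min_copies=2):
    """Return the services that keep at least min_copies if lost_site is gone.

    min_copies=2 assumes we want one redundant copy left to serve reads and
    rebuild from; the real requirement also includes enough compute capacity
    to carry the traffic, which this sketch does not model.
    """
    return [
        svc for svc, sites in placement.items()
        if sum(n for site, n in sites.items() if site != lost_site) >= min_copies
    ]

print(survivable(PLACEMENT, "SV"))  # ['service_a'] -- service_b falls short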
Our first focus is on preventing an outage like this from happening again; the second is faster recovery when disaster strikes. We have been working on this second problem for a while already, and we will accelerate that work. Additional steps we are taking include: a) offering better offline data access, so customers never have to go without their mission-critical business information; b) offering read-only access to data on the web quickly, so access is preserved while we work to recover the editable instance; and c) adding more automation, so recovery from a large-scale incident requires less manual intervention.
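Point b) above, read-only access during recovery, might look something like the following sketch: a thin layer that lets reads through to a known-consistent replica and fails writes fast with a clear error until the cluster is reconciled. The class and method names, and the in-memory flag, are hypothetical illustrations, not our actual architecture.

```python
# Hypothetical sketch of a degraded read-only mode during cluster recovery.
class ClusterGateway:
    def __init__(self, read_replica, primary):
        self.read_replica = read_replica  # a copy already verified consistent
        self.primary = primary
        self.read_only = False

    def enter_recovery(self):
        """Flip to read-only while replicas are being reconciled."""
        self.read_only = True

    def exit_recovery(self):
        """Restore normal read-write operation once the cluster is consistent."""
        self.read_only = False

    def execute(self, query, is_write):
        if is_write and self.read_only:
            # Fail fast with a clear signal instead of hanging or risking
            # divergence: customers keep read access throughout recovery.
            raise RuntimeError("service is in read-only recovery mode; try again later")
        target = self.primary if is_write else self.read_replica
        return target.run(query)
```

The design choice here is to degrade rather than disappear: on Friday, a cluster under reconciliation was simply offline, whereas a gate like this would have kept customer data readable the whole time.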
During this entire episode, our first priority was to make sure customer data remained safe. No customer data was lost, but because the mail store went down and incoming mail server queues overflowed, some mail bounced back. We are working to prevent that from happening again by adding a separate mail store instance.
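The separate mail store we mention would act as a spillover, so that a store outage no longer translates into bounces once the queue fills. A minimal sketch of the idea, with hypothetical class names and assuming the store raises ConnectionError when it is down:

```python
# Hypothetical sketch: spool to a fallback store when the primary mail store
# is down, instead of letting the queue overflow and bouncing mail.
class MailIntake:
    def __init__(self, primary_store, fallback_store, max_queue=10_000):
        self.primary = primary_store
        self.fallback = fallback_store
        self.queue = []
        self.max_queue = max_queue

    def accept(self, message):
        try:
            self.primary.write(message)
        except ConnectionError:  # assumed failure mode for a downed store
            if len(self.queue) < self.max_queue:
                self.queue.append(message)    # hold for retry against primary
            else:
                self.fallback.write(message)  # spill over instead of bouncing

    def drain(self):
        """Replay spooled messages once the primary store is back online."""
        while self.queue:
            self.primary.write(self.queue.pop(0))
```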
We will keep you steadily updated on the progress we make on each of these priorities. The hardware work will be the fastest (buying and installing new systems), and the software work will be the slowest (implementing better automation for faster recovery takes time), but we will keep you posted on all the progress we make. That is our promise.
This was, by far, the biggest outage Zoho has ever faced. We absolutely understand that many businesses rely on Zoho to go about their daily work, and we understand how disappointed and frustrated many customers were by this outage. We too are extremely upset about this incident.
In the coming days we will be refunding a week's worth of your subscription to each and every customer, whether you complained or not. We know money will not give you back the time you lost or compensate you for the hassle and trouble, but we hope you will accept it with our deepest apologies. While the refund may not amount to much for any single customer, at an aggregate level it does affect us, and that penalty will serve as a reminder to ourselves never to let this happen again. That is ultimately the best assurance I can give.