Update on our downtime

This morning, there was a power failure in our California data center (Equinix), taking all Zoho services down. Because every service went down abruptly, the outage also caused database inconsistencies.

Each service has multiple database clusters, roughly 8-12 per service, and each cluster has a Master-Slave combination. The power failure caused inconsistencies in some of these clusters. To restore the services fully, we need to make sure all of these database clusters are consistent. This database consistency check and sync is what is taking time.
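As a rough illustration of what a master-slave consistency check involves, here is a minimal Python sketch. It is not Zoho's actual tooling: the cluster names, table data, and function names are all hypothetical, and each database copy is modeled as a plain dictionary of tables. The idea is only that each table is checksummed on both copies, and a cluster is held back for sync if any checksum differs.

```python
import hashlib


def table_checksum(rows):
    """Order-independent checksum of a table's rows (each row is a tuple)."""
    digest = hashlib.sha256()
    for encoded in sorted(repr(row) for row in rows):
        digest.update(encoded.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent rows cannot blur together
    return digest.hexdigest()


def find_inconsistent_tables(master, replica):
    """Return the sorted names of tables whose contents differ between copies."""
    tables = set(master) | set(replica)
    return sorted(
        name
        for name in tables
        if table_checksum(master.get(name, [])) != table_checksum(replica.get(name, []))
    )


def clusters_needing_sync(clusters):
    """Map cluster name -> inconsistent tables, only for clusters with a mismatch."""
    result = {}
    for name, (master, replica) in clusters.items():
        mismatched = find_inconsistent_tables(master, replica)
        if mismatched:
            result[name] = mismatched
    return result


# Hypothetical data: cluster-2's slave missed a write when power was lost.
clusters = {
    "cluster-1": (
        {"contacts": [(1, "Ann"), (2, "Bob")]},
        {"contacts": [(2, "Bob"), (1, "Ann")]},
    ),
    "cluster-2": (
        {"contacts": [(1, "Ann"), (2, "Bob")]},
        {"contacts": [(1, "Ann")]},
    ),
}

print(clusters_needing_sync(clusters))
```

In this sketch only cluster-2 is flagged, so cluster-1 could be brought online first. That mirrors why some services come back sooner than others.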

Realistically, it is going to take a few hours to restore our services after making sure the databases are in sync. Some services will be restored sooner, depending on the level of inconsistency.

We sincerely apologize for this issue. We will keep you updated.

Update (1:35PM): Zoho Mail is partially restored. Users in affected clusters will be able to access it after we restore that cluster. For now, most Zoho Mail users should be able to access Zoho Mail.

Please note that Zoho Support is also restored.

Update (1:40PM): Zoho Analytics is now restored.

Update (1:48PM): Zoho Writer is now restored.

Update (2:25PM): Zoho Calendar & Contacts are restored. We still are working on fully restoring Writer. Will keep you posted.

Update (2:55PM): Zoho Writer is back up. Zoho Discussions & Zoho Books are restored.

Update (3:15PM): About 60% of the database clusters for Zoho CRM have now been restored. We will be making these available first. Users on these clusters will be able to access their data. As other clusters are restored, we will bring them online.

Update (3:50PM): Zoho CRM is partially restored. Around 80% of users should have access to their data. The other 20% will be restored soon. We will keep you updated.

Update (4:00PM): Zoho Invoice is restored.

So far, the following services are restored: Zoho CRM, Books, Invoice, Mail, Support, Reports, Discussions, Calendar & Writer. Others will follow soon.

Update (4:20PM): Zoho Projects and Wiki are currently restored.

Update (4:28PM): Zoho Recruit & People are now restored.

Update (4:35PM): Zoho CRM is fully restored. Zoho Show & Bug Tracker are also restored. Zoho Docs is partially restored.

Update (4:45PM): Zoho Mobile service is now restored.

Update (4:52PM): Zoho Sheet & Docs are partially restored.

Update (5:35PM): Zoho Chat & Share are currently restored.

Update (5:56PM): Zoho Creator is now restored.

Update (6:12PM): Zoho Meeting is now restored.

All Zoho Applications are currently restored. There are a few backend services currently being restored to full service.

We will publish a detailed blog post with the postmortem report. Again, we apologize for the outage.

 

Comments

78 Replies to Update on our downtime

  1. Note to self: Do not host servers at Equinix.
    What I don't understand - your data center had backup generators - no ? And the servers had working UPS on them, no? And you have an SLA with them, no? If yes, then how could there be a complete power outage? Sounds to me like Equinix did not honor their service level agreement - I'd be suing the pants off the data center and then finding a professional data center to work with. This little fiasco has cost Zoho dearly on so many levels. You need to take action on behalf of your customers so it does not happen again.

  2. Is anyone else having trouble with the search function? When I search for a client's name it will not bring back the client's file. I would not say we are fully restored.

  3. Thanks Raju! Just logged in and everything is working fine :) thanks for your quick updates and getting it back up & running as soon as possible. Leanne

  4. Currently, Creator is our topmost priority. We are working on it. Creator has a large database, which is what is taking time.

  5. The data center (Equinix) has a primary UPS backup and then a generator as the secondary backup. It looks like the system that automatically fails over to these backups went down, causing a system-wide outage.

  6. I am sorry to hear this.
    As I am following the blog, not yet able to get the email up for me & my org.
    Is it possible to have a notification call to us once the required services are really up?
    And, we would like to have the technical & operational details of this failure and would be happy to discuss your plan of action to prevent such things in future. As I understand, it is your sync DB issue, and it might not help even to have multi-location hosting. Nevertheless, it might be worthwhile to explore the possibility to host/co-host in India please. Thanks.

  7. I agree Rich, a power outage and database synchronization should definitely not take 8 hours to recover from. There has to be other issues going on here.

  8. Hmmm. Maybe I am missing the point. I just speaking directly to contact information in CRM. You could have an offline source if you needed one by using SQL express and importing your backup data into your own database. Now the front end application is a different story but at the very least you would have access to tables of data for your customer information. Of course your work flow would slow down significantly but I was speaking to the post where a business came to a stand still because they can't access customer information. I am not speaking about email, that is a whole different animal and a plan for that would need to be formulated, but then again I do not use Zoho for email, only Customer Information like names, phone numbers and addresses with notes. I am not defending Zoho, or Equinex as I think this issue is taking way longer to correct than it should, I'm just saying you should have your own DRP for when something like this does happen. Regards,

  9. You are missing the point GK. Sure, we have all the data backed up and can go off of spreadsheets but do you think the workflow stays the same, especially when you know that you have to re-enter this data in the main repository. DRP when you are vendor dependent is different from when you house your own technology. Example: is it is easy to DRP for an Exchange Failure, but not for a proprietary database service when there are NO offline options. It is clear that you have never been in that situation, and if you think a company can run off of single data extract once a month, then you are badly mistaken. Shame on you my friend.

  10. Last time I looked into it, it wasn't an option, is it now? I'll have to look into that if it ever gets back up and running...Thanks

  11. I wanted to bring one thing to your attention.
    Update (1:48PM): Zoho Writer is now restored.
    Update (2:25PM): Zoho Calendar & Contacts are restored. We still are working on fully restoring Writer. Will keep you posted.
    The 2.25pm Update contradicts the 1.48pm Update. Obviously Zoho Writer was not restored or you would not be working on restoring it.
    Please correct this if possible. It does reflect poorly on your PR end.
    That being said, I hope this issue is resolved quickly. For your benefit, as well as your customers.

  12. You're correct partially for me. But the calendar is in Zoho and is a "live" document, it wouldn't help to extract it monthly or even weekly. So even if we had the contacts it wouldn't help to book or confirm appointments if we don't know when they are...

  13. This is completely ridiculous...I own a IT company and my clients are better protected than your are. I find it ludicrous in this day and age that your data center would be affected by a power outage... I had proposals to get done today and had to move meetings into next week... which simply pushes projects back and I don't get paid as soon... thank you ZOHO.... I am going to have to look at different solutions... unacceptable....
    Rich

  14. Unacceptable is the right word. I have 12 people using an enterprise solution through Zoho CRM. Sales and operations is down because of this lack of competence. I calculate a $36,000 loss of productivity today due to Zoho. PLEASE Contact me directly so you can credit my account for the next 10 years!

  15. Our colocation provider says the computer that manages the switch over has failed. They do have multiple backups. This was unfortunate. While the power was restored quickly, the instant shutdown caused sync issues across our database clusters. We are doing consistency checks before we bring services back up, which is what is taking time and hence the delays.

  16. I'm with @CM, Geo Mirroring isn't in place or this wouldn't be an issue. We have all of our client data in Zoho CRM and rely on this to be up 24/7 like our local File/Mail server.

  17. Its funny reading all these posts about DRP and how many businesses rely on Zoho's CRM only. Do you not have a DRP of your own? I have extracts of all of our customers data in both excel and csv format. Something to consider if businesses run into this issue again. I understand the frustrations of businesses and a service you pay for not being available but to rely on only one source is a mistake that you are making as a business by not having your own DR plan for issues like this. Extracting the data once a month would probably alleviate most of these concerns in most of these posts and would not take that long to do. Shame on you.... for allowing one single service to bring your business to a screeching halt.

  18. Do you have employees? I have 12. And when I pay their paychecks, it's with the expectation that they work. We pay for zoho, have dealt with a slow site multiple times, but this is the breaking point. It has cost us more today than salesforce will cost for 6 months. What's disappointing is how much time and energy we've spent getting the zoho CRM tweaked for our business.
    If you told a client 12 times you were going to do something asap, and you didn't do it, would you expect them to still be a client?

  19. What I don't understand is how does a data center have a power outage and not have backup power. I think something else happened and they are restoring their backups. Our server clusters boots right up on reboot and works in 5 minutes. Longest would take is 1 hour. Unless its something more serious.

  20. That is simply not true. Forty minutes after its "restoration" I have no access to email. And Zoho's own uptime statistics show that NOTHING has been restored.

  21. This is totally unacceptable, I am sending one of my sales people to LA next week and she can't get to her client list to confirm the appointments. The other is going to NY next week and her assistant can't book her appointments because the calendar is tied up in Zoho CRM! This down time is causing serious monetary damage here, way more than the cost of my subscription. We need some guarantees that this doesn't happen again EVER! Depending on how things work out next week, today's disruption would cost me more than a year of salesforce for my small sales team of 4, something for me to consider...

  22. Thank you for the update ZoHo. I wonder how many of the complainers actually pay for service. As a free user, I will patiently wait for you to do your thing as I understand things like this frequently happen and are out of your control. Happy Friday.
    Ryan C

  23. This is unacceptable!!! I have been down for 6 hours now with the promise of Zoho CRM going to be back up and running every half hour that I call… We believed in Zoho and have become totally reliant on Zoho CRM for every aspect of our business. This problem has brought my company to a complete standstill for hours now. We have missed many appointments and lost many opportunities with potential customers. There is no excuse as to why a company like Zoho does not have dual home or backup power programs in place. I want to know what you are going to do to reimburse my company for the thousands of dollars I have lost due to this problem. I would also like a reason as to why our company should take another chance with Zoho and what Zoho is going to do differently so that such a problem is not possible in the future.

  24. To err is human, but to be irresponsible in protecting millions of customers is shameful. Your website says the following: "Geo Mirroring. Customer data is mirrored in a separate geographic location for Disaster Recovery and Business Continuity purposes. Please note geo mirroring is available on select products and plans." We are a premium Mail Suite/CRM/Projects/Books customer and none of these applications have worked all day. Please clarify for all of us which products have geo mirroring or any type of data back-up or protection. You owe it to your customers to honestly explain how our data is protected (and to what extent) if we decide to continue with ZOHO. We really like using the ZOHO suite of products and sincerely hope you can address our concerns. Thank you.

  25. @wdc
    We started restoring services. Zoho Mail, Support & Writer are restored currently. Others will follow. CRM is a priority.

  26. Zoho, your customers should never have known about a power failure. The fact that they do know, and worse, that it was allowed to deny services for any length of time is inexcusable! This is pathetic and should cost several person's jobs. This has not simply inconvenienced your customers. It has caused monetary damages to those who rely on your services for their businesses!
    Zoho, your decision not to plan for, implement and routinely test backup and fail-over procedures represents blatant contempt towards your customers. I will definitely be looking for alternatives.

  27. I'm floored. An entire day lost. Do we have a NEW ETA? This does not give me a good feeling about migrating my entire company to this platform.

  28. I would hate to think what would happen if there was a major earthquake in So Cal. I am amazed that you do not have multiple availability zones in separate geo-regions with failover at the DNS level. This is unheard of in today's Cloud Computing environments. Even basic (start-up) websites and services are prepared for this. It sounds like cutting corners to keep costs down is part of your business plan. With business services as critical as CRM, Email, Project Management, Invoicing, you have virtually STOPPED business for so many people where every minute costs money. I cannot believe that you could have let this happen. The poor planning makes me wonder what else was not well thought out in your platform and services. ZoHo Trust = zero. You get what you pay for, or maybe a little less than you paid for in this case. ZoNo....

  29. This sort of things should never happen but then again, who really expects a Tier 1 colocation provider like Equinex to lose power (including backup) at once and without warning. It reminds back to several months ago when Amazon's S3 had a major outage. Equinex is a state of the art facility which comes at a premium, if anything it reflects poorly on them for not living up the the SLA's. I am a Zoho customer and it's really unfortunate that we have to end the week without our activities, but I understand the issue so I can't be too upset. Sure, a mirrored site on the East Coast would be great for times like these, but there's a reason why Zoho is 70% cheaper than Salesforce... just saying.

  30. If the generator doesn't start, it doesn't matter how many gallons of fuel you have. Happened to us, so we moved to a different data center provider.
    An inexpensive "solution" for this specific issue is to deploy managed in-rack UPS to support the database cluster servers. If power to the rack is lost and not restored in a timely basis, the in-rack UPS will perform a graceful shutdown of the entire database cluster.
    Deploying near-real-time WAN replication of busy database clusters is neither trivial nor inexpensive; would you pay double what you pay Zoho now for the benefit of geographic redundancy?
    Perhaps giving customers a choice, for an additional price, to be hosted on a geographically diverse Zoho cluster versus the single-site cluster would empower customers to choose for their own how much "insurance" they want to buy. For us, we would not, but we recognize that CRM for us is much less mission-critical than it is for others.
    Hope that helps,
    Mark

  31. I am on Mark Stones side. Take the extra time to make sure the databases are re-indexed correctly. We have web-based software housed in servers at a data center. This software manages and records video from remotely located IP cameras. Video as a Service. Should a hard server shutdown occur, it is vital to make sure that the tables are re-indexed correctly to prevent any data loss. I am sure that you folks are working as fast as possible to restore service.

  32. Quite inconvenient, but the Twitter & Blog updates are highly appreciated. I look forward to reading your full report later on what exactly happened.

  33. Time to re-balance your manpower. Less PR, more basic backbone support. I've recommended Zoho to many who were looking at salesforce.com. Needless to say, I've been personally receiving emails blasting me for telling them about Zoho. While problems occur with all of our businesses, we should be well prepared for those we view as possible or likely. For a cloud application provider, it would seem insuring up-time and managing a data center going down in any variation would be at the top of the list. I hope you don't mistake the discontent and confidence hit this has brought. And if people want to leave, you best make it easy to do so. If people start complaining they can't get their data if they want to go, it will simply cause a deeper confidence crisis. And without question - you should look to generous credits to help appease the disgust and discontent.
    We're all watching and hoping this is a huge lesson that propels you to be a far better service-oriented company. And not one that underestimates the importance of reliability, stability and candid and timely service.

  34. Todays outage clearly shows that mirroring is not FAULT TOLERANCE or a Disaster Recovery Strategy. It is in these exact situations that we must realize that all backup solutions are truly RESTORE solutions.
    It is not just about protecting data, but having the ability to provide FAULT TOLERANCE, DATA REPLICATION and SITE FAILOVER. It is about understanding and protecting all points of failure and having a strategy in place to recover.
    We are housed in a Data Center that offers 7 points of communication and a 50,000 Gallon container of Diesel fuel to avoid any power outages.

  35. I just switched over from salesForce.com's CRM system. If you have the available funding I highly recommend them for CRM. This is my first bad experience with Zoho. I hope I did not make a mistake in migrating to Zoho. Regards,

  36. Yea this really sux because I was expecting an email back today for a job position. But I understand this isn't entirely Zoho's fault. I am trying to figure out why Equanix did not have layers of UPS boxes for their data servers. Maybe they did and those failed too or maybe I am just clueless about how this all works. Anyway I just feel bad for all those companies that rely on Zoho's service. A lot of productivity (money) can be lost in one business day due to something like this. Thank God its the weekend. But I think Zoho is doing good updating us on the situation. I guess I shall patiently hope that I can check my email tommorow.

  37. Kent - I've been with Zoho for 3 years now and this is the FIRST time I have ever experienced down time with them. Hang in there!

  38. In this enviroment for cloud/hosting based services, it is almost unbeliveable that you do not have redundancy. Obviously, it was not your power failure that caused this issue, however as being a major player your NOC should have demanded that Equinox do mandatory failovers (of all failure points) at least once a month. I have had major issues with Zoho Support through out my company's relationship with you and it seems that your company is really not ready for prime time just yet. When everything is working, you have a wondeful service, however its a hard sell to keep putting our workflow on your platform.

  39. It is quite disappointing that you have not engineered in redundancy. With millions of paying customers depending on you, you would think that you would have a more fault-tolerant system. It's not hard to design and needs to be in place. Otherwise, Zoho is not really taking proper care of its responsibility as an application service hosting provider.

  40. It is what it is, so I guess we have to suck it up. I find it interesting that in this day and age Zoho does not have a back-up plan for such occurrences. Hmmm...

  41. I just got our company up and running on Zoho. Our company can not be subjective to these types of failures. Does any one know of a comparable CRM system I should consider switching to.

  42. @CC
    We apologize for this. In the process of restoring the services, we noticed database inconsistencies. That pushed us back to the table to recheck all databases. This is certainly not one of our best days. We are going to reevaluate things at our end and make some progress.

  43. This reflects very poorly on you guys. You have to engineer in advance for redundancy... it's incredible to me that power issues in a single data center could bring down *all* of your services for a whole business day. And let me also add, very disappointing that your updates throughout the course of the day have been so misleading. We've been given soft deadlines NUMEROUS times on when services would be coming back up... and you've blown them all. Do you realize some of us have to plan our day around your promises? Can we get some accurate projections??
