AI disaster recovery at Zoho

Even the most robust systems fail from time to time; there is no getting around it. Despite standardised technical frameworks and controls designed to avoid failure and minimise risk, external factors such as power outages, unexpected network partitions, or natural disasters can severely impact operations.

These events may be infrequent, but they are inevitable. The question, then, isn't whether disruption can be fully prevented, but whether it can be effectively contained.

Zoho's approach to business continuity and disaster recovery (BCDR) spans the entire lifecycle, covering operations before, during, and after disruption, and is grounded in four key elements: resilience, recovery, contingency, and continuous improvement.

For starters, customer data is distributed and replicated dynamically across geographically separated data centres to ensure high availability. If the primary data centre is disrupted and becomes unavailable, traffic is immediately rerouted to the secondary data centre, following well-defined procedures for orderly, seamless failovers.
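To make the pattern concrete, here is a minimal, health-check-driven failover sketch. The endpoint URLs, the two-region setup, and the is_healthy probe are illustrative assumptions; production failover at this scale is typically handled at the DNS or load-balancer layer rather than in application code.

```python
import urllib.request

# Hypothetical health endpoints for two regions; not real Zoho URLs.
PRIMARY = "https://primary.example.com/health"
SECONDARY = "https://secondary.example.com/health"

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Probe a health endpoint; any error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def active_region() -> str:
    """Route traffic to the primary unless it fails its health check."""
    if is_healthy(PRIMARY):
        return "primary"
    # Primary unreachable: fail over to the secondary region.
    return "secondary"

if __name__ == "__main__":
    print("Serving from:", active_region())
```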

Data backup and recovery are critical for restoring data during disruptions. At Zoho, incremental backups of audit logs and critical files are carried out regularly, and the backup data is stored in secure, geographically isolated locations to ensure recoverability.
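The core of the incremental approach can be sketched in a few lines: only files whose content has changed since the previous run are copied. The data and backup directories and the manifest format here are hypothetical, not Zoho's actual tooling.

```python
import hashlib
import json
import shutil
from pathlib import Path

SOURCE = Path("data")      # hypothetical source directory
BACKUP = Path("backup")    # hypothetical backup location
MANIFEST = BACKUP / "manifest.json"

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def incremental_backup() -> None:
    BACKUP.mkdir(exist_ok=True)
    seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    for path in SOURCE.rglob("*"):
        if not path.is_file():
            continue
        digest = file_hash(path)
        # Copy only files that are new or changed since the last backup.
        if seen.get(str(path)) != digest:
            dest = BACKUP / path.relative_to(SOURCE)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, dest)
            seen[str(path)] = digest
    MANIFEST.write_text(json.dumps(seen, indent=2))

if __name__ == "__main__":
    incremental_backup()
```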

By combining cross-region data replication with stringently maintained standby environments, failovers can be executed within defined timeframes, typically within hours, with minimal to no service interruption.

During such events, Zoho also prioritises clear, proactive communication to ensure that customers are informed about the situation, that they feel supported and reassured, and, most importantly, that they feel safe.

Critical updates are shared at regular intervals. Post-resolution, a detailed report is prepared covering the cause of the disruption, how it was handled and managed, and the processes subsequently put in place to prevent a recurrence. This report is shared with customers to keep them duly informed and to ensure full transparency.

This architecture is aligned with defined RTO and RPO metrics, which are included as part of our standard solution design and contractual commitments.

To expand a little (a small worked example follows the definitions):

  • RTO (Recovery Time Objective) defines the maximum acceptable time to restore service availability following a disruption.
  • RPO (Recovery Point Objective) defines the maximum acceptable data loss, measured as the interval between the last consistent state of the system and the point of failure.
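As a concrete illustration of both objectives, the sketch below checks a drill's measured downtime and data-loss window against invented targets; the four-hour RTO and fifteen-minute RPO are placeholders, not contractual figures.

```python
from datetime import datetime, timedelta

# Hypothetical targets; actual RTO/RPO values are defined contractually.
RTO = timedelta(hours=4)
RPO = timedelta(minutes=15)

failure_at = datetime(2024, 1, 1, 10, 0)
last_consistent = datetime(2024, 1, 1, 9, 50)   # last replicated state
restored_at = datetime(2024, 1, 1, 12, 30)      # service back online

downtime = restored_at - failure_at              # measured against RTO
data_loss_window = failure_at - last_consistent  # measured against RPO

assert downtime <= RTO, "recovery exceeded the RTO target"
assert data_loss_window <= RPO, "data loss exceeded the RPO target"
print(f"Downtime {downtime}, data-loss window {data_loss_window}: within targets")
```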

Recovery of AI systems

When it comes to AI system recovery, the objective remains the same: to keep systems functioning and to minimise data loss. But AI introduces additional layers of dependency that extend beyond traditional disaster recovery models. Recovery is no longer restricted to restoring infrastructure, as it was with traditional systems; it must also account for data pipelines, model states, and inference behaviour.

Unlike conventional services, AI systems are tightly intertwined with continuously evolving datasets and model versions. Continuity planning must therefore be more vigilant and agile, coordinating recovery across many components rather than restoring a single isolated system.

To address this, Zoho's AI services are deployed on robust, multi-region infrastructure with automated failover mechanisms. In case of a disruption, inference requests are immediately redirected to an alternate region, and model artifacts are replicated across geographies to ensure availability.
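A minimal client-side view of such failover might look like the sketch below; the regional endpoint URLs and the request format are assumptions made for illustration, not Zoho's actual API.

```python
import json
import urllib.request

# Hypothetical regional inference endpoints, in priority order.
REGIONS = [
    "https://ai-us.example.com/v1/predict",
    "https://ai-eu.example.com/v1/predict",
]

def predict_with_failover(payload: dict, timeout: float = 3.0) -> dict:
    """Try each region in priority order; fall through on failure."""
    body = json.dumps(payload).encode()
    for url in REGIONS:
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return json.load(resp)
        except Exception:
            continue  # region unavailable: fail over to the next one
    raise RuntimeError("all inference regions unavailable")
```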

Where needed, training is restarted from scratch, pulling data afresh from the CRM system; this ensures that core AI functionality such as predictions and recommendations remains available.

LLM model recovery

Recovering AI systems requires more than reactivating inference endpoints; it requires restoring models to a known, validated state.

All models are maintained through version-controlled pipelines, with each model version associated with its training configuration, datasets, and evaluation metrics.
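One illustrative way to model such a version record is sketched below; every field name is an assumption for the sake of the example, not Zoho's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelVersion:
    """A version-controlled record tying a model to its provenance."""
    name: str
    version: str
    training_config: dict   # hyperparameters, code revision, and so on
    dataset_ids: list       # identifiers of the datasets used in training
    eval_metrics: dict      # evaluation results recorded at release time
    stable: bool = False    # flipped to True once validated for serving

# Two hypothetical versions of a model; only 1.2.0 is cleared for serving.
registry = [
    ModelVersion("lead-scorer", "1.2.0", {"lr": 1e-4}, ["crm-2024-01"],
                 {"auc": 0.91}, stable=True),
    ModelVersion("lead-scorer", "1.3.0", {"lr": 5e-5}, ["crm-2024-02"],
                 {"auc": 0.89}),
]
```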

When a disruption occurs, the latest stable model version is restored for inference. Any in-progress training jobs are resumed from persistent checkpoints and datasets; if checkpoints are unavailable, training pipelines are re-triggered using preserved datasets.
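A simplified sketch of that decision logic follows, with plain dicts standing in for registry entries like those above, a hypothetical checkpoint directory, and a stubbed re-trigger callback in place of the real pipeline.

```python
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")  # hypothetical persistent store

def latest_stable_version(registry: list) -> dict | None:
    """Pick the newest entry flagged stable for serving."""
    stable = [m for m in registry if m["stable"]]
    # Naive string comparison; real registries compare semantic versions.
    return max(stable, key=lambda m: m["version"]) if stable else None

def recover_training(job: str, retrigger) -> str:
    """Resume from a checkpoint if one survived, else retrain from data."""
    checkpoint = CHECKPOINT_DIR / f"{job}.ckpt"
    if checkpoint.exists():
        return f"resumed {job} from {checkpoint}"
    # No checkpoint survived: re-trigger the pipeline on preserved data.
    return retrigger(job)

registry = [
    {"version": "1.2.0", "stable": True},
    {"version": "1.3.0", "stable": False},
]
print("serving:", latest_stable_version(registry))
print(recover_training("lead-scorer",
                       lambda j: f"retrained {j} from preserved datasets"))
```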

Additionally, a validation layer is applied before reactivation to confirm that restored models meet defined performance thresholds.
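In miniature, such a gate might look like this; the metric names and threshold values are invented for illustration, since real acceptance criteria are model-specific.

```python
# Hypothetical thresholds; real acceptance criteria are model-specific.
THRESHOLDS = {"accuracy": 0.90, "p95_latency_ms": 250}

def passes_validation(metrics: dict) -> bool:
    """Gate reactivation: every metric must meet its threshold."""
    return (
        metrics["accuracy"] >= THRESHOLDS["accuracy"]
        and metrics["p95_latency_ms"] <= THRESHOLDS["p95_latency_ms"]
    )

restored = {"accuracy": 0.92, "p95_latency_ms": 180}
assert passes_validation(restored), "restored model failed validation; do not serve"
```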

Testing and validation of AI recovery systems

Given the complexity of AI systems, there is no room for assumption: recovery strategies must be regularly and stringently tested and validated.

Zoho conducts periodic disaster recovery drills that include failover simulations specifically for AI. These exercises (a simplified harness is sketched after the list) are designed to:

  • Validate the restoration of model artifacts
  • Check the availability of inference endpoints
  • Verify end-to-end recovery of data and training pipelines
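A toy harness for such a drill might look as follows. Every check is a stub and every function name is hypothetical, but it shows the shape of the exercise: run each restoration check, report the result, and fail the drill if any check fails.

```python
def artifacts_restored() -> bool:
    return True  # stub: would compare restored artifact hashes to the registry

def endpoint_available() -> bool:
    return True  # stub: would probe the failover inference endpoint

def pipelines_recovered() -> bool:
    return True  # stub: would run a small end-to-end data/training job

def run_drill() -> None:
    checks = {
        "model artifacts restored": artifacts_restored(),
        "inference endpoint available": endpoint_available(),
        "data/training pipelines recovered": pipelines_recovered(),
    }
    for name, ok in checks.items():
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
    assert all(checks.values()), "drill failed; investigate before sign-off"

run_drill()
```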

These tests go beyond verifying system availability. Post-recovery, key performance indicators are evaluated to confirm that AI systems remain reliable and consistent. These include (see the sketch after the list):

  • Prediction latency
  • Stability of outputs against a baseline
  • Integrity of restored models
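A rough sketch of how those three indicators could be measured post-recovery; the helper names and the exact-match stability measure are simplifying assumptions made for illustration.

```python
import hashlib
import statistics
import time

def p95(values: list) -> float:
    """95th-percentile helper for latency measurements."""
    return sorted(values)[int(0.95 * (len(values) - 1))]

def check_kpis(model, baseline_outputs, inputs, artifact_bytes, baseline_sha256):
    # 1. Prediction latency: time a batch of inference calls.
    latencies, outputs = [], []
    for x in inputs:
        start = time.perf_counter()
        outputs.append(model(x))
        latencies.append((time.perf_counter() - start) * 1000)

    # 2. Output stability: restored model should agree with the baseline.
    agreement = statistics.mean(
        1.0 if a == b else 0.0 for a, b in zip(outputs, baseline_outputs)
    )

    # 3. Artifact integrity: checksum against the recorded hash.
    intact = hashlib.sha256(artifact_bytes).hexdigest() == baseline_sha256

    return {"p95_latency_ms": p95(latencies),
            "agreement": agreement, "intact": intact}

if __name__ == "__main__":
    model = lambda x: x * 2  # stand-in model for the example
    inputs = list(range(100))
    baseline = [model(x) for x in inputs]
    blob = b"model-bytes"
    print(check_kpis(model, baseline, inputs, blob,
                     hashlib.sha256(blob).hexdigest()))
```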

By validating both infrastructure and model behaviour under failure conditions, these drills ensure that recovery processes remain predictable, consistent, and aligned with expected performance standards.
