Covering Scientific & Technical AI | Sunday, December 1, 2024

Making SQL Servers Resilient in the Cloud 

In IT, failure is not only an option, it’s inevitable. And in an era when businesses depend on big data, high-performance data analytics (HPDA), machine learning and AI applications running in the cloud, failures can have serious consequences.

Failure per se is not the problem. Downtime – i.e., minimizing downtime – is the problem. There are ways to prevent failures, both isolated and widespread, from causing unacceptable periods of downtime. In fact, there are so many high availability (HA) and disaster recovery (DR) options available that the challenge is making the optimal choice for each application.

The Six Imperatives of IT Resiliency

Business continuity strategies require planning, and these six imperatives of IT resiliency from Deloitte Consulting LLC provide useful guidance:

  1. Defend – Minimize downtime caused by any failure, mostly by eliminating single points of failure.
  2. Detect – The sooner a failure is detected, usually with continuous monitoring, the sooner remedial action can be taken.
  3. Remediate – Upon detecting a failure, either an automatic or manual response is needed to activate standby resources.
  4. Recover – Recovery for critical applications should occur rapidly and fully (ideally automatically and to pre-disruption performance levels), and with little or no data loss.
  5. Diagnose – After an incident, conduct a postmortem analysis to identify the root cause(s).
  6. Refine – All failures are learning experiences that can inform necessary improvements resiliency provisions.

The HA and DR provisions needed for all application are determined by their recovery time and recovery point objectives. For applications that must recover quickly with little or no data loss, robust HA provisions with automatic failover to a “hot standby” instance are needed. Those applications that can tolerate some downtime and data loss normally require only basic DR provisions with manual failover to “cold” or “warm” standby resources.

IT Resiliency in the Cloud

Microsoft Azure, Amazon Web Services (AWS), the Google Cloud Platform (GCP) and other cloud service providers (CSPs) all offer standard and optional ways to maximize the availability of services. Their service level agreements (SLAs) usually offer money-back guarantees for uptime falling short of a designated percentage. For HA, four-nines of uptime (99.99 percent) should be considered the goal, and that normally requires spanning multiple availability zones or regions to protect against failures at the data center level.

The SLA guarantees cover “dial tone” at the server level only, however, and explicitly exclude many causes of downtime at the application level. Customers are, therefore, responsible for implementing the HA provisions needed to ensure satisfying the demanding RTOs and RPOs of mission-critical SQL Server databases. For DR needs, where RTOs and RPOs are typically less stringent, the CSP’s DR-as-a-Service (DRaaS) or DIY (Do-It-Yourself) DR offerings are usually suitable for most applications.

In addition to the CSP’s IT resiliency provision, SQL Server offers two of its own HA/DR features: Failover Cluster Instances and Always On Availability Groups. FCIs afford two notable advantages: inclusion in the less expensive Standard Edition and protection of the entire SQL Server instance. A significant disadvantage is the need for cluster-aware shared storage, which has historically been unavailable in the cloud. The lack of shared storage was addressed in Windows Server 2016 Datacenter Edition with the introduction of Storage Spaces Direct, but S2D also has limitations, especially the need for both the active and standby instances to reside within the same data center.

SQL Server’s other HA/DR feature, Always On Availability Groups, is a more robust solution capable of satisfying an RTO of 5-10 seconds and an RPO of seconds or less. Among its other advantages are inclusion in SQL Server 2017 for Linux, no need for shared storage and readable secondaries for queries (with appropriate licensing). But for Windows it requires licensing for the more expensive Enterprise Edition and it lacks protection for the entire SQL Server instance.

The limitations associated with these options has created a need for third-party failover clustering solutions purpose-built to provide HA/DR protections for virtually all Windows and Linux applications in private, public and hybrid cloud environments. Such offerings afford two major advantages: no need for shared storage (ideal in the cloud) and application-agnosticism, which eliminates the need to have different HA/DR provisions for different applications.

These software-only solutions include, at a minimum, real-time data replication, continuous monitoring able to detect failures at the application level, and configurable policies for failover and failback. Most also offer a variety of value-added capabilities, including some specific to Linux and popular applications like SQL Server.

Being purpose-built for HA, failover clusters are able to satisfy RTOs as low as 20 seconds with no data loss (an RPO of zero), enabling them to deliver an uptime of 99.99 percent. One notable disadvantage is the inability to read secondary instances of SQL Server databases. But given their capacity to detect, remediate and recover from failures at the application level, purpose-built failover clustering solutions are becoming increasingly popular for critical database and other applications running in the cloud.

Refining (#6) the Defense (#1) of SQL Server Databases

For mission-critical SQL Server databases running in the cloud, there are really only two viable options for assuring the 99.99 percent uptime required for HA: SQL Server’s own Always On Availability Groups and third-party failover clusters. True, both options do increase both capital and operational expenditures. But when weighing that relatively modest increase against the cost of downtime and the savings afforded by the cloud, a compelling case can be made for making refinements in how you now defend your organization’s mission-critical applications.

David Bermingham is technical evangelist at SIOS Technology.

AIwire