High Availability with SQL Server 2008 R2
Applies to: SQL Server 2008 and SQL Server 2008 R2
This white paper describes the technologies available in SQL Server 2008 that can be used as part of a high-availability strategy to protect critical data. As well as describing the technologies in detail, the white paper also discusses the various causes of downtime and data loss, and how to evaluate and balance requirements and limitations when planning a high-availability strategy involving SQL Server 2008.
This white paper is targeted at architects, IT pros, and database administrators (DBAs) tasked with implementing a high-availability strategy. It assumes the reader is familiar with Windows and SQL Server and has at least a rudimentary knowledge of database concepts such as transactions.
Today many companies require some or all of their critical data to be highly-available. For example, a company requiring “24×7” availability is an online merchant, whose product databases and online sales applications must be available at all times; otherwise sales (and revenue) are lost. Another example is a hospital, where computerized patient records must be available at all times or a human life could be lost.
In a perfect world, this critical data would remain available and nothing would threaten its availability. In the real world, however, there are numerous problems that can cause data to become unavailable. If high availability of the data is required, a proactive strategy must be formulated to mitigate the threats to availability—commonly called a “high-availability strategy”.
Such strategies always call for the implementation of multiple technologies that help maintain data availability—there is no single high-availability technology that can meet all requirements. The Microsoft® SQL Server® 2008 data management system includes a variety of technologies that can be used to increase and/or maintain high availability of critical data. This white paper will introduce these technologies and describe when and how they should be used.
It should be noted here that high availability is not the same as disaster recovery, although the two terms are often (erroneously) interchanged. High availability is about putting a set of technologies into place before a failure occurs to prevent the failure from affecting the availability of data. Disaster recovery is about taking action after a failure has occurred to recover any lost data and make the data available again.
Before describing SQL Server 2008 high-availability technologies and their uses, the white paper will discuss the causes of downtime and data loss, to allow the usefulness and applicability of the technologies to be shown. Furthermore, the white paper will also present guidelines on how to approach planning for high availability—including what requirements to gather and what limitations to be aware of before considering any technologies.
Finally, after the technologies have been described, the causes of downtime and data loss will be revisited with a list of which technologies can be used to mitigate each cause.
Many of the technologies discussed are only available in SQL Server 2008 Enterprise, while others have limitations in SQL Server 2008 Standard (or lower). A summary table of edition support for the features discussed in the white paper is included in the final section. Furthermore, the SQL Server Books Online topic “Features Supported by the Editions of SQL Server 2008” (http://msdn.microsoft.com/en-us/library/cc645993.aspx) has a comprehensive list.
Causes of Downtime and Data Loss
The main point of a high-availability strategy is to keep the critical data as available as possible in the event of a failure or disaster. Failures, by their very nature, are not planned events and so it is hard to predict when one will occur and what form it will take. There are also times when planned events may occur that affect data availability if preventative steps have not been taken. This section of the white paper will discuss the various events that may occur that can cause downtime and data loss, but it will not discuss the mitigating technologies.
There are three planned activities that can cause downtime—performing database maintenance, performing batch operations, and performing an upgrade. Neither of these activities results in data loss.
Database maintenance can cause downtime if an operation is performed that needs to place a lock on a table for an extended period of time. Examples of such operations include:
Creating or rebuilding a nonclustered index (can prevent table modifications)
Creating, dropping, or rebuilding a clustered index (can prevent table reads and modifications)
Performing batch operations can cause downtime through blocking locks. For example, consider a table that holds the year-to-date sales for a company. At the end of each month, the data from the oldest month must be deleted. If the number of rows being deleted is large enough, a blocking lock may be required, which prevents updates to the table while the delete operation is being performed. A similar scenario exists where data is being loaded into a table from another source.
Performing an upgrade always incurs some amount of downtime, because there comes a point where the application must disconnect from the database that is being upgraded. Even with a hot standby technology (such as synchronous database mirroring), there is (at best) a very short time between the application disconnecting and then connecting to the redundant copy of the database.
In both cases, planned downtime can be minimized using the technologies in SQL Server 2008.
Unplanned Downtime and Data Loss
An even larger number of failures exist that can cause downtime and data loss. These can be grouped into several broad categories:
Data center failure: This category of failures takes the entire data center offline, rendering any local redundant copies of the data useless. Examples include natural disasters, fire, power loss, or failed network connectivity.
Server failure: This category of failures takes the server hosting one or more SQL Server instances offline. Examples include failed power supply, failed CPU, failed memory, or operating system crashes.
I/O subsystem failure: This category of failures involves the hardware used to physically store the data and thus directly affects the data itself—usually causing data loss (which then may lead to downtime to perform disaster recovery). Examples include a drive failure, a RAID controller failure, or an I/O subsystem software bug causing corruption.
Human error: This category of failures involves someone (a DBA, a regular user, or an application programmer introducing an application software bug) making a mistake that damages data, usually causing data loss (and then potentially downtime to recover the data). Examples include dropping a table, deleting or updating data in a table without specifying a predicate, setting a database offline, or shutting down a SQL Server instance. Human errors are usually accidental, but they can also be malicious.
In all cases, data loss can be prevented and downtime can be minimized using the technologies in SQL Server 2008.
Planning a High-Availability Strategy
A successful high-availability strategy cannot be planned solely from the technical standpoint, because the costs and risks to the business from downtime and/or data loss must be understood. Similarly, a strategy cannot be driven solely from business requirements without an understanding of the technical implications of those requirements.
The first answer from many people when planning a high-availability strategy is something like “implement failover clustering!” without any consideration of what the strategy is trying to achieve. Although failover clustering is an excellent technology, it is not appropriate in all situations, so it is important to pick the right technologies rather than just the first one that comes to mind. (Again, note that a successful strategy always includes multiple technologies.)
Being able to pick the right technologies means understanding not only the characteristics of the technologies but also the prioritized list of requirements, taking into account any limitations that exist.
Requirements are an important part of the design and testing phases. As part of the overall design process, requirements are identified. Next, the design must be validated against the requirements prior to implementation, and finally, they must be used to test the design after it is implemented.
The aim of the requirements gathering process is to generate a prioritized list of what data needs to be protected and to what level. For example, the process should take into account system components, the amount of time and data the organization can afford to lose, and performance.
It is important to consider the “application ecosystem”— the data components of the system that must be protected. This may be as simple as a few rows in a table or as complicated as multiple databases on different SQL Server 2008 instances, plus appropriate SQL Server Agent jobs, security settings, stored-procedures, and more. The more complicated the application ecosystem, the more difficult it is to ensure everything is made highly available.
Of course, it is likely that the overall high-availability strategy will encompass protecting multiple resources with different requirements and against different causes of downtime and data loss.
The two main requirements around high-availability are commonly known as RTO and RPO. RTO stands for Recovery Time Objective and is the maximum allowable downtime when a failure occurs. RPO stands for Recovery Point Objective and is the maximum allowable data-loss when a failure occurs. Apart from specifying a number, it is also necessary to contextualize the number. For example, when specifying that a database must be available 99.99% of the time, is that 99.99% of 24×7 or is there an allowable maintenance window?
A requirement that is often overlooked is workload performance. Some high-availability technologies can affect workload performance when implemented (either outright or when configured incorrectly). Also, workload performance after a failover must be considered—should the workload continue with the same throughput as before, or is some temporary degradation acceptable?
Some examples of requirements are:
Database X must be available as close to 24×7 as possible and no data loss can be tolerated. Database X also relies on stored procedures in the master database and must be hosted on a SQL Server 2008 instance on a server in security domain Y.
Tables A, B, and C in database Z must be available from 8 A.M. to 6 P.M. on weekdays, no data loss can be tolerated, and they must all be available together. After a failover, workload performance cannot drop.
Limitation analysis is crucial to prevent wasted time (and money) from designing a strategy that cannot be implemented (for example, the strategy involves adding a data center, but the total budget is US$10,000 per year). Limitations are not just in terms of budget (although that is usually the overriding limitation)—there are other nontechnical limitations, and a host of technical limitations to consider.
The nontechnical limitations include:
Power (for more servers, disks, and associated air conditioning)
Space (for more servers and ancillary equipment)
Air conditioning (to cope with all the extra heat output from added equipment)
Manpower (to install and maintain any added systems and equipment)
Time to implement (the more complex the design, the longer it takes to implement)
Politics and/or management issues (if multiple teams are involved)
The technical limitations can more directly affect which SQL Server 2008 technologies you can choose to incorporate in the high-availability strategy. Some important questions to ask are listed here, and the ramifications of the answers are explained in more detail throughout the rest of the white paper.
What recovery model is being used currently? Database mirroring, log shipping, and the ability to take transaction log backups are not available if the simple recovery model is used. Switching out of the simple recovery model requires that transaction log management be performed to prevent the transaction log growing out of control.
What is the transaction log generation rate of the workload? This impacts the management of the transaction log itself, especially because some technologies require the full recovery model (for example, database mirroring). If the recovery model has to change, many things may be affected, such as the size of transaction log backups or the ability to perform faster batch processes and index maintenance operations.
How large are the average transactions? Large (and usually long-running) transactions can significantly affect the ability to quickly bring a redundant copy of a database online after a failover, because any uncommitted transactions must be rolled back to make the database transactionally consistent. This may prevent the RTO from being achieved. Also, if synchronous database mirroring is being considered, the network latency and its effect on transaction response time should be taken into account.
What is the network bandwidth and latency to the remote data center? Poor bandwidth and/or high latency affect the ability to implement any technology that copies data, transaction log content, or backups between data centers.
Can the application be altered? With any redundancy technology, the application must reconnect to the server (or maybe another server) after a failover occurs, and potentially restart a failed transaction. These are much easier to accomplish if the application can be altered.
Is the FILESTREAM feature required in the application? The FILESTREAM feature cannot be used in conjunction with database mirroring or database snapshots in SQL Server 2008. Also, depending on how FILESTREAM is used, it may cause problems with backups, log shipping, and transactional replication. This is explained in more detail in the white paper “FILESTREAM Storage in SQL Server 2008” (available at http://msdn.microsoft.com/en-us/library/cc949109.aspx).
Are there any hardware or software limitations? Hardware limitations (in terms of CPU, memory, I/O subsystem, or disk space) can affect database mirroring, multi-instance failover clustering, RAID level, and workload throughput. The available edition of SQL Server 2008 affects which high-availability technologies are able to be used (as mentioned in the Introduction).
After all the requirements and limitations are known, realistic analysis is needed to ensure that the original requirements can be met or altered to meet identified limitations. Compromises may involve relaxing a requirement or removing a limitation, or simply designing a strategy that meets as many of the requirements as possible with defined and documented limitations. Management should also be made aware of these limitations and approve of the compromises – otherwise additional budget, time, or analysis might be needed to reach higher levels of availability.
Whatever form the compromise takes, it is essential that all interested parties agree on the compromise. And again, it is also essential that management is aware of which requirements cannot be met and why before a failure occurs, so that no one is surprised.
After the post-compromise requirements are known then, and only then, is the time to evaluate the various technologies. There is no point evaluating and picking technologies based on a flawed set of requirements and then having to repeat the process after the requirements are better understood.
Similarly, success does not come from picking an unsuitable technology and then trying to make it perform a function for which it was not designed. For example:
Log shipping is unsuitable for a design that requires zero data loss.
Database mirroring is unsuitable for a design that requires multiple databases to fail over simultaneously.
When you evaluate technologies, it is important to consider all aspects of the technologies, including:
The monetary cost of implementing the technology.
The complexity of implementing, configuring, and managing the technology.
The technology’s impact on workload performance (if any).
The data loss exposure if the technology is used.
The downtime potential if the technology is used.
SQL Server 2008 High-Availability Technologies
SQL Server 2008 has a large number of technologies to aid in maintaining high availability. The easiest way to describe these is to start with logging and recovery, backups, and backup strategy—the building blocks of any high-availability strategy. This provides the basis to then explain the other technologies that are applicable if only a single instance of SQL Server 2008 can be used, and then the multi-instance technologies, which people usually think of as the high-availability technologies in SQL Server.
Logging and Recovery
After the post-compromise requirements are known then, and only then, is the time to evaluate the various technologies. There is no point evaluating and picking technologies based on a flawed set of requirements and then having to repeat the process after the requirements are better understood.
The fundamental mechanisms that allow SQL Server 2008 (and all other relational database management systems in the industry) to keep data transactionally consistent are logging and recovery.
In this section these two mechanisms will be briefly described. They underpin nearly all the high-availability technologies in SQL Server 2008. For a much more in-depth description of these mechanisms, and a description of the architecture and behavior of the transaction log itself, see the TechNet Magazine article “Understanding Logging and Recovery in SQL Server” (available athttp://technet.microsoft.com/en-us/magazine/2009.02.logging.aspx).
All changes to a database are logged as they are made. Logging a change means describing exactly what changed in the database and how it was changed—in a structure called a log record. Log records are stored in the transaction log for each database and must remain available until they are no longer required by SQL Server. Some common reasons for log records being required include:
A transaction log backup needs to be taken.
A full or differential backup requires the transaction log.
Database mirroring has not yet sent the log records to the mirror database.
Transactional replication has not harvested the committed log records.
Change data capture has not harvested the committed log records.
Uncommitted transactions exist which may be rolled back.
The primary reason that logging takes place is to allow crash recovery to happen (sometimes called restart recovery or simply recovery).
Recovery is the mechanism by which a database is made transactionally consistent. It occurs when a database is brought online after being shut down with active transactions running (for example, after a SQL Server 2008 instance fails over to another node in a failover cluster). It is also required as part of restoring backups (whether manually or as part of log shipping), running database mirroring, and creating a database snapshot.
Recovery has two phases—redo and undo. The redo phase ensures that all committed (that is, completed) transactions are wholly reflected in the database. Redo is usually faster than undo (but of course it is very variable). The undo phase ensures that all uncommitted (that is, active but not completed) transactions are not reflected in the database in any way whatsoever. In other words, the database is made transactionally consistent.
There is a special case of undo—when a transaction is rolled back during the course of normal operations. In that case, all changes made by the transaction are reversed. This is commonly known as rolling back or aborting a transaction.
A database cannot be made available until recovery has completed, with two exceptions. In SQL Server 2008 Enterprise, the database is made available after the redo phase has completed during regular crash recovery or as part of a database mirroring failover. In these two cases, the database comes online faster and downtime is reduced. This feature is called fast recovery.
Backup, Restore, and Related Technologies
Any high-availability strategy must include a comprehensive backup strategy. Even if the high-availability strategy includes technologies to provide redundant copies of the databases being protected, there are several reasons why backups should also exist:
The failure may affect the redundant copies of the databases as well as the primary copies. In this case, without backups to allow restoration, the data may be completely lost.
It may be possible to restore a small portion of the databases, which may be more palatable than failing over. Furthermore, it may be possible to restore that small portion of the database without affecting the portion being used by the application—effectively recovering from the failure with zero downtime.
One mistake that many people make is to plan a backup strategy without carefully considering the consequences of having to perform restore operations. It may seem that the backup strategy is working correctly but then a failure occurs and it turns out that the existing backups do not allow the best restore operations to be performed.
For example, a backup strategy that includes a weekly full database backup plus transaction log backups every 30 minutes seems at first glance to be a sound strategy. This scenario is illustrated in Figure 1.
Figure 1: Backup strategy with full database and transaction log backups
However, consider the case in which a failure occurs just before the next full database backup is due to occur. The restore sequence is the previous full database backup, plus all transaction log backups since then (approximately 350!).
Although this strategy would allow recovery up to the point of the failure with no data loss, it would involve a significant amount of downtime, because all the transactions since the full database backup would effectively have to be replayed (as part of restoring the transaction log backups).
A better backup strategy, which would allow recovery with no data loss and a lot less downtime, may be to add in periodic differential database backups. A differential database backup contains all parts of the database that have changed since the previous full database backup. This is effectively the same information contained in all the transaction log backups since the previous full database backup, but is far faster to restore. This strategy is illustrated in Figure 2.
Figure 2: Backup strategy involving full database, differential database, and transaction log backups
If a failure occurs just before the next full database backup, this strategy allows the restore sequence to be the previous full database backup, the most recent differential database backup, and then all the transaction log backups since the most recent differential database backup. This is significantly fewer backups to have to restore. Of course, if the change rate of the database is very high, it may make more sense to implement more frequent full database backups instead.
The point of this example is that it is always more prudent to plan a restore strategy than it is to plan a backup strategy—work out what restore operations are required, and then let that information dictate the backup strategy. The best backup strategy is the one that best supports the restore operations the organization requires.
The details of how backup and restore operations work are beyond the scope of this white paper, but there is a comprehensive TechNet Magazine article that goes into much greater depth:
“Understanding SQL Server Backups” from the July 2009 issue (available athttp://technet.microsoft.com/en-us/magazine/2009.07.sqlbackup.aspx)
SQL Server 2008 Books Online also has a large amount of information, starting with the topic “Backing Up and Restoring Databases in SQL Server” (available athttp://msdn.microsoft.com/en-us/library/ms187048.aspx).
Partial Database Availability and Online Piecemeal Restore
SQL Server 2008 includes many technologies that can dramatically reduce the time required to perform a restore operation. The most important of these technologies is partial database availability—the ability to have parts of the database offline and continue processing using the online portions.
This ability is based on having the various portions of data stored in separate filegroups. As long as the primary filegroup and the transaction log are online, the database itself is online and all other online filegroups can be used for regular processing.
For example, consider an auto parts sales database that consists of 5 sets of data—one set of tables that stores details of the various parts, and one set of tables for each US state in which the company operates. If all of the tables are stored in a single filegroup, and if that filegroup is offline, the entire sales application is also offline. If each state’s tables are stored in a separate filegroup—manual partitioning of your data by region—the loss of a single filegroup may not take the entire sales application offline. Figure 3 shows this scenario.
Figure 3: An example of manual partitioning
Continuing the example, if any of the state filegroups are offline, only sales to that state are affected. It is only if the Parts filegroup goes offline that the entire sales application may become unavailable, depending on the application logic. The filegroup that is damaged can be restored while the rest of the database is being used for processing, using a feature called online piecemeal restore.
The second way of splitting up the database is to use horizontal partitioning of tables and indexes. In this case, a single table can be split into partitions and each partition stored in a separate filegroup (or groups of partitions stored in a single filegroup – for example all months within a filegroup that contains an entire year, if the table contains data from multiple years).
As an example of horizontal partitioning, consider a sales database with a single sales table, containing all sales from the last year-to-date. If the table is stored in a single filegroup, the sales application is offline if that single filegroup is offline. However, if the table is split into multiple data ranges (for example, a calendar month in each range), with each range contained in a separate partition stored in its own filegroup, if one filegroup is offline, the sales application might not be affected at all. If the filegroup containing the current month partition is offline, no new orders can be processed, but older orders can be examined. If one or more of the filegroups containing prior month partitions is offline, new orders can be processed, and only the ability to examine older orders is compromised.
In either case, the sales table as a whole remains available for processing, with just the offline portions unavailable. As with manual partitioning, the offline portions of the sales table can be restored online while the other portions are available for processing—making use of online piecemeal restore.
This strategy also allows better control of how hardware resources are allocated. Because current sales are the most critical, these can be placed on storage arrays with better redundancy, while older sales data can be placed on storage arrays with less redundancy—saving money overall.
For more information about using partitioning, see the white paper “Partitioned Table and Index Strategies Using SQL Server 2008” (http://msdn.microsoft.com/en-us/library/dd578580.aspx) and the SQL Server Books Online topic “Partitioning” (http://msdn.microsoft.com/en-us/library/ms178148.aspx). For more information about partial database availability, see the white paper “Partial Database Availability” (http://download.microsoft.com/download/d/9/4/d948f981-926e-40fa-a026-5bfcf076d9b9/PartialDBAvailability.doc). For more information about online piecemeal restore, see “Performing Piecemeal Restores” (http://msdn.microsoft.com/en-us/library/ms177425.aspx) in SQL Server Books Online.
As one more example of piecemeal restore, it is also possible to restore a single database page from a backup, as long as transaction log backups exist from the time of the last full database backup up to the current time. The ability to restore only that page (or pages) that are damaged, and to do it online, further reduces the downtime required to recover from a failure.
For more information about single-page restore, see “Performing Page Restores” (http://msdn.microsoft.com/en-us/library/ms175168.aspx) in SQL Server Books Online.
Instant File Initialization
When a restore operation is performed, the first phase of the restore is to create the database files being restored (if they do not already exist). The second phase is for SQL Server to zero-initialize the newly created database data files for security purposes. The restore operation cannot continue until the zero-initialization completes.
Zero-initialization involves sequentially writing blocks of zeroes into the data files to overwrite the previous contents of that area of the NTFS volume. The time required for this initialization is therefore directly proportional to the data file size. If the restore operation is being performed as part of disaster recovery, and the database is offline while the restore is being performed, the zero-initialization process could prevent the RTO from being reached.
For example, with a 1-terabyte data file and a backup containing 120 gigabytes (GBs) of data, the entire 1-terabyte data file must be created and zero-initialized, even though only 120 GBs of data will be restored into it.
In all editions of SQL Server 2008, the zero-initialization process can be skipped entirely using a feature called instant file initialization. Through the granting of a Windows® privilege to the SQL Server service account, the SQL Server instance creates the file and instructs NTFS that it does not need to be zero initialized. This change can realize a dramatic drop in the time required to perform a restore operation, if the database data files must be created as part of the restore. This can also help during database mirroring setup, when the initial full database backup is restored.
After it is enabled, this feature also allows regular creation and expansion of database data files to skip the zero-initialization process, thus providing further availability gains. It should be noted that this feature does not apply to the database transaction log file—transaction log files must always be zero initialized to guarantee correct operation of the crash recovery process.
For more information about this feature, see “Database File Initialization” (http://msdn.microsoft.com/en-us/library/ms175935.aspx) in SQL Server Books Online.
It is always advisable to have multiple copies of backups, in case a backup becomes damaged in some way by an I/O subsystem failure. For instance, if only a single copy of a damaged transaction log backup exists and there is no other backup spanning that time period, the database c