High Availability Test Plan for a BizTalk Solution

I’m starting to write a new document template, following my last one, the BizTalk Solution Assessment and Review Template. These are the major tracks of the next document, the High Availability Test Plan Template.

This is the first part, covering the most important topics: a short introduction and the major areas.


Introduction

High availability testing is targeted toward finding bugs and issues that affect the availability of the product. It involves verifying how the system behaves in failover situations.

A system failover can be caused in two ways.
A. Controlled failover: Occurs when the system is taken down for maintenance. E.g. manually failing over resources to a known good node.
B. Uncontrolled failover: Occurs when the system fails due to unforeseen circumstances. E.g. power outage, network outage, application crash, or slow/sluggish response due to depleted resources.
Testing in the High Availability area can be broken down into 3 categories:

A. Deployment.

B. Running the application for extended periods.

C. Simulating failover. E.g. fault injection.

High Availability testing should not be confused with Stress or Performance testing. Generating stress against a system deployed in a Highly Available scenario is useful for instrumenting system behavior, but it is not imperative.
Tracking performance, although not the primary objective, is important for detecting scalability issues. What matters most is the behavior of the system under failover situations.

The areas to focus on are normally:

A. Time between failover and recovery (a timing sketch follows this list).

B. Data integrity. E.g. is the atomic transaction saved in its entirety? If not, is it rolled back?

C. State/Session management. E.g. is the shopping basket lost? Are credentials persisted?

D. Recovery. E.g. when performing a long-running task, does the system restart from the beginning or continue where it left off?
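
For measuring the time between failover and recovery, something as simple as a polling probe is usually enough. The following is a minimal sketch, assuming a hypothetical health-check URL on the load-balanced endpoint; the URL and polling interval are placeholders for your own environment, not part of any product.

```python
# Minimal sketch: poll the load-balanced endpoint and record how long it stays
# unavailable after a fault is injected. URL and timings are hypothetical.
import time
import urllib.request
import urllib.error

ENDPOINT = "http://cluster-vip/healthcheck"   # hypothetical health-check URL
POLL_INTERVAL = 1.0                            # seconds between probes


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answered with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def measure_outage(url: str) -> float:
    """Block until the endpoint goes down and comes back; return downtime in seconds."""
    while probe(url):                  # wait for the injected failure to become visible
        time.sleep(POLL_INTERVAL)
    failed_at = time.monotonic()

    while not probe(url):              # wait for the surviving/restarted node to respond
        time.sleep(POLL_INTERVAL)
    recovered_at = time.monotonic()

    return recovered_at - failed_at


if __name__ == "__main__":
    downtime = measure_outage(ENDPOINT)
    print(f"Failover-to-recovery window: {downtime:.1f} seconds")
```

The same probe can be left running for the whole test pass and its output logged per build, which feeds directly into the progress tracking discussed later.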

Highly available systems are set up using either NLB [Network Load Balancing], MSCS [MS Cluster Service], or both, preferably running on Data Center versions of Windows. There are clear distinctions between NLB and MSCS.

A. NLB is a scale-out mechanism for handling load. Prime candidates for NLB are Web farms; adding more web servers to an NLB cluster helps scale out.

B. MSCS is a highly available setup for handling failover. Prime examples of MSCS use are clustered backend systems such as SQL, Exchange or Commerce Server.

C. CLB [Component Load Balancing] is recommended for middle-tier computers. It is deployed on “Application Servers”. However, applications have to be architected to take advantage of component load balancing.


The Areas

Deployment:

Setting up High Availability test beds is more time consuming, since it involves setting up multiple machines to run as part of a group. Hardware requirements are also demanding: these high-performance backend systems should usually be at least quad-proc with 4 GB of RAM or better. MSCS clustered systems should be set up to use RAID 5 SCSI drives with dual NICs. Once these test beds are set up, they should be dedicated to High Availability testing.

Deployment on High Availability test beds can also be a challenge, since setting up images is difficult: MSCS clustered machines maintain state on shared resources [SCSI drives]. Thus, deployment on High Availability test beds will require customized automated scripts that handle the cleanup [e.g. restoring a backup of a pristine database] of the state maintained on shared resources.
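
As one illustration of such a cleanup script, here is a minimal sketch that restores a pristine database backup on the clustered SQL instance before each test pass. The server name, database name, and backup path are hypothetical placeholders, not values from any real deployment.

```python
# Minimal cleanup sketch, assuming a SQL Server back end: restore a known
# pristine backup before each test pass. Names and paths are hypothetical.
import subprocess

SQL_SERVER = r"SQLCLUSTER\BIZTALK"                         # hypothetical clustered instance
DATABASE = "BizTalkMsgBoxDb"                               # example database to reset
BACKUP_FILE = r"E:\Backups\BizTalkMsgBoxDb_pristine.bak"   # hypothetical backup path

RESTORE_SQL = (
    f"ALTER DATABASE [{DATABASE}] SET SINGLE_USER WITH ROLLBACK IMMEDIATE; "
    f"RESTORE DATABASE [{DATABASE}] FROM DISK = N'{BACKUP_FILE}' WITH REPLACE; "
    f"ALTER DATABASE [{DATABASE}] SET MULTI_USER;"
)


def restore_pristine_database() -> None:
    """Run the restore through sqlcmd using Windows (trusted) authentication."""
    subprocess.run(
        ["sqlcmd", "-S", SQL_SERVER, "-E", "-b", "-Q", RESTORE_SQL],
        check=True,
    )


if __name__ == "__main__":
    restore_pristine_database()
    print(f"{DATABASE} restored from {BACKUP_FILE}")
```

A script like this can be chained with whatever re-deployment steps the product requires, so every run starts from the same known good state on the shared disks.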

Coverage:

Considering the complexity and variance of functionality of the servers, we will need clear guidelines from the PMs as to which feature areas are designed to be highly available.

Testing:

Before setting up a Highly Available test bed, all functionality pertaining to the particular area being tested should pass single-box testing. If the code is broken on a single box, it is a waste of time to set it up in a multi-box configuration. E.g. a build should pass some base verification tests before it can be considered HA deployable.

Most High Availability tests are based on functional areas. Based on feedback from the PMs and discussions with the devs, we need to define the areas that can be tested for High Availability.

Tracking progress will require tools that simulate failure and log the results in a structured manner, so that build-to-build comparisons can be made and the progress of the system can be tracked over time.
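
A minimal sketch of that kind of structured result logging is shown below: each failover run is appended as one row so that results can be compared build to build. The file location, field names, and sample values are hypothetical.

```python
# Minimal sketch: append one structured record per failover run for
# build-to-build comparison. File location and fields are hypothetical.
import csv
import datetime
import os

RESULTS_FILE = r"\\testshare\ha-results\failover_runs.csv"   # hypothetical share

FIELDS = ["timestamp", "build", "scenario", "downtime_seconds", "data_intact", "result"]


def log_run(build: str, scenario: str, downtime: float,
            data_intact: bool, result: str) -> None:
    """Append one failover run as a row; write the header on first use."""
    new_file = not os.path.exists(RESULTS_FILE)
    with open(RESULTS_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.datetime.now().isoformat(timespec="seconds"),
            "build": build,
            "scenario": scenario,
            "downtime_seconds": f"{downtime:.1f}",
            "data_intact": data_intact,
            "result": result,
        })


if __name__ == "__main__":
    # Example record; the build number and scenario name are made up.
    log_run(build="3.0.4567.0", scenario="NLB: web service stopped",
            downtime=12.4, data_intact=True, result="PASS")
```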

NLB [Network Load Balancing]:

Testing NLB clusters will involve simulating web server failure through various mechanisms (a fault-injection sketch follows this list).

A. Inetinfo process stopped.

B. Network connection lost.

C. Machine rebooted.

D. System wide low resource situation [Low memory, Out of disk space].
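
For mechanism A, a fault-injection step can be as simple as stopping and restarting the web service on one node. The sketch below assumes the web servers run IIS and uses sc.exe against a remote machine; the node name and outage window are hypothetical placeholders.

```python
# Minimal fault-injection sketch, assuming IIS web servers: stop the WWW
# publishing service on one NLB node, wait, then restart it. Node name and
# outage duration are hypothetical.
import subprocess
import time

NODE = "WEB01"           # hypothetical NLB cluster member
SERVICE = "w3svc"        # IIS World Wide Web Publishing Service
OUTAGE_SECONDS = 120     # how long the node stays down


def set_service(node: str, service: str, action: str) -> None:
    """Use sc.exe to start or stop a service on a remote machine."""
    subprocess.run(["sc", rf"\\{node}", action, service], check=True)


if __name__ == "__main__":
    set_service(NODE, SERVICE, "stop")      # take the web server out
    time.sleep(OUTAGE_SECONDS)              # observe load shifting to the other nodes
    set_service(NODE, SERVICE, "start")     # bring it back and watch recovery
```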

Expected behavior is that the load is transferred to the remaining good server (a node-distribution check is sketched after this list).
State is maintained and not lost.
Authentication credentials are still valid.
Shopping baskets are not lost or corrupt.
Transactions are either committed or rolled back, preserving data integrity.
Performance is acceptable when failure occurs.
The system throughput recovers when a failed node is brought back online.
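
One way to verify that the load really does move to the surviving server is to hit the virtual IP repeatedly and tally which node answered each request. The sketch below assumes each node stamps a hypothetical "X-Served-By" response header; the URL, header name, and request count are placeholders.

```python
# Minimal sketch: sample which node serves each request before and after
# fault injection. Assumes a hypothetical per-node "X-Served-By" header.
import collections
import urllib.request

VIP_URL = "http://cluster-vip/default.aspx"   # hypothetical load-balanced URL
REQUESTS = 200


def sample_node_distribution(url: str, count: int) -> collections.Counter:
    """Return a count of responses per node, keyed by the node header."""
    hits = collections.Counter()
    for _ in range(count):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                node = resp.headers.get("X-Served-By", "unknown")
                hits[node] += 1
        except OSError:
            hits["<no response>"] += 1
    return hits


if __name__ == "__main__":
    print("Before fault injection:", sample_node_distribution(VIP_URL, REQUESTS))
    # ... inject the fault (e.g. stop the web service on one node) ...
    print("After fault injection: ", sample_node_distribution(VIP_URL, REQUESTS))
```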

 

MSCS [MS Cluster Service]:

Testing in this area will involve verifying the behavior of services that have been developed to be cluster aware. Cluster-aware services should gracefully fail over when problems arise.
Cluster Service is not a load balancing mechanism; it provides redundancy in case of failure.
Thus, there are 2 options:

A. Active/Active: Requires that the application be architected around this fact. E.g. setting up 2 copies of a service to run concurrently can cause unwanted side effects. Active/Active is usually suited to stateless transacted services.

B. Active/Passive: This is the commonly deployed scenario for MSCS. In this setup only one node is active at any time. If this active node happens to fail the next available passive node takes over the role of active node.

Testing MSCS clusters will involve simulating server failure through various mechanisms (a controlled failover sketch follows this list).

A. Cluster aware service stopped due to crash or stopped manually for maintenance.

B. Network connection lost.

C. Machine rebooted.

D. System wide low resource situation [Low memory, Out of disk space].
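
Controlled failover (mechanism A, done deliberately rather than by crashing the service) can be scripted with the cluster.exe command-line tool that ships with MS Cluster Service. The sketch below moves a resource group to the other node and then prints its state; the cluster, group, and node names are hypothetical.

```python
# Minimal controlled-failover sketch, assuming the cluster.exe tool from
# MS Cluster Service. Cluster, group, and node names are hypothetical.
import subprocess

CLUSTER = "BTSCLUSTER"        # hypothetical cluster name
GROUP = "SQL Group"           # hypothetical clustered resource group
TARGET_NODE = "NODE2"         # known good node to fail over to


def move_group(cluster: str, group: str, node: str) -> None:
    """Initiate a controlled failover of the group to the target node."""
    subprocess.run(
        ["cluster", cluster, "group", group, f"/moveto:{node}"],
        check=True,
    )


def show_group_state(cluster: str, group: str) -> None:
    """Print the group's current owner node and state."""
    subprocess.run(["cluster", cluster, "group", group, "/status"], check=True)


if __name__ == "__main__":
    move_group(CLUSTER, GROUP, TARGET_NODE)
    show_group_state(CLUSTER, GROUP)
```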

Expected behavior is:
Tasks running at the point of failure provide retry logic and automatically restart.
Long-running tasks do not have to be restarted from the beginning; they restart at the last known good checkpoint (a checkpoint/retry sketch follows this list).
Processes affected by the failure do not corrupt data.
The next available passive node takes over in a reasonable amount of time.
The node that takes over continues to provide service in an acceptable manner.
The system throughput recovers when a failed node is brought back online.
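
The checkpoint/retry behavior described above follows a simple pattern: persist progress after every unit of work so that, after a failover, the task resumes from the last known good checkpoint rather than from the beginning. The sketch below illustrates the pattern; the checkpoint file name and work items are hypothetical stand-ins for the real job.

```python
# Minimal sketch of the checkpoint/retry pattern: persist progress after each
# unit of work and resume from the last checkpoint after a failover.
import json
import os
import time

CHECKPOINT_FILE = "task_checkpoint.json"   # would live on the shared cluster disk
WORK_ITEMS = list(range(1000))             # stand-in for the real long-running job


def load_checkpoint() -> int:
    """Return the index of the next unprocessed item (0 on a fresh start)."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["next_index"]
    return 0


def save_checkpoint(next_index: int) -> None:
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"next_index": next_index}, f)


def process(item: int, attempts: int = 3) -> None:
    """Process one item with simple retry logic instead of failing outright."""
    for attempt in range(1, attempts + 1):
        try:
            # ... real work against the back end would go here ...
            return
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(2 ** attempt)       # back off, then retry


if __name__ == "__main__":
    start = load_checkpoint()              # resume after failover, not restart
    for index in range(start, len(WORK_ITEMS)):
        process(WORK_ITEMS[index])
        save_checkpoint(index + 1)
```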

 

CLB [Component Load Balancing]:

Component Load Balancing provides scale-out options in the middle tier, usually at the Application Server level. In order to take advantage of this form of load balancing, COM components need to implement certain generic interfaces, and applications need to be architected to take advantage of the features provided by CLB. Specifically, the way state is maintained needs to be clearly defined, and transactions and rollback need to be built into the application logic. To take advantage of pooling, components need to be totally stateless in addition to implementing one extra interface.

 

Data Center Certification:

This involves passing a list of rigorous tests under both NLB & MSCS.

All areas that pertain to High Availability need to be defined early in the product cycle. Tests need to be developed and run at regular intervals to ensure that we are on track to meet certification requirements. If issues exist, they can be highlighted early on and addressed either by fixing code defects or by re-architecting certain areas of the product to ensure that no single point of failure exists.

Some of the tests include:
Running on the Data Center version of Windows.
Running with the /3GB & /PAE switches enabled.
Exhausting memory.
Saturating CPU.
Shutting down the system.
Simulating a network outage.
The purpose of these tests is to monitor the application under adverse conditions; a sketch for simulating the memory and CPU pressure follows the expected behavior list.
Expected behavior is:
Application waits and does not lock up.
Application can run from within a job object.
Application timeouts can be tuned so that it continues functioning.
Application provides retry logic rather than failing.
Application provides detailed, concise logging information.
Application does not lose critical data.
Application does not cause data corruption. E.g. distributed transactions are rolled back atomically.
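
For the "exhausting memory" and "saturating CPU" tests, a simple resource hog run alongside the application under test is often enough. The sketch below grabs memory in large chunks and spins one busy worker per CPU; the chunk size and duration are arbitrary placeholders, and it should only ever be run on a dedicated test machine.

```python
# Minimal low-resource simulation sketch: saturate every CPU and deplete
# memory for a fixed window while the application under test is monitored.
import multiprocessing
import time

CHUNK_MB = 256            # size of each memory allocation
DURATION_SECONDS = 300    # how long to hold the pressure


def burn_cpu(stop_time: float) -> None:
    """Busy-loop until the deadline to keep one core saturated."""
    while time.monotonic() < stop_time:
        pass


if __name__ == "__main__":
    deadline = time.monotonic() + DURATION_SECONDS

    # Saturate every CPU with a busy worker process.
    workers = [multiprocessing.Process(target=burn_cpu, args=(deadline,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()

    # Deplete memory by holding on to ever more chunks until allocation fails.
    hog = []
    try:
        while time.monotonic() < deadline:
            hog.append(bytearray(CHUNK_MB * 1024 * 1024))
            time.sleep(1)
    except MemoryError:
        pass                               # memory exhausted; hold what we have
    time.sleep(max(0, deadline - time.monotonic()))

    for w in workers:
        w.join()
```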

 

Methodology – requirements and expectations from High Availability testers

to be covered in the next post…
