An ongoing discussion about SAP infrastructure

Improving x86 availability – A good reason for Oracle RAC in the SAP environment

Despite a couple of blog posts that seemed to question the use of Oracle RAC for SAP, it is important to point out that Oracle RAC does an excellent job of maintaining very high availability of the database environment for SAP.  When implemented as an HA solution, i.e. a two-node RAC cluster with all traffic directed to one node while the other serves as a passive backup, RAC can deliver the least downtime possible for Oracle DB environments.  In most cases following an outage, SAP application servers will reconnect to the backup server without hitting an IP timeout, and the total duration of DB unavailability may be in the 15 to 30 second range (although it can take longer for very large databases with a lot of write activity, which require more roll forward/back processing).  It is important to note that availability of the DB does not imply good performance of the DB.  All of the data and programs that had been in the primary system’s SGA and PGA have to be reloaded into the backup system’s SGA and PGA, which can take an hour or longer in some cases.  This means that read transactions during this time may involve actual physical disk reads, which are four to five orders of magnitude slower than memory reads (3 to 10ms vs. 100ns, or .0001ms).
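
To put the cache warm-up penalty in numbers, here is a minimal Python sketch that uses only the latency figures quoted above (3 to 10ms for a disk read, 100ns for a memory read).  The function names and the cache hit ratios are my own illustrative assumptions, not measurements from any SAP or Oracle system.

```python
# Latency figures quoted in the paragraph above.
MEMORY_READ_MS = 0.0001      # 100 ns
DISK_READ_MS = (3.0, 10.0)   # low and high end of a physical disk read


def slowdown_factor(disk_ms, memory_ms=MEMORY_READ_MS):
    """How many times slower a disk read is than a memory read."""
    return disk_ms / memory_ms


def effective_read_ms(hit_ratio, disk_ms, memory_ms=MEMORY_READ_MS):
    """Average read latency when `hit_ratio` of reads are served from memory.

    hit_ratio is a hypothetical input: 0.0 is a cold cache, 1.0 is fully warm.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * memory_ms + (1.0 - hit_ratio) * disk_ms


if __name__ == "__main__":
    for disk_ms in DISK_READ_MS:
        factor = slowdown_factor(disk_ms)
        print(f"{disk_ms} ms disk read is about {factor:,.0f}x slower than memory")
    # Illustrative hit ratios only; real values depend on the workload.
    for hit_ratio in (0.0, 0.5, 0.9, 0.99):
        avg = effective_read_ms(hit_ratio, DISK_READ_MS[0])
        print(f"hit ratio {hit_ratio:.2f}: {avg:.4f} ms average read")
```

Even a modest share of reads going to disk dominates the average, which is why a freshly failed-over database can be available yet slow until its SGA and PGA are repopulated.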

So, you might ask, for what types of systems and situations is Oracle RAC then a good solution?  The answer depends on a number of factors.  For Power Systems, some customers expect a partition to fail no more than once every 4 or 5 years.  If a conventional HA failover takes 20 minutes, then implementing Oracle RAC might eliminate that 20 minute outage, but that works out to only about 4 or 5 minutes of outage avoidance per year.  For some customers, a 20 minute outage at their peak processing time might have an incredibly high cost, and avoiding it, even when spread over 4 or 5 years, might be worth the investment.  For most customers, however, the 4 or 5 minutes of reduced downtime per year would not justify the cost.
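
The arithmetic behind that "4 or 5 minutes per year" figure can be sketched in a few lines of Python.  The function name and the optional residual-outage parameter are my own assumptions for illustration; the 20 minute failover and the 4 to 5 year failure interval are the figures used above.

```python
def avoided_minutes_per_year(failover_minutes, years_between_failures,
                             residual_minutes=0.0):
    """Average downtime avoided per year by replacing a conventional failover.

    failover_minutes: outage length of the conventional HA failover.
    years_between_failures: expected interval between partition failures.
    residual_minutes: outage that remains with the alternative
        (hypothetical parameter, e.g. 0.5 for a 30 second RAC reconnect).
    """
    if years_between_failures <= 0:
        raise ValueError("years_between_failures must be positive")
    return (failover_minutes - residual_minutes) / years_between_failures


if __name__ == "__main__":
    for years in (4, 5):
        saved = avoided_minutes_per_year(20, years)
        print(f"one failure every {years} years: {saved:.1f} minutes avoided per year")
```

Setting residual_minutes to the 15 to 30 second reconnect time mentioned earlier lowers the result slightly, which only strengthens the point.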

But not all systems are built with the reliability features of Power Systems.  x86 systems, in particular, lack many of the features that are now standard on Power Systems.  For example, Power Systems feature First Failure Data Capture (FFDC), an elegant system that detects virtually every type of soft or hard error that can occur within a system.  This allows preventive action prior to an actual failure and/or precise identification of the cause of a failure, thereby speeding resolution through a fast reboot around failing components and/or replacement of the specific part.  For x86 systems, Intel unveiled its first attempt at duplicating this technology, called Machine Check Architecture Recovery (MCA Recovery), but it is very limited in what it can detect, and there are very few preventive actions that can be taken to avoid a failure, much less identify what caused it.  To add insult to injury, where Power Systems utilize a hardware service processor to examine the causes of errors and then fix them on the fly, deallocate components or cause a partition to reboot around failing components, MCA Recovery relies on operating systems or virtualization managers to solve hardware-related problems.  Can anyone actually imagine Microsoft writing code to fix a problem within an Intel processor on a Dell system?  The same goes for VMware, Red Hat, Novell, Oracle and any other company that offers an OS or hypervisor for x86.

In addition, Power Systems offer a wealth of exclusive features such as dynamic processor deallocation, checkpoint/restart and alternate processor recovery (which can handle processor core errors on the fly, including core failure, without affecting running workloads), cache line delete, and many, many more.  Together, these features help explain why Power Systems offer unmatched system availability.

A highly respected data acquisition and analysis firm, Solitaire Interglobal (SIL), recently completed an in-depth study into this and many other subjects.

Before I share my observations about their findings, let me first explain what they do.  Unlike the typical consultancy company, they rely, to a large extent, on data that they have acquired from more than 43,260 customer sites with system types and OSs across the entire spectrum.  When they determine a number for availability, for example, it is a calculated result based on the availability reported by these customers.  They evaluated the overall availability of Power Systems with AIX, x86 with Microsoft Windows and x86 with Linux.  In the best case, Linux systems suffered “only” 44 more hours of system unavailability than Power Systems with AIX.  In the worst case, Windows systems suffered an amazing 193 more hours of system unavailability.

These numbers are derived by taking the availability percentages cited in the study and breaking them down into hours of downtime.  As both planned and unplanned downtime are included, it is useful to understand the cited causes of the difference, specifically and in order of importance: less break-fix activity needed, fewer system patches and updates required, and more “forgiving” system-to-application characteristics.  You can imagine that RAC would help dramatically with the first two.  In other words, in order to achieve levels of availability similar to those delivered by a standard single system image Power System, Linux and Windows systems on x86 would require Oracle RAC.  OK, that is my opinion, but the facts certainly seem to back up that conclusion.
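
For readers who want to repeat that conversion, here is a minimal Python sketch of turning an availability percentage into hours of downtime.  It assumes the downtime is measured over one 8,760 hour year, which is my assumption and not something stated above.  The percentages in the example are hypothetical placeholders, not the figures from the SIL study.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; assumes downtime is measured over one year


def downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Hours of unavailability implied by an availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return (100.0 - availability_pct) / 100.0 * period_hours


def extra_downtime_hours(baseline_pct, other_pct, period_hours=HOURS_PER_YEAR):
    """Additional downtime of `other_pct` relative to `baseline_pct`."""
    return (downtime_hours(other_pct, period_hours)
            - downtime_hours(baseline_pct, period_hours))


if __name__ == "__main__":
    # Hypothetical availability values, chosen only to show the arithmetic.
    baseline, other = 99.9, 99.0
    print(f"{baseline}% available: {downtime_hours(baseline):.2f} hours down")
    print(f"{other}% available: {downtime_hours(other):.2f} hours down")
    print(f"difference: {extra_downtime_hours(baseline, other):.2f} hours")
```

Small differences in the percentage translate into many hours, which is why the gap looks so large once it is expressed as downtime.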

Recently, I met with a customer that was evaluating a new SAP implementation and considering both Power Systems and x86.  Based on their RFP, it appears that they had come to this conclusion completely independently and without the facts from the SIL study.  Their planned implementation covers a broad array of SAP components, which is pretty typical of modern SAP implementations.  Availability was their highest priority, as they plan to integrate SAP deeply into their manufacturing processes, meaning an outage of the SAP environment could have far-reaching implications.  Each instance is well within the capabilities of just about any x86 system and, in fact, a two-node cluster far exceeds what each might require at peak.  However, since Oracle is not properly supported under VMware (see previous blog entry on the subject), much less Oracle RAC, which is most definitely not supported under VMware, each instance requires its own RAC cluster.  By comparison, IBM’s Power Systems offer the needed levels of support, are fully supported by Oracle in a virtualized mode and deliver outstanding levels of availability in a conventional HA cluster.  In fact, prod, non-prod and HA can all share the same cluster of two systems.  The architecture that developed looked as follows:

Notice that the x86 configuration will likely require over 11 times as many servers, 9 times the SAN ports and over 4 times the network ports, not to mention the requirement to manage various different HA scenarios including Oracle RAC, Symantec HA and passive failover environments.  It is not hard to imagine that any initial difference in cost of acquisition would quickly be eaten up by higher infrastructure (SAN and network), systems and database management costs, plus the 3% extra SAV required to purchase RAC through SAP.

I end with the title of this blog entry.  There is a good reason to use Oracle RAC in the SAP environment, i.e. to deliver the same level of availability as delivered by a single instance Power Systems implementation, but doing so would result in a far more complex environment with much higher overall costs.


October 27, 2011 - Posted by | Uncategorized
