SAPonPower

An ongoing discussion about SAP infrastructure

How to ensure Business Suite on HANA infrastructure is mission critical ready

Companies that plan on running Business Suite on HANA (SoH) require systems that are at least as fault tolerant as their current mission critical database systems.  Actually, the case can be made that these systems have to exceed current reliability design specifications due to the intrinsic conditions of HANA, most notably, but not limited to, extremely large memory sizes.  Other factors that will further exacerbate this include MCOD, MCOS, Virtualization and the new SPS09 feature, Multi-Tenancy.

A customer with 5TB of data in their current uncompressed Suite database will most likely see a reduction due to HANA compression (SAP note 1793345, and the HANA cookbook²) bringing their system size, including HANA work space, to roughly 3TB.  That same customer may have previously been using a database buffer of 100GB +/- 50GB.  At a current buffer size of 100GB, their new HANA system will require 30 times the amount of memory as the conventional database did.  All else being equal, 30x of any component will result in 30x failures.  In 2009, Google engineers wrote a white paper in which they noted that 8% of DIMMS experienced errors every year with most being hard errors and that when a correctable error occurred in a DIMM, there was a much higher chance that another would occur in that same DIMM leading, potentially, to uncorrectable errors.¹  As memory technology has not changed much since then, other than getting denser which could lead to even more likelihood of errors due to cosmic rays and other sources, the risk has likely not decreased.  As a result, unless companies wish to take chances with their most critical asset, they should elect to use the most reliable memory available.

IBM provides exactly that, the best of breed open systems memory reliability, not as an option at a higher cost, but included with every POWER8 system, from the one and two socket scale-out systems to even more advanced capabilities with the 4 & 8-socket systems, some of which will scale to 16-sockets (announced as a Statement of Direction for 2015).  This memory protection is represented in multiple discreet features that work together to deliver unprecedented reliability.  The following gets into quite a bit of technical detail, so if you don’t have your geek hat on, (mine can’t be removed as it was bonded to my head when I was reading Heinlein in 6th grade; yes, I know that dates me), then you may want to jump to the conclusions at the end.

Chipkill – Essentially a RAID like technology that spans data and ECC recovery information across multiple memory chips such that in the event of a chip failure, operations may continue without interruption.   Using x8 chips, Chipkill provides for Single Device Data Correction (SDDC) and with x4 chips, provides Double Device Data Correction (DDDC) due to the way in which data and ECC is spread across more chips simultaneously.

Spare DRAM modules – Each rank of memory (4 ranks per card on scale-out systems, 8 ranks per card on enterprise systems) contains an extra memory chip.  This chip is used to automatically rebuild the data that was held, previously, on the failed chip in the above scenario.  This happens transparently and automatically.  The effect is two-fold:  One, once the recovery is complete, no additional processing is required to perform Chipkill recovery allowing performance to return to pre-failure levels; Two, maintenance may be deferred as desired by the customer as Chipkill can, yet again, allow for uninterrupted operations in the event of a second memory chip failure and, in fact, IBM does not even make a call out for repair until a second chip fails.

Dynamic memory migration and Hypervisor memory mirroring – These are unique technologies only available on IBM’s Enterprise E870 and E880 systems.  In the event that a DIMM experiences errors that cannot be permanently corrected using sparing capability, the DIMM is called out for replacement.  If the ECC is capable of continuing to correct the errors, the call out is known as a predictive callout indicating the possibility of a future failure.  In such cases, if an E870 or E880 has unlicensed or unassigned DIMMS with sufficient capacity to handle it, logical memory blocks using memory from a predictively failing DIMM will be dynamically migrated to the spare/unused capacity. When this is successful this allows the system to continue to operate until the failing DIMM is replaced, without concern as to whether the failing DIMM might cause any future uncorrectable error.  Hypervisor memory mirroring is a selective mirroring technology for the memory used by the hypervisor which means that even a triple chip failure in a memory DIMM would not affect the operations of the hypervisor as it would simply start using the mirror.

L4 cache – Instead of conventional parity or ECC protected memory buffers used by other vendors, IBM utilizes special eDRAM (a more reliable technology to start with) which not only offers dramatically better performance but includes advanced techniques to delete cache lines for persistent recoverable and non-recoverable fault scenarios as well as to deallocate portions of the cache spanning multiple cache lines.

Extra memory lane – the connection from memory DIMMs or cards is made up of dozens of “lanes” which we can see visually as “pins”.  POWER8 systems feature an extra lane on each POWER8 chip.  In the event of an error, the system will attempt to retry the transfer, use ECC correction and if the error is determined by the service processor to be a hard error (as opposed to a soft/transient error), the system can deallocate the failing lane and allocate the spare lane to take its place.  As a result, no downtime in incurred and planned maintenance may be scheduled at a time that is convenient for the customer since all lanes, including the “replaced” one are still fully protected by ECC.

L2 and L3 Caches likewise have an array of protection technology including both cache line delete and cache column repair in addition to ECC and special hardening called “soft latches” which makes these caches less susceptible to soft error events.

As readers of my blog know, I rarely point out only one side of the equation without the other and in this case, the contrast to existing HANA capable systems could not be more dramatic making the symbol between the two sides a very big > symbol; details to follow.

Intel offers a variety of protection technologies for memory but leaves the decision as to which to employ up to customers.  This ranges from “performance mode” which has the least protection to “RAS mode” which has more protection at the cost of reduced performance.

Let’s start with the exclusives for IBM:  eDRAM L4 cache with its inherent superior protection and performance over conventional memory buffer chips, dynamic memory migration and hypervisor memory mirroring available on IBM Enterprise class servers, none of which are available in any form on x86 servers.  If these were the only advantages for Power Systems, this would already be compelling for mission critical systems, but this is only the start:

Lock step – Intel included similar technology to Chipkill in all of their chips which they call Lock step.  Lock step utilizes two DIMMs behind a single memory buffer chip to store a 64-byte cache line + ECC data instead of the standard single DIMM to provide 1x or 2x 8-bit error detection and 8-bit error correction within a single x8 or x4 DRAM respectively (with x4 modules, this is known as Double Device Data Correction or DDDC and is similar to standard POWER Chipkill with x4 modules.)  Lock Step is only available in RAS mode which incurs a penalty relative to performance mode.  Fujitsu released a performance white paper³ in which they described the results of a memory bandwidth benchmark called STREAM in which they described Lock step memory as running at only 57% of the speed of performance mode memory.

Lock step is certainly an improvement over standard or performance mode in that most single device events can be corrected on the fly (and two such events serially for x4 DIMMS) , but correction incurs a performance penalty above and beyond that incurred from being in Lock step mode in the first place.  After the first such failure, for x8 DIMMS, the system cannot withstand a second failure in that Lockstep pair of DIMMS and a callout for repair (read this as make a planned shutdown as soon as possible) be made to prevent a second and fatal error.  For x4 DIMMS, assuming the performance penalty is acceptable, the planned shutdown could be postponed to a more convenient time.  Remember, with the POWER spare DRAMS, no such immediate action is required.

Memory sparing – Since taking an emergency shutdown is unacceptable for a SoH system, Lock Step memory is therefore insufficient since it handles only the emergency situation but does not eliminate the need for a repair action (as the POWER memory spare does) and it incurs a performance penalty due to having to “lash” together two cards to act as one (as compared to POWER that achieves superior reliability with a single memory card).  Some x86 systems offer memory sparing in which one rank per memory channel is configured as a spare.  For instance, with the Lenovo System x x3850, each memory channel supports 3 DIMMs or ranks.  If sparing is used, the effective memory throughput of the system is reduced by 1/3 since one of every 3 DIMMs is no longer available for normal operations and the memory that must be purchased is increased by 50%.  In other words, 1TB of usable memory requires 1.5TB of installed memory.  The downsize of sparing is that it is a predictive failure technology, not a reactive one.  According to the IBM X6 Servers: Technical Overview Redbook-  “Sparing provides a degree of redundancy in the memory subsystem, but not to the extent of mirroring. In contrast to mirroring, sparing leaves more memory for the operating system. In sparing mode, the trigger for failover is a preset threshold of correctable errors. When this threshold is reached, the content is copied to its spare. The failed rank is then taken offline, and the spare counterpart is activated for use.”  In other words, this works best when you can see it coming, not after a part of the memory has failed.    When I asked a gentleman manning the Lenovo booth at TechEd && d-code about sparing, he first looked at me as if I had a horn sticking out of my head and then replied that almost no one uses this technology.  Now, I think I understand why.  This is a good option, but at a high cost and still falls short of POWER8 memory protection which is both predictive and reactive and dynamically responds to unforeseen events.  By comparison, memory sparing requires a threshold to be reached and then enough time to be available to complete a full rank copy, even if only a single chip is showing signs of imminent failure.

Memory mirroring – This technology utilizes a complete second set of memory channels and DIMMs to maintain a second copy of memory at all times.  This allows for a chip or an entire DIMM to fail with no loss of data as the second copy immediately takes over.  This option, however, does require that you double the amount of memory in the system, utilize plenty of system overhead to keep the pairs synchronized and take away ½ of the memory bandwidth (the other half of which goes to the copy).  This option may perform better than the memory sparing option because reads occur from both copies in an interleaved manner, but writes have to occur to both synchronously.

Conclusions:

Memory mirroring for x86 systems is the closest option to the continuous memory availability that POWER8 delivers.  Of course, having to purchase 2TB of memory in order to have proper protection of 1TB of effective memory adds a significant cost to the system and takes away substantial memory bandwidth.  HANA utilizes memory as few other systems do.

The problem is that x86 vendors won’t tell customers this.  Why?  Now, I can only speculate, but that is why I have a blog.  The x86 market is extremely competitive.  Most customers ask multiple vendors to bid on HANA opportunities.  It would put a vendor at a disadvantage to include this sort of option if the customer has not required it of all vendors.  In turn, x86 vendors don’t won’t to even insinuate that they might need such additional protection as that would imply a lack of reliability to meet mission critical standards.

So, let’s take this to the next logical step.  If a company is planning on implementing SoH using the above protection, they will need to double their real memory.  Many customers will need 4TB, 8TB or even some in the 12TB to 16TB range with a few even larger.  For the 4TB example, an 8TB system would be required which, as of the writing of this blog post, is not currently certified by SAP.  For the 8TB example, 16TB would be required which exceeds most x86 vendor’s capabilities.  At 12TB, only two vendors have even announced the intention of building a system to support 24TB and at 16TB, no vendor has currently announced plans to support 32TB of memory.

Oh, by the way, Fujitsu, in the above referenced white paper, measured the memory throughput of a system with memory mirroring and found it to be 69% that of a performance optimized system.  Remember, HANA demands extreme memory throughput and benchmarks typically use the fastest memory, not necessarily the most reliable meaning that if sizings are based on benchmarks, they may require adjustment when more reliable memory options are utilized.  Would larger core counts then be required to drive the necessary memory bandwidth?

Clearly, until SAP writes new rules to accommodate this necessary technology or vendors run realistic benchmarks showing just how much cpu and memory capacity is needed to support a properly mirrored memory subsystem on an x86 box, customers will be on their own to figure out what to do.

That guess work will be removed once HANA on Power GAs as it already includes the mission critical level of memory protection required for SoH and does so without any performance penalty.

Many thanks to Dan Henderson, IBM RAS expert extraordinaire, from whom I liberally borrowed some of the more technically accurate sentences in this post from his latest POWER8 RAS whitepaper¹¹ and who reviewed this post to make sure that I properly represented both IBM and non-IBM RAS options.

¹ http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
² https://cookbook.experiencesaphana.com/bw/operating-bw-on-hana/hana-database-administration/monitoring-landscape/memory-usage/
³ http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CD0QFjAA&url=http%3A%2F%2Fdocs.ts.fujitsu.com%2Fdl.aspx%3Fid%3D8ff6579c-966c-4bce-8be0-fc7a541b4a02&ei=t9VsVIP6GYW7yQTGwIGICQ&usg=AFQjCNHS1fOnd_QAnVV6JjRju9iPlAZkQg&bvm=bv.80120444,d.aWw
¹¹ http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=WH&infotype=SA&appname=STGE_PO_PO_USEN&htmlfid=POW03133USEN&attachment=POW03133USEN.PDF#loaded.
Advertisement

November 19, 2014 Posted by | Uncategorized | , , , , , , , , | 2 Comments

SAP TechEd Video – Shall I Stay or Shall I Go … to x86? Technical Factors and Considerations.

For any that might be interested in seeing a presentation that I delivered at SAP TechEd Las Vegas 2011, please click on this link.   This presentation discusses why customers might consider x86 for their SAP environments and the reasons why Power Systems may deliver lower costs, better reliability and security.   The video lasts approximately 60 minutes.

October 4, 2011 Posted by | Uncategorized | , , , , , , , | Leave a comment

vSphere 5.0 compared to PowerVM

Until recently, VMware partitions suffered from a significant scalability limitation. Each partition could scale to a maximum of 8 virtual processors (vp) with vSphere 4.1 Enterprise Edition. For many customers and uses, this did not pose much of an issue as some of the best candidates for x86 virtualization are the thousands of small, older servers which can easily fit within a single core of a modern Intel or AMD chip. For SAP customers, however, the story was often quite different. Eight vp does not equate to 8 cores, it equates to 8 processor threads. Starting with Nehalem, Intel offered HyperThreading which allowed each core to run two different OS threads simultaneously. This feature boosted throughput, on average, by about 30% and just about all benchmarks since that time have been run with HyperThreading enabled. Although it is possible to disable it, few customers elect to do so as it removes that 30% increased throughput from the system. With HyperThreading enabled, 8 VMware vp utilize 4 cores/8 threads which can be as little as 20% of the cores on a single chip. Put in simple terms, this can be as little as 5,000 SAPS depending on the version and MHz of the chip. Many SAP customers routinely run their current application servers at 5,000 to 10,000 SAPS, meaning moving these servers to VMware partitions would result in the dreaded hotspot, i.e. bad performance and a flood of calls to the help desk. By comparison, PowerVM (IBM’s Power Systems virtualization technology) partitions may scale as large as the underlying hardware and if that limit is reached, may be migrated live to a larger server, assuming one exists in the cluster, and the partition allowed to continue to operate without interruption and a much higher partition size capability.

 
VMware recently introduced vSphere 5.0. Among a long list of improvements is the ability to utilize 32 vp for a single partition. On the surface, this would seem to imply that VMware can scale to all but a very few large demands. Once you dig deeper, several factors emerge. As vSphere 5.0 is very new, there are not many benchmarks and even less customer experience. There is no such thing as a linearly scalable server, despite benchmarks that seem to imply this, even from my own company. All systems have a scalability knee of the curve. Where some workloads, e.g. AIM7, when tested by IBM showed up to 7.5 times the performance with 8 vp compared to 1 vp on a Xeon 5570 system with vSphere 4.0 update 1, it is worthwhile to note that this was only achieved when no other partitions were running, clearly not the reason why anyone would utilize VMware. In fact, one would expect just the opposite, that an overcommitment of CPU resources would be utilized to get the maximum throughput of a system. On another test, DayTrader2.0 in JDBC mode, a scalability maximum of 4.67 the performance of a single thread was reached with 8 vp, once again while running no other VMs. It would be reasonable to assume that VMware has done some scaling optimization but it would be premature and quite unlikely to assume that 32 vp will scale even remotely close to 4 times the performance of an 8 vp VM. When multiple VMs run at the same time, VMware overhead and thread contention may reduce effective scaling even further. For the time being, a wise customer would be well advised to wait until more evidence is presented before assuming that all scaling issues have been resolved.

But this is just one issue and, perhaps, not the most important one. SAP servers are by their very nature, mission critical. For database servers, any downtime can have severe consequences. For application servers, depending on how customers implement their SAP landscapes and the cost of downtime, some outages may not have as large the consequence. It is important to note that when an application server fails, the context for each user ‘s session is lost. In a best case scenario, the users can recall all necessary details to re-run the transactions in flight after re-logging on to another application server. This means that the only loss is the productivity of that user multiplied by the number of users previously logged on and doing productive work on that server. Assuming 500 users and 5 minutes to get logged back on, transaction initiated through to completion, this is only 2,500 minutes of lost productivity which at a loaded cost of $75,000 per employee is only a total loss to the company of $1,500 per occurrence. With one such occurrence per application server per year, this would result in $6,000 of cost over 5 years and should be included in any comparison of TCO. Of course, this does not take into consideration any IT staff time required to fix the server, any load on the help desk to help resolve issues, nor any political cost to IT if failures happen too frequently. But what happens if the users are unable to recall all of the details necessary to re-run the transactions or what happens if tight integration with production requires that manufacturing be suspended until all users are able to get back to where they had been? The costs can escalate very quickly.

So, what is my point? All x86 hypervisors, including VMware 4.1 and 5.0, are software layers on top of the hardware. In the event of an uncorrectable error in the hardware, the hypervisor usually fails and, in turn, takes down all VMs that it is hosting. Furthermore, problems are not just confined to the CPU, but could be caused by memory, power supplies, fans or a large variety of other components. I/O is yet another critical issue. VMware provides shared I/O resources to partitions, but it does this sharing within the same hypervisor. A device driver error, physical card error or, in some cases, even an external error in a cable, for example, might result in a hypervisor critical error and resulting outage. In other words, the hypervisor becomes a very large single point of failure. In order to avoid the sort of costs described above, most customers try to architect mission critical systems to reduce single points of failure not introduce new ones.

PowerVM takes the opposite approach. First, it is implemented in hardware and firmware. As the name implies, hardware is hardened meaning it is inherently more reliable and far less code is required since many functions are built into the chip.

Second, PowerVM acts primarily as an elegant dispatcher. In other words, it decides which partition executes next in a given core, but then it gets out of the way and allows that partition to execute natively in that core with no hypervisor in the middle of it. This means that if an uncorrectable error were to occur, an exceedingly rare event for Power Systems due to the wide array of fault tolerant components not available in any x86 server, in most situations the error would be confined to a single core and the partition executing in that core at that moment.

Third, sharing of I/O is done through the use of a separate partition called the Virtual I/O (VIO) server. This is done to remove this code from the hypervisor, thereby making they hypervisor more resilient and also to allow for extra redundancy. In most situations, IBM recommends that customers utilize more than one VIO server and spread I/O adapters across those servers with redundant virtual connections to each partition. This means that if an error were to occur in a VIO server, once again a very rare event, only the VIO server might fail, but the other VIO servers would not fail and there would be no impact on the hypervisor since it is not involved in the sharing of I/O at all. Furthermore, partitions would not fail since they would be multipathing virtual devices across more than one VIO server.

So even if VMware can scale beyond 8vp, the question is how much of your enterprise are you ready to place on a single x86 server? 500 users? 1,000 users? 5,000 users? Remember, 500 users calling the help desk at one time would result in long delays. 1,000 at the same time would result in many individuals not waiting and calling their LOB execs instead.

In the event that this is not quite enough of a reason to select Power and PowerVM over x86 with VMware, it is worthwhile to consider the security exposure differences. This has been covered already in a prior blog entry comparing Power to x86 servers, but is worthwhile noting again. PowerVM has no known vulnerabilities according to the National Vulnerability Database, http:// nvd.nist.gov. By comparison, a search on that web site for VMware results in 119 hits. Admittedly, this includes older versions as well as workstation versions, but it is clear that hackers have historically found weaknesses to exploit. VMware has introduced vShield with vSphere 5.0, a set of technologies intended to make VMware more secure, but only it would be prudent to wait and see if this closes all holes or opens new ones.

Also covered in the prior blog entry, the security of the hypervisor is only one piece of the equation. Equally, or perhaps more important is the security of the underlying OSs. Likewise, AIX is among the least vulnerable OSs with Linux and Windows having an order of magnitude more vulnerabilities. Also covered in that blog was a discussion about problem isolation, determination and vendor ownership of problems to drive them to successful resolution. With IBM, almost the entire stack is owned by IBM and supported for mission critical computing whereas with x86, the stack is a hodgepodge of vendors with different support agreements, capabilities and views on who should be responsible for a problem often resulting in finger pointing.

There is no question that VMware has tremendous potential for applications that are not mission critical as well as being an excellent fit for many non-production environments. For SAP, the very definition of mission critical, a more robust, more scalable and better secured environment is needed and Power Systems with PowerVM does an excellent job of delivering on these requirements.

Oh, I did not mention cost. With the new memory based pricing model for vSphere 5.0, applications such as SAP, which demand enormous quantities of memory, may easily exceed the new limits for memory pool size forcing the purchase of additional VMware licenses. Those extra license costs and their associated maintenance, can easily add enough cost that the price differences, if there are any, between Power and x86 further close to be almost meaningless.

August 29, 2011 Posted by | Uncategorized | , , , , , , , , , , , , , , , , , , , , , | 5 Comments

IBM Power Systems compared to x86 for SAP landscapes

It seems like every other day, someone asks me to help them justify why a customer should select IBM Power Systems over x86 alternatives for new or existing SAP customers. Here is a short summary of the key attributes that most customers require and the reasons why Power Systems excels or conversely, where x86 systems fall short.

TCO – Total Cost of Ownership is usually at the top of everyone’s list. Often this is confused with TCA or Total Cost of Acquisition. TCA can be very important for some individuals within customer organizations, especially when those individuals are only responsible for capital acquisition costs and not operational costs such as maintenance, power, cooling, floor space, personnel, software and other assorted costs. TCA can also be important when only capital budgets are restricted. For most customers, however, TCO is far more important. Some evaluators compare systems, one for one. While this might seem to make sense, would it be reasonable to compare a pickup truck and an 18-wheeler semi? Obviously not, so, to do a fair job of comparing TCO, a company must look at all aspects, purposes and effects of different choices. For instance, with IBM Power Systems, customers routinely utilize PowerVM, the IBM Power virtualization technology, to combine many different workloads including ERP, CRM, BW, EP, SCM, SRM and other production database and application servers, high availability servers, backup/recovery servers and non-production servers onto a single, small set of servers. While some of this is possible with x86 virtualization technologies, it is rarely done, partly due to “best practices” separation of workloads and also due to support restrictions by some software products, such as Oracle database, when used in a virtualized x86 environment. This typically results in a requirement for many more servers. Likewise, many Power Systems customers routinely drive their utilization to 80% or higher, where the best of x86 virtualization customers rarely drive to even 50% utilization. Taken together, it is very common to see 2 or 3 times the number of systems for x86 customers than for equivalently sized Power Systems customers and I provided only two reasons of the many frequently experienced by SAP customers. So, where an individual Power System might be slightly higher in cost than the equivalent x86 server, full SAP landscapes on Power Systems often require far fewer systems. Between a potentially lower cost of acquisition and the associated lower cost of management, less power, cooling, floor space and often lower cost of third party software, customers can see a significantly lower TCO with IBM Power Systems.

For customers which are approaching the limits on their data centers, either in terms of floor space, power or cooling, x86 horizontal proliferation may drive the need for data center expansion that could cost into the many millions of dollars. Power Systems may help customers to achieve radically higher levels of consolidation through its far more advanced virtualization and much higher scalability thereby potentially avoiding the need for that data center expansion. The savings, in this event, would make the other savings seem trivial by comparison.

Reliability – A system which is low cost but suffers relatively high numbers of outages may not be the best option for mission critical systems such as SAP. IBM Power Systems feature an impressive array of reliability technologies that are not available on any x86 system. This starts with failure detection circuitry which is built into the entire system including the processor chips and is called First Failure Data Capture (FFDC). FFDC has been offered and improved upon since the mid-90’s for Power Systems and its predecessors. This unique technology captures soft and hard errors from within the hardware allowing the service processor, standard with every system, to predict failures which could impact application availability and take preventive action such as dynamically deallocating components from adapter cards to memory and cache lines and even processor cores. Intel, starting with Nehalem-EX, offers Machine Check Architecture Recovery (MCA), their first version of a similar concept. As a first version, it is doubtful that it can approach the much more mature FFDC technology from IBM. Even more important is the “architecture” which, once errors are detected, passes that information, not to a service processor, but to the Operating System or Virtualization Manager with the “option” for that software to fix the problem in the hardware. This is like your car telling you that your braking system has a problem. Even if you have the mechanical ability to run advanced diagnostics, remove and replace parts, bleed the system, etc., this would involve a significant outage and most certainly could not be done on the fly. Likewise, it is extremely doubtful that Microsoft, for instance, is going to invest in software to fix a problem in an Intel processor especially since this area is likely going to change and only addresses one potential area of reliability. Furthermore, does Microsoft actually want to take on responsibility for hardware reliability? This is just one example, of many, that affect uptime, but without which SAP systems can be exposed.

Equally important is what happens if a problem does occur. Unless you are very lucky, you have experienced the Blue Screen of Death at least one or a hundred times in your past. This is one of those wonderful things that can occur when you don’t have a comprehensive reliability architecture such as that with IBM Power Systems. With x86 systems, essentially, the OS reports that a problem has occurred which could be related to the CPU, system hardware, OS, device driver, firmware, memory, application software, adapter cards, etc. and that your best course of action is to remove the last thing you installed and reboot your system. When you call your system vendor, they might suggest that you contact your OS vendor which might suggest you contact your virtualization vendor which might suggest the problem lies in your BIOS and on and on. Who takes responsibility and ownership and drives the problem to resolution? With IBM Power Systems, IBM develops and supports its own CPU, firmware, system hardware, virtualization, device drivers, OS (assuming AIX or i for Business), memory controllers and buffer chips and has a comprehensive set of rules and detection circuitry for third party hardware and software. This means that in the very rare event of an intermittent or hard to identify error occurs, which is not detected and corrected automatically, IBM takes ownership and resolves the problem unless it is determined that a third party piece of hardware or software caused the problem. In that case, IBM works diligently with its partners to resolve which includes IBM personnel that work on site at many of their partner locations such as Oracle and SAP.

Security – Often an afterthought, but potentially an extremely expensive one, should be carefully considered. PowerVM has never been successfully hacked as noted at http://nvd.nist.gov. AIX has approximately 0% of Critical and High Vulnerabilities and 2% of all OS vulnerabilities compared with 73% and 27% for Microsoft, respectively and 16% and 31% for Linux respectively. X-Force report – Mid-year 2010 http://www-935.ibm.com/services/us/iss/xforce/trendreports/ . A successful hack could result in just a personnel inconvenience for the IT staff, the loss of systems and/or in a worst case scenario, the theft of proprietary and/or personal data. SAP systems usually hold the crown jewels of an enterprise customer and should be among the best protected of any customer systems.

Bottom line – Where individual x86 systems may have a lower price tag than the equivalent Power System, full SAP landscapes will often require far fewer systems with Power Systems resulting in a lower TCO. Add to that much better reliability, fault detection, comprehensive problem resolution and ownership and rock solid security and the case for IBM Power Systems for SAP landscapes is pretty overwhelming.

August 15, 2011 Posted by | Uncategorized | , , , , , , , , , , , , , , , , , , , , | 6 Comments