SAPonPower

An ongoing discussion about SAP infrastructure

How to ensure Business Suite on HANA infrastructure is mission critical ready

Companies that plan on running Business Suite on HANA (SoH) require systems that are at least as fault tolerant as their current mission critical database systems.  Actually, the case can be made that these systems have to exceed current reliability design specifications due to the intrinsic conditions of HANA, most notably, but not limited to, extremely large memory sizes.  Other factors that will further exacerbate this include MCOD, MCOS, Virtualization and the new SPS09 feature, Multi-Tenancy.

A customer with 5TB of data in their current uncompressed Suite database will most likely see a reduction due to HANA compression (SAP note 1793345, and the HANA cookbook²) bringing their system size, including HANA work space, to roughly 3TB.  That same customer may have previously been using a database buffer of 100GB +/- 50GB.  At a current buffer size of 100GB, their new HANA system will require 30 times as much memory as the conventional database did.  All else being equal, 30x of any component will result in 30x the failures.  In 2009, Google engineers wrote a white paper in which they noted that 8% of DIMMs experienced errors every year, with most being hard errors, and that when a correctable error occurred in a DIMM, there was a much higher chance that another would occur in that same DIMM, leading, potentially, to uncorrectable errors.¹  As memory technology has not changed much since then, other than getting denser, which could make errors due to cosmic rays and other sources even more likely, the risk has likely not decreased.  As a result, unless companies wish to take chances with their most critical asset, they should elect to use the most reliable memory available.
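To put rough numbers on the scaling argument above, here is a minimal Python sketch.  It assumes (my simplification, not a claim from the Google paper) that DIMM errors are independent and that the 8% annual rate applies equally to every DIMM.  The DIMM counts are hypothetical and chosen only to preserve the 30x ratio.

```python
# Back-of-the-envelope sketch. Assumes independent DIMMs and a flat
# 8% annual per-DIMM error rate; both are simplifying assumptions.

ANNUAL_DIMM_ERROR_RATE = 0.08  # figure cited above from the 2009 Google study


def memory_ratio(hana_gb: float, buffer_gb: float) -> float:
    """How many times more memory HANA needs than the old database buffer."""
    return hana_gb / buffer_gb


def expected_dimms_with_errors(n_dimms: int, rate: float = ANNUAL_DIMM_ERROR_RATE) -> float:
    """Expected number of DIMMs that see at least one error in a year."""
    return n_dimms * rate


def p_at_least_one(n_dimms: int, rate: float = ANNUAL_DIMM_ERROR_RATE) -> float:
    """Probability that at least one DIMM sees an error in a year."""
    return 1.0 - (1.0 - rate) ** n_dimms


if __name__ == "__main__":
    ratio = memory_ratio(3000, 100)   # 3TB HANA system vs. 100GB buffer
    small = 4                         # hypothetical DIMM count, conventional DB
    large = small * int(ratio)        # same DIMM size, 30x the memory
    for n in (small, large):
        print(n, expected_dimms_with_errors(n), round(p_at_least_one(n), 3))
```

Under these assumptions the expected number of affected DIMMs grows linearly with the DIMM count, which is the 30x point made above, while the chance of seeing at least one affected DIMM in a year approaches certainty.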

IBM provides exactly that, the best of breed open systems memory reliability, not as an option at a higher cost, but included with every POWER8 system, from the one and two socket scale-out systems to even more advanced capabilities with the 4 & 8-socket systems, some of which will scale to 16-sockets (announced as a Statement of Direction for 2015).  This memory protection is represented in multiple discrete features that work together to deliver unprecedented reliability.  The following gets into quite a bit of technical detail, so if you don’t have your geek hat on (mine can’t be removed as it was bonded to my head when I was reading Heinlein in 6th grade; yes, I know that dates me), then you may want to jump to the conclusions at the end.

Chipkill – Essentially a RAID-like technology that spreads data and ECC recovery information across multiple memory chips such that in the event of a chip failure, operations may continue without interruption.  Using x8 chips, Chipkill provides Single Device Data Correction (SDDC) and, with x4 chips, Double Device Data Correction (DDDC), due to the way in which data and ECC are spread across more chips simultaneously.

Spare DRAM modules – Each rank of memory (4 ranks per card on scale-out systems, 8 ranks per card on enterprise systems) contains an extra memory chip.  This chip is used to automatically rebuild the data that was held, previously, on the failed chip in the above scenario.  This happens transparently and automatically.  The effect is two-fold:  One, once the recovery is complete, no additional processing is required to perform Chipkill recovery allowing performance to return to pre-failure levels; Two, maintenance may be deferred as desired by the customer as Chipkill can, yet again, allow for uninterrupted operations in the event of a second memory chip failure and, in fact, IBM does not even make a call out for repair until a second chip fails.
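For readers who think better in code, here is a toy Python sketch of the RAID-like idea behind the two features above.  It is emphatically not IBM's Chipkill algorithm: it uses simple XOR parity as a stand-in, it assumes the failed chip's position is already known, and every name in it is hypothetical.  Real ECC must also locate the failing chip.

```python
from functools import reduce

# Toy illustration only: XOR parity stands in for the real ECC code,
# and the position of the failed "chip" is assumed to be known.


def parity(values):
    """XOR all byte values together."""
    return reduce(lambda a, b: a ^ b, values, 0)


def write_word(data_bytes):
    """Spread one byte per data 'chip' and append one parity 'chip'."""
    return list(data_bytes) + [parity(data_bytes)]


def recover(chips, failed_index):
    """Rebuild the value held by a known-failed chip from the survivors."""
    survivors = [v for i, v in enumerate(chips) if i != failed_index]
    return parity(survivors)


def rebuild_onto_spare(chips, failed_index):
    """Return the value that would be copied onto the spare chip."""
    return recover(chips, failed_index)


if __name__ == "__main__":
    word = write_word(b"HANA")       # 4 data chips + 1 parity chip
    spare = rebuild_onto_spare(word, failed_index=2)
    print(chr(spare))                # the byte that lived on the failed chip
```

Once the value sits on the spare, reads no longer need the reconstruction step, which mirrors the point above about performance returning to pre-failure levels.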

Dynamic memory migration and Hypervisor memory mirroring – These are unique technologies only available on IBM’s Enterprise E870 and E880 systems.  In the event that a DIMM experiences errors that cannot be permanently corrected using sparing capability, the DIMM is called out for replacement.  If the ECC is capable of continuing to correct the errors, the callout is known as a predictive callout, indicating the possibility of a future failure.  In such cases, if an E870 or E880 has unlicensed or unassigned DIMMs with sufficient capacity to handle it, logical memory blocks using memory from a predictively failing DIMM will be dynamically migrated to the spare/unused capacity.  When this is successful, the system can continue to operate until the failing DIMM is replaced, without concern as to whether the failing DIMM might cause any future uncorrectable error.  Hypervisor memory mirroring is a selective mirroring technology for the memory used by the hypervisor, which means that even a triple chip failure in a memory DIMM would not affect the operations of the hypervisor as it would simply start using the mirror.

L4 cache – Instead of conventional parity or ECC protected memory buffers used by other vendors, IBM utilizes special eDRAM (a more reliable technology to start with) which not only offers dramatically better performance but includes advanced techniques to delete cache lines for persistent recoverable and non-recoverable fault scenarios as well as to deallocate portions of the cache spanning multiple cache lines.

Extra memory lane – The connection from memory DIMMs or cards is made up of dozens of “lanes” which we can see visually as “pins”.  POWER8 systems feature an extra lane on each POWER8 chip.  In the event of an error, the system will attempt to retry the transfer and use ECC correction, and if the error is determined by the service processor to be a hard error (as opposed to a soft/transient error), the system can deallocate the failing lane and allocate the spare lane to take its place.  As a result, no downtime is incurred and planned maintenance may be scheduled at a time that is convenient for the customer since all lanes, including the “replaced” one, are still fully protected by ECC.

L2 and L3 Caches likewise have an array of protection technology including both cache line delete and cache column repair in addition to ECC and special hardening called “soft latches” which makes these caches less susceptible to soft error events.

As readers of my blog know, I rarely point out only one side of the equation without the other, and in this case, the contrast with existing HANA capable systems could not be more dramatic: the symbol between the two sides is a very big >.  Details to follow.

Intel offers a variety of protection technologies for memory but leaves the decision as to which to employ up to customers.  This ranges from “performance mode” which has the least protection to “RAS mode” which has more protection at the cost of reduced performance.

Let’s start with the exclusives for IBM:  eDRAM L4 cache with its inherent superior protection and performance over conventional memory buffer chips, dynamic memory migration and hypervisor memory mirroring available on IBM Enterprise class servers, none of which are available in any form on x86 servers.  If these were the only advantages for Power Systems, this would already be compelling for mission critical systems, but this is only the start:

Lock step – Intel included similar technology to Chipkill in all of their chips which they call Lock step.  Lock step utilizes two DIMMs behind a single memory buffer chip to store a 64-byte cache line + ECC data instead of the standard single DIMM to provide 1x or 2x 8-bit error detection and 8-bit error correction within a single x8 or x4 DRAM respectively (with x4 modules, this is known as Double Device Data Correction or DDDC and is similar to standard POWER Chipkill with x4 modules).  Lock step is only available in RAS mode, which incurs a penalty relative to performance mode.  Fujitsu released a performance white paper³ describing the results of a memory bandwidth benchmark called STREAM, in which Lock step memory ran at only 57% of the speed of performance mode memory.

Lock step is certainly an improvement over standard or performance mode in that most single device events can be corrected on the fly (and two such events serially for x4 DIMMs), but correction incurs a performance penalty above and beyond that incurred from being in Lock step mode in the first place.  After the first such failure, for x8 DIMMs, the system cannot withstand a second failure in that Lock step pair of DIMMs and a callout for repair (read this as make a planned shutdown as soon as possible) must be made to prevent a second and fatal error.  For x4 DIMMs, assuming the performance penalty is acceptable, the planned shutdown could be postponed to a more convenient time.  Remember, with the POWER spare DRAMs, no such immediate action is required.

Memory sparing – Because taking an emergency shutdown is unacceptable for an SoH system, Lock step memory is insufficient: it handles only the emergency situation but does not eliminate the need for a repair action (as the POWER memory spare does), and it incurs a performance penalty due to having to “lash” together two cards to act as one (as compared to POWER, which achieves superior reliability with a single memory card).  Some x86 systems offer memory sparing in which one rank per memory channel is configured as a spare.  For instance, with the Lenovo System x x3850, each memory channel supports 3 DIMMs or ranks.  If sparing is used, the effective memory throughput of the system is reduced by 1/3 since one of every 3 DIMMs is no longer available for normal operations, and the memory that must be purchased is increased by 50%.  In other words, 1TB of usable memory requires 1.5TB of installed memory.  The downside of sparing is that it is a predictive failure technology, not a reactive one.  According to the IBM X6 Servers: Technical Overview Redbook: “Sparing provides a degree of redundancy in the memory subsystem, but not to the extent of mirroring. In contrast to mirroring, sparing leaves more memory for the operating system. In sparing mode, the trigger for failover is a preset threshold of correctable errors. When this threshold is reached, the content is copied to its spare. The failed rank is then taken offline, and the spare counterpart is activated for use.”  In other words, this works best when you can see it coming, not after a part of the memory has failed.  When I asked a gentleman manning the Lenovo booth at TechEd && d-code about sparing, he first looked at me as if I had a horn sticking out of my head and then replied that almost no one uses this technology.  Now, I think I understand why.
This is a good option, but it comes at a high cost and still falls short of POWER8 memory protection, which is both predictive and reactive and dynamically responds to unforeseen events.  By comparison, memory sparing requires a threshold to be reached and then enough time to be available to complete a full rank copy, even if only a single chip is showing signs of imminent failure.
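The sparing arithmetic above can be checked with a few lines of Python.  This is only a sketch of the ratio math for the one-spare-rank-per-channel case described in the text; the function name and parameters are my own, not anything from Lenovo or Intel documentation.

```python
from fractions import Fraction


def sparing_overhead(ranks_per_channel: int, spare_ranks: int = 1):
    """Return (installed memory per unit of usable memory,
    fraction of ranks unavailable for normal operations)."""
    usable_ranks = ranks_per_channel - spare_ranks
    installed_per_usable = Fraction(ranks_per_channel, usable_ranks)
    fraction_unavailable = Fraction(spare_ranks, ranks_per_channel)
    return installed_per_usable, fraction_unavailable


if __name__ == "__main__":
    installed, lost = sparing_overhead(3)   # 3 ranks per channel, 1 spare
    print(f"1TB usable needs {float(installed)}TB installed")
    print(f"{lost} of the ranks sit idle as spares")
```

With 3 ranks per channel and one spare, the ratio works out to 3/2 installed per usable and 1/3 of the ranks held back, matching the 1.5TB and 1/3 figures above.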

Memory mirroring – This technology utilizes a complete second set of memory channels and DIMMs to maintain a second copy of memory at all times.  This allows for a chip or an entire DIMM to fail with no loss of data as the second copy immediately takes over.  This option, however, does require that you double the amount of memory in the system, utilize plenty of system overhead to keep the pairs synchronized and take away ½ of the memory bandwidth (the other half of which goes to the copy).  This option may perform better than the memory sparing option because reads occur from both copies in an interleaved manner, but writes have to occur to both synchronously.

Conclusions:

Memory mirroring for x86 systems is the closest option to the continuous memory availability that POWER8 delivers.  Of course, having to purchase 2TB of memory in order to have proper protection of 1TB of effective memory adds a significant cost to the system and takes away substantial memory bandwidth.  HANA utilizes memory as few other systems do.

The problem is that x86 vendors won’t tell customers this.  Why?  Now, I can only speculate, but that is why I have a blog.  The x86 market is extremely competitive.  Most customers ask multiple vendors to bid on HANA opportunities.  It would put a vendor at a disadvantage to include this sort of option if the customer has not required it of all vendors.  In turn, x86 vendors don’t want to even insinuate that they might need such additional protection as that would imply a lack of reliability to meet mission critical standards.

So, let’s take this to the next logical step.  If a company is planning on implementing SoH using the above protection, they will need to double their real memory.  Many customers will need 4TB, 8TB or even some in the 12TB to 16TB range with a few even larger.  For the 4TB example, an 8TB system would be required which, as of the writing of this blog post, is not currently certified by SAP.  For the 8TB example, 16TB would be required, which exceeds most x86 vendors’ capabilities.  At 12TB, only two vendors have even announced the intention of building a system to support 24TB and at 16TB, no vendor has currently announced plans to support 32TB of memory.
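The doubling described above is simple enough to tabulate.  The following Python sketch just restates the arithmetic from this paragraph; the function name is hypothetical.

```python
def installed_for_mirroring(usable_tb: int) -> int:
    """Mirroring keeps a full second copy, so installed memory is doubled."""
    return 2 * usable_tb


if __name__ == "__main__":
    for usable in (4, 8, 12, 16):
        installed = installed_for_mirroring(usable)
        print(f"{usable}TB usable -> {installed}TB installed")
```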

Oh, by the way, Fujitsu, in the above referenced white paper, measured the memory throughput of a system with memory mirroring and found it to be 69% that of a performance optimized system.  Remember, HANA demands extreme memory throughput and benchmarks typically use the fastest memory, not necessarily the most reliable, meaning that if sizings are based on benchmarks, they may require adjustment when more reliable memory options are utilized.  Would larger core counts then be required to drive the necessary memory bandwidth?
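As a very rough way to think about that adjustment, here is a Python sketch.  It rests on two untested assumptions of mine: that the workload is purely memory-bandwidth bound, and that the STREAM ratios Fujitsu reported (57% for Lock step, 69% for mirroring) carry over to HANA.  Treat the output as an upper-bound thought experiment, not a sizing rule.

```python
# Relative STREAM bandwidth versus performance mode, as cited in this post.
RELATIVE_BANDWIDTH = {
    "performance": 1.00,
    "mirroring": 0.69,
    "lock step": 0.57,
}


def capacity_scale_factor(mode: str) -> float:
    """How much more bandwidth-delivering capacity would be needed to match
    a benchmark run in performance mode, if the workload were purely
    bandwidth bound (a simplifying assumption)."""
    return 1.0 / RELATIVE_BANDWIDTH[mode]


if __name__ == "__main__":
    for mode in RELATIVE_BANDWIDTH:
        print(f"{mode}: x{capacity_scale_factor(mode):.2f}")
```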

Clearly, until SAP writes new rules to accommodate this necessary technology or vendors run realistic benchmarks showing just how much CPU and memory capacity is needed to support a properly mirrored memory subsystem on an x86 box, customers will be on their own to figure out what to do.

That guesswork will be removed once HANA on Power GAs, as it already includes the mission critical level of memory protection required for SoH and does so without any performance penalty.

Many thanks to Dan Henderson, IBM RAS expert extraordinaire, from whose latest POWER8 RAS whitepaper¹¹ I liberally borrowed some of the more technically accurate sentences in this post, and who reviewed this post to make sure that I properly represented both IBM and non-IBM RAS options.

¹ http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
² https://cookbook.experiencesaphana.com/bw/operating-bw-on-hana/hana-database-administration/monitoring-landscape/memory-usage/
³ http://docs.ts.fujitsu.com/dl.aspx?id=8ff6579c-966c-4bce-8be0-fc7a541b4a02
¹¹ http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=WH&infotype=SA&appname=STGE_PO_PO_USEN&htmlfid=POW03133USEN&attachment=POW03133USEN.PDF#loaded.

November 19, 2014 - Posted by | Uncategorized

2 Comments »

  1. Hi,
    Trying to check availability of SAP HANA scale-out clusters on Power.
    My current info is that scale-out (using GPFS) in active/active mode is only available for BW on SAP HANA, with max of 16 nodes.

    Comment by Sameh Zaghloul | July 25, 2016

    • Hi Sameh, Scale-out is supported with or without GPFS, now called Spectrum Scale, for SAP HANA limited to 16 nodes without any requirement to obtain approval from SAP. If more than 16 nodes are required, customers should contact SAP with that request. SAP Note 2055470 – HANA on POWER Planning and Installation Specifics – Central Note, does not specify that scale-out is only supported with BW however, so one might conclude that this implies that any environment for which scale-out is GA can be supported with HANA on Power. That said, I am reading the SAP Notes just like anyone else and don’t have any inside knowledge on this topic.

      Comment by Alfred Freudenberger | July 26, 2016

