SAPonPower

An ongoing discussion about SAP infrastructure

HANA on Power hits the Trifecta!

Actually, a trifecta would imply only 3 big wins at the same time, and HANA on Power Systems just hit 4 such big wins.

Win 1 – HANA 2.0 was announced by SAP with availability on Power Systems at the same time as on Intel-based systems.[i]  Previous announcements by SAP had indicated that Power was on an even footing with Intel for HANA from an application support perspective; however, until this announcement, some customers may have remained unconvinced.  I noticed this on occasion when presenting to customers: I would make that assertion and see a little disbelief on some faces.  This announcement leaves no doubt.

Win 2 – HANA 2.0 is only available on Power Systems with SUSE SLES 12 SP1 in Little Endian (LE) mode.  Why, you might ask, is this a “win”?  Because true database portability is now a reality.  In LE mode, it is possible to pick up a HANA database built on Intel, make no modifications at all, and drop it on a Power box.  This removes a major barrier for customers that might have considered a move but were unwilling to deal with the hassle, time requirements, effort and cost of an export/import.  Of course, the destination will be HANA 2.0, so an upgrade from HANA 1.0 to 2.0 on the source system will be required prior to a move to Power, among various other migration options.  This subject will likely be covered in a separate blog post at a later date.  This also means that customers that want to test how HANA will perform on Power compared to an incumbent x86 system will have a far easier time running such a PoC.

Win 3 – Support for BW on the IBM E850C @ 50GB/core, allowing this system to now support 2.4TB.[ii]  The previous limit was 32GB/core, meaning a maximum size of 1.5TB.  This huge 56% improvement means that this already very competitive platform has become even stronger.

Win 4 – Saving the best for last, SAP announced support for Suite on HANA (SoH) and S/4HANA of up to 16TB with 144 cores on IBM Power E880 and E880C systems.[ii]  Several very large customers were already pushing the previous 9TB boundary and/or had run the SAP sizing tools and realized that more than 9TB would be required to move to HANA.  This announcement now puts IBM Power Systems on an even footing with HPE Superdome X.  Only the lame-duck SGI UV 300H has support for a larger single image size @ 20TB, but not by much.  Also notice that to get to 16TB, only 144 cores are required for Power, which means that there are still 48 cores unused in a potential 192-core system, i.e. room for growth to a future limit once appropriate KPIs are met.  Consider that the HPE Superdome X requires all 16 sockets to hit 16TB … makes you wonder how they will achieve a higher size prior to a new chip from Intel.

Win 5 – Oops, did I say there were only 4 major wins?  My bad!  Turns out there is a hidden win in the prior announcement, easily overlooked.  Prior to this new, higher memory support, a maximum of 96GB/core was allowed for SoH and S/4HANA workloads.  If one divides 16TB by 144 cores, the new ratio works out to 113.8GB/core, an 18.5% increase.  Let’s do the same for the HPE Superdome X: 16 sockets times 24 cores/socket = 384 cores, and 16TB / 384 cores = 42.7GB/core.  This implies that a POWER8 core can handle 2.7 times the workload of an Intel core for this type of workload.  Back in July, I published a two-part blog post on scaling up large transactional workloads.[iii]  In that post, I noted that transactional workloads access data primarily in rows, not in columns, meaning they traverse columns that are typically spread across many cores and sockets.  Clearly, being able to handle more memory per core and per socket means that less traversing is necessary, resulting in a high probability of significantly better performance with HANA on Power compared to competing platforms, especially when one takes into consideration their radically higher ccNUMA latencies and dramatically lower ccNUMA bandwidth.
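For those who want to check the math, here is the back-of-the-envelope calculation in a few lines of Python; all of the figures come straight from the announcements cited above:

```python
# Back-of-the-envelope math for Win 5; figures from the announcements above.
TB = 1024  # GB per TB

power_gb_per_core = 16 * TB / 144      # E880/E880C: 16TB across 144 cores
hpe_gb_per_core = 16 * TB / (16 * 24)  # Superdome X: 16 sockets x 24 cores

print(f"POWER8: {power_gb_per_core:.1f} GB/core")                       # ~113.8
print(f"  vs old 96 GB/core limit: +{power_gb_per_core / 96 - 1:.1%}")  # ~18.5%
print(f"Superdome X: {hpe_gb_per_core:.1f} GB/core")                    # ~42.7
print(f"POWER8 / Superdome X: {power_gb_per_core / hpe_gb_per_core:.1f}x")
```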

Taken together, these announcements have catapulted HANA on IBM Power Systems from being an outstanding option for most customers, albeit with a few annoying restrictions and limits especially for larger customers, to being a best-of-breed option for all customers, even those pushing much higher limits than the typical customer does.

[i] https://launchpad.support.sap.com/#/notes/2235581

[ii] https://launchpad.support.sap.com/#/notes/2188482

[iii] https://saponpower.wordpress.com/2016/07/01/large-scale-up-transactional-hana-systems-part-1/


December 6, 2016

How to ensure Business Suite on HANA infrastructure is mission critical ready

Companies that plan on running Business Suite on HANA (SoH) require systems that are at least as fault tolerant as their current mission-critical database systems.  Actually, the case can be made that these systems must exceed current reliability design specifications due to the intrinsic characteristics of HANA, most notably, but not limited to, extremely large memory sizes.  Other factors that further exacerbate this include MCOD, MCOS, virtualization and the new SPS09 feature, multi-tenancy.

A customer with 5TB of data in their current uncompressed Suite database will most likely see a reduction due to HANA compression (SAP note 1793345 and the HANA cookbook²), bringing their system size, including HANA work space, to roughly 3TB.  That same customer may have previously been using a database buffer of 100GB +/- 50GB.  At a current buffer size of 100GB, the new HANA system will require 30 times the amount of memory that the conventional database did.  All else being equal, 30x of any component will result in 30x the failures.  In 2009, Google engineers wrote a white paper in which they noted that 8% of DIMMs experienced errors every year, most being hard errors, and that when a correctable error occurred in a DIMM, there was a much higher chance that another would occur in that same DIMM, potentially leading to uncorrectable errors.¹  As memory technology has not changed much since then, other than getting denser, which could make errors from cosmic rays and other sources even more likely, the risk has likely not decreased.  As a result, unless companies wish to take chances with their most critical asset, they should elect to use the most reliable memory available.
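To see why 30x the memory matters, consider a toy reliability model, assuming the roughly 8%-per-DIMM-per-year error rate from the Google study and independent failures (the DIMM counts below are hypothetical, purely for illustration):

```python
# Toy model: probability of at least one DIMM error per year, assuming the
# ~8%/DIMM/year rate from the Google study¹ and independent failures.
# DIMM counts are hypothetical and for illustration only.
P_ERR = 0.08

def p_any_error(n_dimms: int) -> float:
    """Chance that at least one of n DIMMs sees an error in a year."""
    return 1 - (1 - P_ERR) ** n_dimms

print(f"  8 DIMMs (small DB buffer): {p_any_error(8):.0%}")    # ~49%
print(f"240 DIMMs (multi-TB HANA):   {p_any_error(240):.0%}")  # ~100%
```

The absolute numbers are rough, but the direction is unavoidable: at HANA scale, memory errors become a matter of “when”, not “if”.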

IBM provides exactly that: best-of-breed open systems memory reliability, not as an option at a higher cost, but included with every POWER8 system, from the one- and two-socket scale-out systems to even more advanced capabilities with the 4- and 8-socket systems, some of which will scale to 16 sockets (announced as a Statement of Direction for 2015).  This memory protection is delivered through multiple discrete features that work together to provide unprecedented reliability.  The following gets into quite a bit of technical detail, so if you don’t have your geek hat on (mine can’t be removed as it was bonded to my head when I was reading Heinlein in 6th grade; yes, I know that dates me), you may want to jump to the conclusions at the end.

Chipkill – Essentially a RAID-like technology that spans data and ECC recovery information across multiple memory chips such that, in the event of a chip failure, operations may continue without interruption.  Using x8 chips, Chipkill provides Single Device Data Correction (SDDC); with x4 chips, it provides Double Device Data Correction (DDDC) due to the way in which data and ECC are spread across more chips simultaneously.
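The RAID-like idea behind Chipkill can be illustrated with simple XOR parity.  To be clear, this is a toy sketch: real Chipkill uses far stronger symbol-based ECC, not plain XOR.

```python
from functools import reduce

# Toy sketch of the RAID-like idea behind Chipkill: spread data across
# chips plus recovery information, so a whole chip can fail and be rebuilt.
# Real Chipkill uses symbol-based ECC, not this simple XOR parity.
data_chips = [0b10110010, 0b01101100, 0b11100001, 0b00011111]
parity = reduce(lambda a, b: a ^ b, data_chips)

failed = 2  # pretend chip 2 fails outright
survivors = [c for i, c in enumerate(data_chips) if i != failed]
rebuilt = reduce(lambda a, b: a ^ b, survivors, parity)

assert rebuilt == data_chips[failed]
print(f"chip {failed} rebuilt as {rebuilt:#010b}, operations continue")
```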

Spare DRAM modules – Each rank of memory (4 ranks per card on scale-out systems, 8 ranks per card on enterprise systems) contains an extra memory chip.  This chip is used to automatically rebuild the data previously held on the failed chip in the above scenario.  This happens transparently and automatically.  The effect is two-fold: one, once the recovery is complete, no additional processing is required to perform Chipkill recovery, allowing performance to return to pre-failure levels; two, maintenance may be deferred as desired by the customer, since Chipkill can, yet again, allow for uninterrupted operations in the event of a second memory chip failure and, in fact, IBM does not even make a callout for repair until a second chip fails.

Dynamic memory migration and hypervisor memory mirroring – These are unique technologies available only on IBM’s Enterprise E870 and E880 systems.  In the event that a DIMM experiences errors that cannot be permanently corrected using sparing capability, the DIMM is called out for replacement.  If the ECC is capable of continuing to correct the errors, the callout is known as a predictive callout, indicating the possibility of a future failure.  In such cases, if an E870 or E880 has unlicensed or unassigned DIMMs with sufficient capacity, logical memory blocks using memory from the predictively failing DIMM will be dynamically migrated to the spare/unused capacity.  When successful, this allows the system to continue to operate until the failing DIMM is replaced, without concern as to whether the failing DIMM might cause a future uncorrectable error.  Hypervisor memory mirroring is a selective mirroring technology for the memory used by the hypervisor, which means that even a triple chip failure in a memory DIMM would not affect the operations of the hypervisor, as it would simply start using the mirror.

L4 cache – Instead of the conventional parity- or ECC-protected memory buffers used by other vendors, IBM utilizes special eDRAM (a more reliable technology to start with) which not only offers dramatically better performance but also includes advanced techniques to delete cache lines for persistent recoverable and non-recoverable fault scenarios, as well as to deallocate portions of the cache spanning multiple cache lines.

Extra memory lane – The connection to memory DIMMs or cards is made up of dozens of “lanes”, which we can see visually as “pins”.  POWER8 systems feature an extra lane on each POWER8 chip.  In the event of an error, the system will attempt to retry the transfer and use ECC correction; if the error is determined by the service processor to be a hard error (as opposed to a soft/transient error), the system can deallocate the failing lane and allocate the spare lane to take its place.  As a result, no downtime is incurred, and planned maintenance may be scheduled at a time that is convenient for the customer, since all lanes, including the “replaced” one, are still fully protected by ECC.

L2 and L3 caches – These likewise have an array of protection technologies, including both cache line delete and cache column repair in addition to ECC, plus special hardening called “soft latches” which makes these caches less susceptible to soft error events.

As readers of my blog know, I rarely point out one side of the equation without the other, and in this case the contrast with existing HANA-capable x86 systems could not be more dramatic, making the symbol between the two sides a very big “>”; details to follow.

Intel offers a variety of protection technologies for memory but leaves the decision as to which to employ up to customers.  These range from “performance mode”, which has the least protection, to “RAS mode”, which has more protection at the cost of reduced performance.

Let’s start with the IBM exclusives: eDRAM L4 cache, with its inherently superior protection and performance over conventional memory buffer chips, plus dynamic memory migration and hypervisor memory mirroring on IBM Enterprise-class servers, none of which are available in any form on x86 servers.  If these were the only advantages for Power Systems, the case would already be compelling for mission-critical systems, but this is only the start:

Lockstep – Intel includes technology similar to Chipkill in all of its chips, which it calls Lockstep.  Lockstep utilizes two DIMMs behind a single memory buffer chip to store a 64-byte cache line + ECC data, instead of the standard single DIMM, to provide 1x or 2x 8-bit error detection and 8-bit error correction within a single x8 or x4 DRAM respectively (with x4 modules, this is known as Double Device Data Correction, or DDDC, and is similar to standard POWER Chipkill with x4 modules).  Lockstep is only available in RAS mode, which incurs a penalty relative to performance mode.  Fujitsu released a performance white paper³ describing the results of the STREAM memory bandwidth benchmark, in which Lockstep memory ran at only 57% of the speed of performance-mode memory.

Lockstep is certainly an improvement over standard or performance mode in that most single-device events can be corrected on the fly (and two such events serially for x4 DIMMs), but correction incurs a performance penalty above and beyond that incurred from being in Lockstep mode in the first place.  After the first such failure, for x8 DIMMs, the system cannot withstand a second failure in that Lockstep pair of DIMMs, and a callout for repair (read this as: make a planned shutdown as soon as possible) must be made to prevent a second and fatal error.  For x4 DIMMs, assuming the performance penalty is acceptable, the planned shutdown could be postponed to a more convenient time.  Remember, with the POWER spare DRAMs, no such immediate action is required.

Memory sparing – Since taking an emergency shutdown is unacceptable for an SoH system, Lockstep memory is insufficient: it handles only the emergency situation but does not eliminate the need for a repair action (as the POWER memory spare does), and it incurs a performance penalty due to having to “lash” together two cards to act as one (compared to POWER, which achieves superior reliability with a single memory card).  Some x86 systems offer memory sparing, in which one rank per memory channel is configured as a spare.  For instance, with the Lenovo System x x3850, each memory channel supports 3 DIMMs, or ranks.  If sparing is used, the effective memory capacity of the system is reduced by 1/3, since one of every 3 DIMMs is no longer available for normal operations, and the memory that must be purchased is increased by 50%.  In other words, 1TB of usable memory requires 1.5TB of installed memory.  The downside of sparing is that it is a predictive failure technology, not a reactive one.  According to the IBM X6 Servers: Technical Overview Redbook: “Sparing provides a degree of redundancy in the memory subsystem, but not to the extent of mirroring. In contrast to mirroring, sparing leaves more memory for the operating system. In sparing mode, the trigger for failover is a preset threshold of correctable errors. When this threshold is reached, the content is copied to its spare. The failed rank is then taken offline, and the spare counterpart is activated for use.”  In other words, this works best when you can see it coming, not after a part of the memory has failed.  When I asked a gentleman manning the Lenovo booth at TechEd && d-code about sparing, he first looked at me as if I had a horn sticking out of my head and then replied that almost no one uses this technology.  Now, I think I understand why.  This is a good option, but at a high cost, and it still falls short of POWER8 memory protection, which is both predictive and reactive and dynamically responds to unforeseen events.  By comparison, memory sparing requires a threshold to be reached and then enough time to complete a full rank copy, even if only a single chip is showing signs of imminent failure.
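The failover behavior described in that Redbook quote amounts to a threshold counter, which is easy to sketch (a simplified model; the threshold value and names here are made up for illustration):

```python
# Simplified sketch of x86 rank sparing: a preset correctable-error
# threshold triggers a copy to the spare rank. The threshold value and
# class names are hypothetical.
THRESHOLD = 100  # hypothetical preset correctable-error threshold

class Rank:
    def __init__(self, name: str):
        self.name = name
        self.errors = 0
        self.online = True

def on_correctable_error(rank: Rank, spare: Rank) -> None:
    rank.errors += 1
    if rank.online and rank.errors >= THRESHOLD:
        print(f"threshold reached: copying {rank.name} to {spare.name}")
        rank.online = False  # failing rank goes offline, spare takes over

rank0, spare = Rank("rank0"), Rank("spare")
for _ in range(THRESHOLD):
    on_correctable_error(rank0, spare)
```

Note what the sketch cannot do: if a hard failure arrives before the threshold is reached, or before the rank copy completes, sparing offers no help, which is exactly the predictive-not-reactive limitation described above.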

Memory mirroring – This technology utilizes a complete second set of memory channels and DIMMs to maintain a second copy of memory at all times.  This allows a chip or even an entire DIMM to fail with no loss of data, as the second copy immediately takes over.  This option, however, requires doubling the amount of memory in the system, consumes plenty of system overhead to keep the pairs synchronized, and takes away ½ of the memory bandwidth (the other half of which goes to the copy).  This option may perform better than memory sparing because reads occur from both copies in an interleaved manner, but writes have to occur to both synchronously.
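A toy model makes the read/write asymmetry of mirroring visible (again, an illustration, not any vendor’s actual implementation):

```python
# Toy model of memory mirroring: every write must go to both copies
# synchronously; reads can be interleaved across the two copies.
class MirroredMemory:
    def __init__(self, size: int):
        self.primary = bytearray(size)
        self.mirror = bytearray(size)
        self._toggle = False

    def write(self, addr: int, value: int) -> None:
        self.primary[addr] = value  # both copies updated on every write:
        self.mirror[addr] = value   # half the write bandwidth feeds the copy

    def read(self, addr: int) -> int:
        self._toggle = not self._toggle  # alternate reads between copies
        return (self.primary if self._toggle else self.mirror)[addr]

mem = MirroredMemory(64)
mem.write(7, 42)
assert mem.read(7) == mem.read(7) == 42  # either copy serves the read
```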

Conclusions:

Memory mirroring for x86 systems is the closest option to the continuous memory availability that POWER8 delivers.  Of course, having to purchase 2TB of memory in order to properly protect 1TB of effective memory adds a significant cost to the system and takes away substantial memory bandwidth, and HANA stresses memory as few other applications do.

The problem is that x86 vendors won’t tell customers this.  Why?  I can only speculate, but that is why I have a blog.  The x86 market is extremely competitive.  Most customers ask multiple vendors to bid on HANA opportunities.  It would put a vendor at a disadvantage to include this sort of option if the customer has not required it of all vendors.  In turn, x86 vendors don’t want to even insinuate that they might need such additional protection, as that would imply a lack of reliability to meet mission-critical standards.

So, let’s take this to the next logical step.  If a company is planning on implementing SoH with the above protection, it will need to double its real memory.  Many customers will need 4TB, 8TB or even 12TB to 16TB, with a few even larger.  For the 4TB example, an 8TB system would be required, which, as of the writing of this blog post, is not currently certified by SAP.  For the 8TB example, 16TB would be required, which exceeds most x86 vendors’ capabilities.  At 12TB, only two vendors have even announced the intention of building a system to support 24TB, and at 16TB, no vendor has announced plans to support 32TB of memory.

Oh, by the way, Fujitsu, in the above referenced white paper, measured the memory throughput of a system with memory mirroring and found it to be 69% of that of a performance-optimized system.  Remember, HANA demands extreme memory throughput, and benchmarks typically use the fastest memory, not necessarily the most reliable, meaning that if sizings are based on benchmarks, they may require adjustment when more reliable memory options are utilized.  Would larger core counts then be required to drive the necessary memory bandwidth?
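If a sizing was derived from a benchmark run in performance mode, a first-order adjustment for mirrored memory might look like this (the starting core count is hypothetical, and linear scaling of bandwidth with cores is an assumption, not a measured fact):

```python
# First-order sizing adjustment: benchmark assumed performance-mode memory,
# production uses mirroring at ~69% of that throughput (Fujitsu STREAM³).
# The starting core count is hypothetical; linear scaling is assumed.
benchmark_sized_cores = 96
mirrored_factor = 0.69

adjusted_cores = benchmark_sized_cores / mirrored_factor
print(f"~{adjusted_cores:.0f} cores for the same effective memory bandwidth")
# ~139 cores, i.e. roughly 45% more capacity than the benchmark suggested
```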

Clearly, until SAP writes new rules to accommodate this necessary technology, or vendors run realistic benchmarks showing just how much CPU and memory capacity is needed to support a properly mirrored memory subsystem on an x86 box, customers will be on their own to figure out what to do.

That guesswork will be removed once HANA on Power GAs, as it already includes the mission-critical level of memory protection required for SoH and does so without any performance penalty.

Many thanks to Dan Henderson, IBM RAS expert extraordinaire, from whose latest POWER8 RAS whitepaper¹¹ I liberally borrowed some of the more technically accurate sentences in this post, and who reviewed this post to make sure that I properly represented both IBM and non-IBM RAS options.

¹ http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
² https://cookbook.experiencesaphana.com/bw/operating-bw-on-hana/hana-database-administration/monitoring-landscape/memory-usage/
³ http://docs.ts.fujitsu.com/dl.aspx?id=8ff6579c-966c-4bce-8be0-fc7a541b4a02
¹¹ http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=WH&infotype=SA&appname=STGE_PO_PO_USEN&htmlfid=POW03133USEN&attachment=POW03133USEN.PDF#loaded.

November 19, 2014

The top 3 things that SAP needs are memory, memory and I can’t remember the third. :-) A review of the IBM Power Systems announcements with a focus on the memory enhancements.

While this might not exactly be new news, it is worthwhile to consider the value of the latest Power Systems announcements for SAP workloads.  On October 12, 2011, IBM released a wide range of enhancements to the Power Systems family.  The ones that might have received the most publicity, not to mention new model numbers, were valuable but not, from my point of view, the most important part of the announcement.  Yes, the new higher-MHz Power 770 and 780, and the ability to order a 780 with 2 chips per socket allowing the system to grow to 96 cores, were certainly very welcome additions to the family.  Especially nice was that the 3.3 GHz processors in the new MMC model of the 770 came in at the same price as the 3.1 GHz processors in the previous MMB model.  So, 6.5% more performance at no additional cost.

For SAP, however, raw performance often plays second fiddle to memory.  The old rule is that for SAP workloads, we run out of memory long before we run out of CPU.  IBM started to address this issue in 2010 with the announcement of the Active Memory Expansion (AME) feature of POWER7 systems.  This feature allows for dynamic compression/decompression of memory pages, thereby making memory appear to be larger than it really is.  The administrator of a system can select a target “expansion” and the system will then build a “compressed” pool in memory into which pages are compressed and placed, starting with the least frequently accessed pages and moving toward the more frequently accessed ones.  As pages are touched, they are decompressed and moved into the regular memory pool, from which they are accessed normally.  Applications run unchanged, as AIX performs all of the moves without any interaction or awareness required by the application.  The point at which response time or throughput degrades, or a large amount of CPU overhead starts to occur, is the “knee of the curve”; the expansion target should be set slightly below that point.  A tool called AMEPAT allows the administrator to “model” the workload prior to turning AME on, or for that matter on older hardware, as long as the OS level is AIX 6.1 TL4 SP2 or later.
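Before looking at workload results, it helps to pin down the arithmetic: an expansion of E% makes real memory appear (1 + E/100) times larger.  A two-line calculator (just the math, not the AMEPAT tool itself):

```python
# AME expansion arithmetic: E% expansion makes real memory appear
# (1 + E/100) times larger to the application.
def effective_gb(real_gb: float, expansion_pct: float) -> float:
    return real_gb * (1 + expansion_pct / 100)

print(effective_gb(10, 111))  # 21.1 -> 10GB of real memory looks like ~21GB
print(effective_gb(10, 160))  # 26.0 -> the retail BW example below
```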

Some workloads will see more benefit than others.  For instance, during internal tests run by IBM, the 2-tier SD benchmark showed outstanding opportunities for compression and hit 111% expansion, e.g. 10GB of real memory appears to be 21GB to the application, before response time or throughput showed any negative effect from the compression/decompression activity.  During testing of a retail BW workload, 160% expansion was reached.  Even database workloads tend to benefit from AME.  DB2 databases, which already feature outstanding compression, have seen another 30% or 40% expansion.  The reason for this difference comes from the different approaches to compression.  In DB2, if 1,000 residences or businesses have an address on Main Street, Austin, Texas (had to pick a city, so I selected my own), DB2 replaces Main Street, Austin, Texas in each row with a pointer to another table that has a single row entitled Main Street, Austin, Texas.  AME, by comparison, is more of an inline compression, e.g. if it sees a repeating pattern, it can replace that pattern with a symbol that represents the pattern and how often it repeats.  Oracle recently announced that it will also support AME.  The amount of expansion with AME will likely vary from something close to DB2’s if Oracle Advanced Compression is used, to significantly higher if Advanced Compression is not used, since many more opportunities for compression will exist.
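The contrast between the two compression styles can be sketched in a few lines, using zlib as a stand-in for AME’s pattern compression (a toy contrast, not the actual DB2 or AME algorithms):

```python
import zlib

# Dictionary style (DB2-like): repeated values are replaced by small keys
# pointing at a single stored copy.
rows = ["Main Street, Austin, Texas"] * 1000
dictionary = {0: "Main Street, Austin, Texas"}
encoded = [0] * len(rows)  # each row now holds a tiny key, not the string

# Inline style (AME-like): repeating byte patterns within a page are
# collapsed; zlib stands in for the real pattern compressor here.
page = b"ABABABAB" * 512   # a 4KB page with an obvious repeating pattern
packed = zlib.compress(page)

print(f"dictionary: {len(rows)} copies -> 1 stored string + {len(encoded)} keys")
print(f"inline: {len(page)} bytes -> {len(packed)} bytes")
```

A column of identical values is the ideal case for the dictionary approach, which is why a database that already compresses this way leaves AME less to squeeze out.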

So, AME can help SAP workloads close the capacity gap between memory and CPU.  Another way to view this is that this technology can decrease the cost of Power Systems by either allowing customers to purchase less memory or to place more workloads on the same system, thereby driving up utilization and decreasing the cost per workload.  It is worthwhile to note that many x86 systems have also tried to address this gap, but as none offer anything even remotely close to AME, they have instead resorted to more DIMM slots.  While this is a reasonable solution, it should be noted that twice the number of DIMMs requires twice the power and cooling and suffers twice the failures, i.e. TANSTAAFL: there ain’t no such thing as a free lunch.

In the latest announcements, IBM introduced support for new 32GB DIMMs.  This effectively doubled the maximum memory on most models, from the 710 through the 795.  Combined with AME, this decreases or eliminates the gap between memory capacity and CPU, and makes these models even more cost effective since more workloads can share the same hardware.  Two other systems received similar enhancements recently, though these were not part of the formal announcement.  The two latest blades in the Power Systems portfolio, the PS703 and the PS704, were announced earlier this year with twice the number of cores but the same memory as the PS701 and PS702 respectively.  Now, using 16GB DIMMs, the PS703/PS704 can support up to 256GB/512GB of memory, making these blades very respectable, especially for application server workloads.  Add to that, with the Systems Director Management Console (SDMC), AME can be implemented for blades, allowing for even more effective memory per blade.  Combined, these enhancements have closed the price difference even further compared to similar x86 blades.

One last memory-related announcement may have been largely overlooked because it involved an enhancement to the Active Memory Sharing (AMS) feature of PowerVM.  AMS has historically been a technology that allowed for overcommitment of memory.  While CPU overcommitment is now routine, memory overcommitment means that some percentage of memory pages will have to be paged out to solid state or other types of disk.  The performance penalty is well understood, making this inappropriate for production workloads but potentially beneficial for many non-prod, HA or DR workloads.  That said, few SAP customers have implemented this technology due to the complexity and performance variability that can result.  The new announcement introduces Active Memory™ Deduplication for AMS implementations.  Using this new technology, PowerVM will scan partitions after they finish booting and locate identical pages within and across all partitions on the system.  When identical pages are detected, all copies except one will be removed and all memory references will point to the same “first copy” of the page; a sketch of the idea follows below.  Since PowerVM is doing this, even the OSs can be unaware of this action.  Instead, as this post-processing proceeds, the PowerVM free memory counter will increase until a steady state has been reached.  Once enough memory is freed up in this manner, new partitions may be started.  It is quite easy to imagine that a large number of pages are duplicates, e.g. each instance of an OS has many read-only pages which are identical, and multiple instances of an application, e.g. SAP app servers, will likewise have identical executable pages.  The expectation is that another 30% to 40% effective memory expansion will occur for many workloads using this new technology.  One caveat, however: since the scan occurs after a partition boots, it will be operationally important to use a phased boot schedule that allows the dedupe process to free up pages prior to starting more partitions, thereby avoiding the possibility of paging.  Early testing suggests that the dedupe process should arrive at a steady state approximately 20 minutes after partitions are booted.
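The scan-and-collapse idea can be sketched with content hashes (a toy model; PowerVM does this at the firmware level, far more carefully than this):

```python
import hashlib

# Toy sketch of Active Memory Deduplication: scan pages across partitions,
# keep one physical copy per unique content, map duplicates to that copy.
PAGE = 4096
partitions = {
    "lpar1": [b"\x00" * PAGE, b"shared-ro-code".ljust(PAGE, b"\x00")],
    "lpar2": [b"\x00" * PAGE, b"shared-ro-code".ljust(PAGE, b"\x00")],
}

physical = {}  # content hash -> the single physical copy kept
page_map = {}  # (partition, page index) -> content hash

for lpar, pages in partitions.items():
    for i, page in enumerate(pages):
        digest = hashlib.sha256(page).hexdigest()
        physical.setdefault(digest, page)  # first copy wins
        page_map[(lpar, i)] = digest       # duplicates point at it

total_pages = sum(len(p) for p in partitions.values())
print(f"{total_pages} logical pages -> {len(physical)} physical copies")
```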

The bottom line is that with larger DIMMs, AME and AMS Memory Deduplication, IBM Power Systems are in a great position to allow customers to fully exploit the CPU power of these systems by combining even more workloads on fewer servers.  This will effectively drive down TCA for customers and remove what little price difference there might be between Power Systems and systems from various x86 vendors.

November 29, 2011