SAPonPower

An ongoing discussion about SAP infrastructure

HANA on Power hits the Trifecta!

Actually, trifecta would imply only 3 big wins at the same time and HANA on Power Systems just hit 4 such big wins.

Win 1 – HANA 2.0 was announced by SAP with availability on Power Systems simultaneously as with Intel based systems.[i]  Previous announcements by SAP had indicated that Power was now on an even footing as Intel for HANA from an application support perspective, however until this announcement, some customers may have still been unconvinced.  I noticed this on occasion when presenting to customers and I made such an assertion and saw a little disbelief on some faces.  This announcement leaves no doubt.

Win 2 – HANA 2.0 is only available on Power Systems with SUSE SLES 12 SP1 in Little Endian (LE) mode.  Why, you might ask, is this a “win”?  Because true database portability is now a reality.  In LE mode, it is possible to pick up a HANA database built on Intel, make no modifications at all, and drop it on a Power box.  This removes a major barrier to customers that might have considered a move but were unwilling to deal with the hassle, time requirements, effort and cost of an export/import.  Of course, the destination will be HANA 2.0, so an upgrade from HANA 1.0 to 2.0 on the source system will be required prior to a move to Power among various other migration options.   This subject will likely be covered in a separate blog post at a later date.  This also means that customers that want to test how HANA will perform on Power compared to an incumbent x86 system will have a far easier time doing such a PoC.

Win 3 – Support for BW on the IBM E850C @ 50GB/core allowing this system to now support 2.4TB.[ii]  The previous limit was 32GB/core meaning a maximum size of 1.5TB.  This is a huge, 56% improvement which means that this, already very competitive platform, has become even stronger.

Win 4 – Saving the best for last, SAP announced support for Suite on HANA (SoH) and S/4HANA of up to 16TB with 144 cores on IBM Power E880 and E880C systems.ii  Several very large customers were already pushing the previous 9TB boundary and/or had run the SAP sizing tools and realized that more than 9TB would be required to move to HANA.  This announcement now puts IBM Power Systems on an even footing with HPE Superdome X.  Only the lame duck SGI UV 300H has support for a larger single image size @ 20TB, but not by much.  Also notice that to get to 16TB, only 144 cores are required for Power which means that there are still 48 cores unused in a potential 192 core systems, i.e. room for growth to a future limit once appropriate KPIs are met.  Consider that the HPE Superdome X requires all 16 sockets to hit 16TB … makes you wonder how they will achieve a higher size prior to a new chip from Intel.

Win 5 – Oops, did I say there were only 4 major wins?  My bad!  Turns out there is a hidden win in the prior announcement, easily overlooked.  Prior to this new, higher memory support, a maximum of 96GB/core was allowed for SoH and S/4HANA workloads.  If one divides 16TB by 144 cores, the new ratio works out to 113.8GB/core or an 18.5% increase.  Let’s do the same for HPE Superdome X.  16 sockets times 24 core/socket = 384 cores.  16TB / 384 cores = 42.7GB/core.  This implies that a POWER8 core can handle 2.7 times the workload of an Intel core for this type of workload.  Back in July, I published a two-part blog post on scaling up large transactional workloads.[iii]  In that post, I noted that transactional workloads access data primarily in rows, not in columns, meaning they traverse columns that are typically spread across many cores and sockets.  Clearly, being able to handle more memory per core and per socket means that less traversing is necessary resulting in a high probability of significantly better performance with HANA on Power compared to competing platforms, especially when one takes into consideration their radically higher ccNUMA latencies and dramatically lower ccNUMA bandwidth.

Taken together, these announcements have catapulted HANA on IBM Power Systems from being an outstanding option for most customers, but with a few annoying restrictions and limits especially for larger customers, to being a best-of-breed option for all customers, even those pushing much higher limits than the typical customer does.

[i] https://launchpad.support.sap.com/#/notes/2235581

[ii] https://launchpad.support.sap.com/#/notes/2188482

[iii] https://saponpower.wordpress.com/2016/07/01/large-scale-up-transactional-hana-systems-part-1/

December 6, 2016 Posted by | Uncategorized | , , , , , , , , , , , , , , , , , , , | 3 Comments

Large scale-up transactional HANA systems – part 2

Part 1 of this subject detailed the challenges when sizing large scale-up transactional HANA environments.  This part will dive into the details and methodology by which customers may select a vendor lacking an independent transactional HANA benchmark.

Past history with large transactional workloads

Before I start down this path, first it would be useful to understand why it is relevant.  HANA transaction processing utilizes many of the same techniques as a conventional database.  It accesses rows, albeit each column is physically separate, the transaction does not know this and gets all of the data together in one place prior to presenting the results to the dialog calling it.  Likewise, a write must follow ACID properties including only one update against a piece of data can occur at any time requiring that cache coherency mechanisms are employed to ensure this.  And a write to a log in addition to the memory location of the data to be changed or updated must occur.  Sounds an awful lot like a conventional DB which is why past history handling these sorts transactional workloads makes plenty of sense.

HPE has a long history with large scale transactional workloads and Superdome systems, but this was primarily based on Integrity Superdome systems using Itanium processors and HP-UX not with Intel x86 systems and Linux.  Among the Fortune 100, approximately 20 customers utilized HPE’s systems for their SAP database workloads almost entirely based on Oracle with HP-UX.  Not bad and coming in second place to IBM Power Systems with approximately 40 of the Fortune 100 customers that use SAP.  SGI has exactly 0 of those customers.  Intel x86 systems represent 8 of that customer set with 2 being on Exadata, not even close to a standard x86 implementation with its Oracle RAC and highly proprietary storage environment.  Three of the remaining x86 systems are utilized by vendors whose very existence is dependent on x86 so running on anything else would be a contradictory to their mission and these customers must make this solution work no matter what the expense and complexity might be.  That leaves 3 customers, none of which utilize Superdome X technology for their database systems.  To summarize, IBM Power has a robust set of high end current SAP transactional customers; HPE a smaller set entirely based on a different chip and OS than is offered with Superdome X; SGI has no experience in this space whatsoever; and x86 in general has limited experience confined to designs that have nothing in common with today’s high end x86 technology.

Industry Standard Benchmarks

A bit of background.  Benchmarks are lab experiments open to optimization and exploitation by experts in the area and have little resemblance to reality.  Unfortunately, it is the only third party metric by which systems can be compared.  Benchmarks fall into two general categories, those that are horrible and those that are not horrible (note I did not say good).  Horrible ones sometimes test nothing but the speed of CPUs by placing the entire running code in instruction cache and the entire read-only dataset upon which the code executes in data cache meaning no network and disk much less any memory I/O or cache coherency.   SPEC benchmarks such as SPECint2006 and SPECint_rate2006 fall into this category.  They are uniquely suited for ccNUMA systems as there is absolutely no communication between any sockets meaning this represents the best case scenario for a ccNUMA system.

It is therefore revealing that SGI, with 32 sockets and 288 cores, was only able to achieve 11,400 on this ideal ccNUMA benchmark, slightly beating HP Superdome X’s result of 11,100, also with 288 cores.  By comparison, the IBM Power Systems E880 with only 192 cores, i.e. 2/3 of the cores, achieved 14,400, i.e. 26% better performance.

In descending order from horrible to not as bad, there are other benchmarks which can be used to compare systems.  The list of benchmarks includes SAP SD 2-tier, SAP BW-EML, TPC-C and SAP 3-tier.  Of those, the SD 2-tier has the most participation among vendors and includes real SAP code and a real database, but suffers from the database being a tiny percentage of the workload, approximately 6 to 8%, meaning on ccNUMA systems, multiple app servers can be placed on each system board resulting in only database communication going across a pretty darned fast network represented by the ccNUMA fabric.  SGI is a no-show on this benchmark.  HPE did show with Superdome X @ 288 cores and achieved 545,780 SAPS (100,000 users, Ref# 2016002), and still the world record holder.  IBM Power showed up with the E870, an 80 core systems (28% of the number of cores as the HPE system) and achieved 436,100 SAPS (79,750 users, Ref# 2014034) (80% of the SAPS of the HPE system).  Imagine what IBM would have been able to achieve with this almost linearly scalable benchmark had they attempted to run it on the E880 with 192 cores (probably close to 436,100 * 192/80 although it is not allowed for any vendor to publish the “results” of any extrapolations of SAP benchmarks but no one can stop a customer from inputting those numbers into a calcuator).

BW-EML was SAP’s first benchmark designed for HANA, although not restricted to it.  As the name implies, it is a BW benchmark, so it is difficult to derive any correlation to transaction processing, but at least it does show some aspect of performance with HANA, analytic if nothing else and concurrent analytics is one of the core value propositions of HANA.  HPE was a frequent contributor to this benchmark, but always with something other than Superdome X.  It is important to note that Superdome X is the only Intel based system to utilize RAS mode or Intel Lockstep, by default, not as an option.  That mode has a memory throughput impact of 40% to 60% based on published numbers from a variety of vendors, but, to date, no published benchmarks, of any sort, have been run in this mode.  As a result, it is impossible to predict how well Superdome X might perform on this benchmark.  Still, kudos to HPE for their past participation.  Much better than SGI which is, once again, a no-show on this benchmark.  IBM Power Systems, as you might predict, still holds the record for best performance on this benchmark with the 40 core E870 system @ 2 Billion rows.

TPC-C was a transaction processing benchmark that, at least for some time period, had good participation, including from HP Superdome.  That is, until IBM embarrassed HPE so much, by delivering 50% more performance with ½ the number of cores.  After this, HPE never published another result on Superdome … and that was back in the 2007/2008 time frame.  TPC-C was certainly not a perfect benchmark, but it did have real transactions with real updates and about 10% of the benchmark involved remote accesses.  Still, SGI was a no-show and HPE stopped publishing on this level of system in 2007 while IBM continued publishing through 2010 until there was no one left to challenge their results.  A benchmark is only interesting when multiple vendors are vying for the top spot.

Last, but certainly not least, is the SAP SD 3-tier benchmark.  In this one, the database was kept on a totally separate server and there was almost no way to optimize it to remove any ccNUMA effects.  Only IBM had the guts to participate in this benchmark at a large scale with a 64-core POWER7+ system (the previous generation to POWER8).  There was no submission from HPE that came even remotely close and, once again, SGI was MIA.

Architecture

Where IBM Power Systems utilizes a “glueless” interconnect up to 16 sockets, meaning all processor chips connect to each other directly, without the use of specialized hub chips or switches, Intel systems beyond 8 sockets utilize a “glued” architecture.  Currently, only HPE and SGI offer solutions beyond 8 sockets.  HPE is using a very old architecture in the Superdome X, first deployed for PA-RISC (remember those) in the Superdome introduced in 2000.  Back then, they were using a cell controller (a.k.a. hub chip) on each system board.  When they introduced the Itanium processor in 2002, they replaced this hub chip with a new one called SX1000; basically an ASIC that connected the various components on the system board together and to the central switch by which it communicats with other system boards.  Since 2002, HPE has moved through three generations of ASICs and now is using the SX3000 which features considerably faster speeds, better reliability, some ccNUMA enhancements and connectivity to multiple interconnect switches.  Yes, you read that correctly; where Intel has delivered a new generation of x86 chips just about every year over the last 14 years, HPE has delivered 3 generations of hub chips.  Pace of innovation is clearly directly tied to volume and Superdome has never achieved sufficient volume alone nor use by other vendors to increase the speed of innovation.  This means that while HPE may have delivered a major step forward at a particular point in time, it suffers from a long lag and diminishing returns as time and Intel chip generations progress.  The important thing to understand is that every remote access, from either of the two Intel EX chips on each system board, to cache, memory or I/O connected to another system board, must pass through 8 hops, at a minimum, i.e. from calling socket, to SX3000, to central switch to remote SX3000, to remote socket and the same trip in return and that is assuming that data was resident in an on-board cache.

SGI, the other player in the beyond 8 socket space, is using a totally different approach, derived from their experience in the HPC space.  They are also using a hub chip, called a HARP ASIC, but rather than connecting through one or more central switches, in the up to 32 socket systems UV 300H system, each system board, featuring 4 Intel EX chips and a proprietary ASIC per memory riser, includes two hub chips which are linked directly to each of the other hub chips in the system.  This mesh is hand wired with a separate physical cable for every single connection.  Again, you read that correctly, hand wired.  This means that not only are physical connections made for every hub chip to hub chip connection with the inherent potential for an insertion or contact problem on each end of that wire, but as implementation size increases, say from 8-sockets/2 boards to 16-sockets/4 boards or to 32-sockets/8 boards, the number of physical, hand wired connections increases exponentially.   OK, assuming that does not make you just a little bit apprehensive, consider this:  Where HPE uses a memory protection technology called Double Device Data Correction + 1 (DDDC+1) in their Superdome X system, basically the ability to handle not just a single memory chip failure but at least 2 (not at the same time), SGI utilizes SDDC, i.e. Single device data correction.  This means that after detection of the first failure, customers must rapidly decide whether to shut down the system and replace the failing memory component (assuming it has been accurately identified), or hope their software based page deallocation technology works fast enough to avert a catastrophic system failure due to a subsequent memory failure.  Even with that software, if a memory fault occurs in a different page, the SGI system would still be exposed.    My personal opinion is that memory protection is so important in any system, but especially in large scale scale-up HANA systems, that anything short of true enterprise memory protection of at least DDDC is doing nothing other than increasing customer risk.

Summary

SGI is asking customers to accept their assertions that SAP’s certification of the SGI UV 300H at 20TB implies they can scale better than any other platform and perform well at that level, but they are providing no evidence in support of that claim.  SAP does not publish the criteria with which is certifies a solution, so it is possible that SGI has been able to “prove” addressability at 20TB, the ability to initialize a HANA system and maybe even to handle a moderate number of transactions.  Lacking any sort of independent, auditable proof via a benchmark, any reasonable body of customers (one would be nice at least) driving high transaction volumes with HANA or a conventional database and anything other than a 4-bit wide, hand wired ccNUMA nest that would seem prone to low throughput and high error rates, especially with substandard memory protection, it is hard to imagine why anyone would find this solution appealing.

HPE, by comparison, does have some history in transactional systems at high transactional volumes with a completely different CPU, OS and memory architecture, but nothing with Superdome X.  HPE has a few benchmarks, however poor, once again on systems from long ago plus mediocre results with the current generation and an architecture that has a minimum of 8-hops round trip for every remote access.  On the positive side, at least HPE gets it regarding proper memory protection, but does not address how much performance degradation results from this protection.  Once again, SAP’s certification at 16TB for Superdome X must be taken with the same grain of salt as SGI’s.

IBM Power Systems has an outstanding history with transactional systems at very high transactional volumes using current generation POWER8 systems.  Power also dominates the benchmark space and continued to deliver better and better results until no competitor dared risk the fight.  Lastly, POWER8 is latest generation of a chip designed from the ground up with ccNUMA optimization in mind and with reliability as its cornerstone, i.e. the results already include any overhead necessary to support this level of RAS.  Yes, POWER8 is only supported at 9TB today for SAP SoH and S/4HANA, but lest we forget, it is the new competitor in the HANA market and the other guys only achieved their higher supported numbers after extensive customer and internal benchmark testing, both of which are underway with Power.

July 7, 2016 Posted by | Uncategorized | , , , , , , , , , , , , , , , | Leave a comment

Large scale-up transactional HANA systems – part 1

Customers which require Suite on HANA (SoH) and S/4HANA systems with 6TB of memory or less will find a wide variety of available options.  Those options do not require any specialized type of hardware, just systems that can scale up to 8 sockets with Intel based systems and up to 64 cores with IBM Power Systems (socket count depends on the number of active cores per socket which varies by system).  If you require 6TB or less or can’t imagine ever needing more, then sizing is a fairly easy process, i.e. look at the sizing matrix from SAP and select a system which meets your needs.  If you need to plan for more than 6TB, this is where it gets a bit more challenging.   The list of options narrows to 5 vendors between 6TB and 8TB, IBM, Fujitsu, HPE, SGI and Lenovo and gets progressively smaller beyond that.

All systems with more than one socket today are ccNUMA, i.e. remote cache, memory and I/O accesses are delivered with more latency and lower bandwidth than local to the processor accesses.   HANA is highly optimized for analytics, which most of you probably already know.  The way it is optimized may not be as obvious.  Most tables in HANA are columnar, i.e. every column in a table is kept in its own structure with its own dictionary and the elements of the column are replaced with a very short dictionary pointer resulting in outstanding compression, in most cases.  Each column is placed in as few memory pages as possible which means that queries which scan through a column can run at crazy fast speeds as all of the data in the column is as “close” as possible to each other.  This columnar structure is beautifully suited for analytics on ccNUMA systems since different columns will typically be placed behind different sockets which means that only queries that cross columns and joins will have to access columns that may not be local to a socket and, even then, usually only the results have to be sent across the ccNUMA fabric.  There was a key word in the previous sentence that might have easily been missed: “analytics”.  Where analytical queries scan down columns, transactional queries typically go across rows in which, due to the structure of a columnar database, every element in located in a different column, potentially spanning across the entire ccNUMA system.  As a result, minimized latency and high cross system bandwidth may be more important than ever.

Let me stop here and give an example so that I don’t lose the readers that aren’t system nerds like myself.  I will use a utility company as an example as everyone is a utility customer.  For analytics, an executive might want to know the average usage of electricity on a given day at a given time meaning the query is composed of three elements, all contained in one table: usage, date and time.    Unless these columns are enormous, i.e. over 2 Billion rows, they are very likely stored behind a single socket with no remote accesses required.  Now, take that same company’s customer care center, where a utility consumer wants to turn on service, report an outage or find out what their last few months or years of bills have been.  In this case, all sorts of information is required to populate the appropriate screens, first name, last name, street address, city, state, meter number, account number, usage, billed amount and on and on.  Scans of columns are not required and a simple index lookup suffices, but every element is located in a different column which has to be resolved by an independent dictionary lookup/replacement of the compressed elements meaning several or several dozen communications across the systems as the columns are most likely distributed across the system.  While an individual remote access may take longer, almost 5x in a worst case scenario[i], we are still talking nanoseconds here and even 100 of those still results in a “delay” of 50 microseconds.  I know, what are you going to do while you are waiting!  Of course, a utility customer is more likely to have hundreds, thousands or tens of thousands of transactions at any given point in time and there is the problem.  An increased latency of 5x for every remote access may severely diminish the scalability of the system.

Does this mean that is not possible to scale-up a HANA transactional environment?  Not at all, but it does take more than being able to physically place a lot of memory in a system to be able to utilize it in a productive manner with good scalability.  How can you evaluate vendor claims then?  Unfortunately, the old tried and true SAP SD benchmark has not been made available to run in HANA environments.  Lacking that, you could flip a coin, believe vendor claims without proof or demand proof.  Clearly, demanding proof is the most reasonable approach, but what proof?  There are three types of proof to look at: past history with large transactional workloads, industry standard benchmarks and architecture.

In the over 8TB HANA space, there are three competitors; HPE Superdome X, SGI UV 300H and IBM Power Systems E870/E880.  I will address those systems and these three proof points in part 2 of this blog post.

[i] http://www.vldb.org/pvldb/vol8/p1442-psaroudakis.pdf

July 1, 2016 Posted by | Uncategorized | , , , , , , , , , , , , , , , | 3 Comments

Update – SAP HANA support for VMware, IBM Power Systems and new customer testimonials

The week before Sapphire, SAP unveiled a number of significant enhancements.  VMware 6.0 is now supported for a production VM (notice the lack of a plural); more on that below.  Hybris Commerce, a number of apps surrounding SoH and S/4HANA is now supported on IBM Power Systems.  Yes, you read that right.  The Holy Grail of SAP, S/4 or more specifically, 1511 FPS 02, is now supported on HoP.  Details, as always, can be found in SAP note:   2218464 – Supported products when running SAP HANA on IBM Power Systems .   The importance of this announcement, or should I say non-announcement as you had to be watching the above SAP note as I do on almost a daily basis because it changes so often, was the only place where this was mentioned.  This is not a dig at SAP as this is their characteristic way of releasing updates on availability to previously suggested intentions and is consistent as this was how they non-announced VMware 6.0 support as well.  Hasso Plattner, various SAP executives and employees, in Sapphire keynotes and other sessions, mentioned support for IBM Power Systems in almost a nonchalant manner, clearly demonstrating that HANA on Power has moved from being a niche product to mainstream.

Also, of note, Pfizer delivered an ASUG session during Sapphire including significant discussion about their use of IBM Power Systems.  I was particularly struck by how Joe Caruso, Director, ERP Technical Architecture at Pfizer described how Pfizer tested a large BW environment on both a single scale-up Power System with 50 cores and 5TB of memory and on a 6-node x86 scale-out cluster (tested on two different vendors, not mentioned in this session but probably not critical as their performance differences were negligible), 60-cores on each node with 1 master node and 4 worker nodes plus a not-standby.  After appropriate tuning, including utilizing table partitioning on all systems, including Power, the results were pretty astounding; both environments performed almost identically, executing Pfizer’s sample set, composed of 75+ queries, in 5.7 seconds, an impressive 6 to 1 performance advantage on a per core basis, not including the hot-standby node.  What makes this incredible is that the official BW-EML benchmark only shows an advantage of 1.8 to 1 vs. the best of breed x86 competitor and another set of BW-EML benchmark results published by another x86 competitor shows scale-out to be only 15% slower than scale-up.  For anyone that has studied the Power architecture, especially POWER8, you probably know that it has intrinsics that suggest it should handle mixed, complex and very large workloads far better than x86, but it takes a customer executing against their real data with their own queries to show what this platform can really do.  Consider benchmarks to be the rough equivalent of a NASCAR race car, with the best of engineering, mechanics, analytics, etc, vs. customer workloads which, in this analogy, involves transporting varied precious cargo in traffic, on the highway and on sub-par road conditions.  Pfizer decided that the performance demonstrated in this PoC was compelling enough to substantiate their decision to implement using IBM Power Systems with an expected go-live later this year.  Also, of interest, Pfizer evaluated the reliability characteristics of Power, based in part on their use of Power Systems for conventional database systems over the past few years, and decided that a hot-standby node for Power was unnecessary, further improving the overall TCO for their BW project.  I had not previously considered this option, but it makes sense considering the rarity of Power Systems to be unable to handle predictable, or even unpredictable faults, without interrupting running workloads.  Add to this, for many, the loss of analytical environments is unlikely to result in significant economic loss.

Also in a Sapphire session, Steve Parker, Director Application Development, Kennametal, shared a very interesting story about their journey to HANA on Power.  Though they encountered quite a few challenges along the way, not the least being that they started down the path to Suite on HANA and S/4HANA prior to it being officially supported by SAP, they found the Power platform to be highly stable and its flexibility was of critical importance to them.  Very impressively, they reduced response times compared to their old database, Oracle, by 60% and reduced the run-time of a critical daily report from 4.5 hours to just 45 minutes, an 83% improvement and month end batch now completes 33% faster than before.  Kennametal was kind enough to participate in a video, available on YouTube at: https://www.youtube.com/watch?v=8sHDBFTBhuk as well as a write up on their experience at: http://www-03.ibm.com/software/businesscasestudies/us/en/gicss67sap?synkey=W626308J29266Y50.

As I mentioned earlier, SAP snuck in a non-announcement about VMware and how a single production VM is now supported with VMware 6.0 in the week prior to Sapphire.  SAP note 2315348 – describes how a customer may support a single SAP HANA VM on VMware vSphere 6 in production.  One might reasonably question why anyone would want to do this.  I will withhold any observations on the mind set of such an individual and instead focus on what is, and is not, possible with this support.  What is not possible: the ability to run multiple production VMs on a system or to mix production and non-prod.  What is possible: the ability to utilize up to 128 virtual processors and 4TB of memory for a production VM, utilize vMotion and DRS for that VM and to deliver DRAMATICALLY worse performance than would be possible with a bare-metal 4TB system.  Why?  Because 128 vps with Hyperthreading enabled (which just about everyone does) utilizes 64 cores.  To support 6TB today, a bare-metal Haswell-EX system with 144 cores is required.  If we extrapolate that requirement to 4TB, 96 cores would be required.  Remember, SAP previously explained a minimum overhead of 12% was observed with a VM vs. bare-metal, i.e. those 64 cores under VMware 6.0 would operate, at best, like 56 cores on bare-metal or 42% less capacity than required for bare-metal.  Add to this the fact that you can’t recover any capacity left over on that system and you are left with a hobbled HANA VM and lots of leftover CPU resources.  So, vMotion is the only thing of real value to be gained?  Isn’t HANA System Replication and a controlled failover a much more viable way of moving from one system to another?  Even if vMotion might be preferred, does vMotion move memory pages from source to target system using the EXACT same layout as was implemented on the source system?  I suspect the answer is no as vMotion is designed to work even if other VMs are currently running on the target system, i.e. it will fill memory pages based on availability, not based on location.  As a result, this would mean that all of the wonderful CPU/memory affinity that HANA so carefully established on the source system would likely be lost with a potentially huge impact on performance.

So, to summarize, this new VMware 6.0 support promises bad performance, incredibly poor utilization in return for the potential to not use System Replication and suffer even more performance degradation upon movement of a VM from one system to another using vMotion.  Sounds awesome but now I understand why no one at the VMware booth at Sapphire was popping Champagne or in a celebratory mood.  (Ok, I just made that up as I did not exactly sit and stare at their booth.)

May 26, 2016 Posted by | Uncategorized | , , , , , , , , | 5 Comments

How to ensure Business Suite on HANA infrastructure is mission critical ready

Companies that plan on running Business Suite on HANA (SoH) require systems that are at least as fault tolerant as their current mission critical database systems.  Actually, the case can be made that these systems have to exceed current reliability design specifications due to the intrinsic conditions of HANA, most notably, but not limited to, extremely large memory sizes.  Other factors that will further exacerbate this include MCOD, MCOS, Virtualization and the new SPS09 feature, Multi-Tenancy.

A customer with 5TB of data in their current uncompressed Suite database will most likely see a reduction due to HANA compression (SAP note 1793345, and the HANA cookbook²) bringing their system size, including HANA work space, to roughly 3TB.  That same customer may have previously been using a database buffer of 100GB +/- 50GB.  At a current buffer size of 100GB, their new HANA system will require 30 times the amount of memory as the conventional database did.  All else being equal, 30x of any component will result in 30x failures.  In 2009, Google engineers wrote a white paper in which they noted that 8% of DIMMS experienced errors every year with most being hard errors and that when a correctable error occurred in a DIMM, there was a much higher chance that another would occur in that same DIMM leading, potentially, to uncorrectable errors.¹  As memory technology has not changed much since then, other than getting denser which could lead to even more likelihood of errors due to cosmic rays and other sources, the risk has likely not decreased.  As a result, unless companies wish to take chances with their most critical asset, they should elect to use the most reliable memory available.

IBM provides exactly that, the best of breed open systems memory reliability, not as an option at a higher cost, but included with every POWER8 system, from the one and two socket scale-out systems to even more advanced capabilities with the 4 & 8-socket systems, some of which will scale to 16-sockets (announced as a Statement of Direction for 2015).  This memory protection is represented in multiple discreet features that work together to deliver unprecedented reliability.  The following gets into quite a bit of technical detail, so if you don’t have your geek hat on, (mine can’t be removed as it was bonded to my head when I was reading Heinlein in 6th grade; yes, I know that dates me), then you may want to jump to the conclusions at the end.

Chipkill – Essentially a RAID like technology that spans data and ECC recovery information across multiple memory chips such that in the event of a chip failure, operations may continue without interruption.   Using x8 chips, Chipkill provides for Single Device Data Correction (SDDC) and with x4 chips, provides Double Device Data Correction (DDDC) due to the way in which data and ECC is spread across more chips simultaneously.

Spare DRAM modules – Each rank of memory (4 ranks per card on scale-out systems, 8 ranks per card on enterprise systems) contains an extra memory chip.  This chip is used to automatically rebuild the data that was held, previously, on the failed chip in the above scenario.  This happens transparently and automatically.  The effect is two-fold:  One, once the recovery is complete, no additional processing is required to perform Chipkill recovery allowing performance to return to pre-failure levels; Two, maintenance may be deferred as desired by the customer as Chipkill can, yet again, allow for uninterrupted operations in the event of a second memory chip failure and, in fact, IBM does not even make a call out for repair until a second chip fails.

Dynamic memory migration and Hypervisor memory mirroring – These are unique technologies only available on IBM’s Enterprise E870 and E880 systems.  In the event that a DIMM experiences errors that cannot be permanently corrected using sparing capability, the DIMM is called out for replacement.  If the ECC is capable of continuing to correct the errors, the call out is known as a predictive callout indicating the possibility of a future failure.  In such cases, if an E870 or E880 has unlicensed or unassigned DIMMS with sufficient capacity to handle it, logical memory blocks using memory from a predictively failing DIMM will be dynamically migrated to the spare/unused capacity. When this is successful this allows the system to continue to operate until the failing DIMM is replaced, without concern as to whether the failing DIMM might cause any future uncorrectable error.  Hypervisor memory mirroring is a selective mirroring technology for the memory used by the hypervisor which means that even a triple chip failure in a memory DIMM would not affect the operations of the hypervisor as it would simply start using the mirror.

L4 cache – Instead of conventional parity or ECC protected memory buffers used by other vendors, IBM utilizes special eDRAM (a more reliable technology to start with) which not only offers dramatically better performance but includes advanced techniques to delete cache lines for persistent recoverable and non-recoverable fault scenarios as well as to deallocate portions of the cache spanning multiple cache lines.

Extra memory lane – the connection from memory DIMMs or cards is made up of dozens of “lanes” which we can see visually as “pins”.  POWER8 systems feature an extra lane on each POWER8 chip.  In the event of an error, the system will attempt to retry the transfer, use ECC correction and if the error is determined by the service processor to be a hard error (as opposed to a soft/transient error), the system can deallocate the failing lane and allocate the spare lane to take its place.  As a result, no downtime in incurred and planned maintenance may be scheduled at a time that is convenient for the customer since all lanes, including the “replaced” one are still fully protected by ECC.

L2 and L3 Caches likewise have an array of protection technology including both cache line delete and cache column repair in addition to ECC and special hardening called “soft latches” which makes these caches less susceptible to soft error events.

As readers of my blog know, I rarely point out only one side of the equation without the other and in this case, the contrast to existing HANA capable systems could not be more dramatic making the symbol between the two sides a very big > symbol; details to follow.

Intel offers a variety of protection technologies for memory but leaves the decision as to which to employ up to customers.  This ranges from “performance mode” which has the least protection to “RAS mode” which has more protection at the cost of reduced performance.

Let’s start with the exclusives for IBM:  eDRAM L4 cache with its inherent superior protection and performance over conventional memory buffer chips, dynamic memory migration and hypervisor memory mirroring available on IBM Enterprise class servers, none of which are available in any form on x86 servers.  If these were the only advantages for Power Systems, this would already be compelling for mission critical systems, but this is only the start:

Lock step – Intel included similar technology to Chipkill in all of their chips which they call Lock step.  Lock step utilizes two DIMMs behind a single memory buffer chip to store a 64-byte cache line + ECC data instead of the standard single DIMM to provide 1x or 2x 8-bit error detection and 8-bit error correction within a single x8 or x4 DRAM respectively (with x4 modules, this is known as Double Device Data Correction or DDDC and is similar to standard POWER Chipkill with x4 modules.)  Lock Step is only available in RAS mode which incurs a penalty relative to performance mode.  Fujitsu released a performance white paper³ in which they described the results of a memory bandwidth benchmark called STREAM in which they described Lock step memory as running at only 57% of the speed of performance mode memory.

Lock step is certainly an improvement over standard or performance mode in that most single device events can be corrected on the fly (and two such events serially for x4 DIMMS) , but correction incurs a performance penalty above and beyond that incurred from being in Lock step mode in the first place.  After the first such failure, for x8 DIMMS, the system cannot withstand a second failure in that Lockstep pair of DIMMS and a callout for repair (read this as make a planned shutdown as soon as possible) be made to prevent a second and fatal error.  For x4 DIMMS, assuming the performance penalty is acceptable, the planned shutdown could be postponed to a more convenient time.  Remember, with the POWER spare DRAMS, no such immediate action is required.

Memory sparing – Since taking an emergency shutdown is unacceptable for a SoH system, Lock Step memory is therefore insufficient since it handles only the emergency situation but does not eliminate the need for a repair action (as the POWER memory spare does) and it incurs a performance penalty due to having to “lash” together two cards to act as one (as compared to POWER that achieves superior reliability with a single memory card).  Some x86 systems offer memory sparing in which one rank per memory channel is configured as a spare.  For instance, with the Lenovo System x x3850, each memory channel supports 3 DIMMs or ranks.  If sparing is used, the effective memory throughput of the system is reduced by 1/3 since one of every 3 DIMMs is no longer available for normal operations and the memory that must be purchased is increased by 50%.  In other words, 1TB of usable memory requires 1.5TB of installed memory.  The downsize of sparing is that it is a predictive failure technology, not a reactive one.  According to the IBM X6 Servers: Technical Overview Redbook-  “Sparing provides a degree of redundancy in the memory subsystem, but not to the extent of mirroring. In contrast to mirroring, sparing leaves more memory for the operating system. In sparing mode, the trigger for failover is a preset threshold of correctable errors. When this threshold is reached, the content is copied to its spare. The failed rank is then taken offline, and the spare counterpart is activated for use.”  In other words, this works best when you can see it coming, not after a part of the memory has failed.    When I asked a gentleman manning the Lenovo booth at TechEd && d-code about sparing, he first looked at me as if I had a horn sticking out of my head and then replied that almost no one uses this technology.  Now, I think I understand why.  This is a good option, but at a high cost and still falls short of POWER8 memory protection which is both predictive and reactive and dynamically responds to unforeseen events.  By comparison, memory sparing requires a threshold to be reached and then enough time to be available to complete a full rank copy, even if only a single chip is showing signs of imminent failure.

Memory mirroring – This technology utilizes a complete second set of memory channels and DIMMs to maintain a second copy of memory at all times.  This allows for a chip or an entire DIMM to fail with no loss of data as the second copy immediately takes over.  This option, however, does require that you double the amount of memory in the system, utilize plenty of system overhead to keep the pairs synchronized and take away ½ of the memory bandwidth (the other half of which goes to the copy).  This option may perform better than the memory sparing option because reads occur from both copies in an interleaved manner, but writes have to occur to both synchronously.

Conclusions:

Memory mirroring for x86 systems is the closest option to the continuous memory availability that POWER8 delivers.  Of course, having to purchase 2TB of memory in order to have proper protection of 1TB of effective memory adds a significant cost to the system and takes away substantial memory bandwidth.  HANA utilizes memory as few other systems do.

The problem is that x86 vendors won’t tell customers this.  Why?  Now, I can only speculate, but that is why I have a blog.  The x86 market is extremely competitive.  Most customers ask multiple vendors to bid on HANA opportunities.  It would put a vendor at a disadvantage to include this sort of option if the customer has not required it of all vendors.  In turn, x86 vendors don’t won’t to even insinuate that they might need such additional protection as that would imply a lack of reliability to meet mission critical standards.

So, let’s take this to the next logical step.  If a company is planning on implementing SoH using the above protection, they will need to double their real memory.  Many customers will need 4TB, 8TB or even some in the 12TB to 16TB range with a few even larger.  For the 4TB example, an 8TB system would be required which, as of the writing of this blog post, is not currently certified by SAP.  For the 8TB example, 16TB would be required which exceeds most x86 vendor’s capabilities.  At 12TB, only two vendors have even announced the intention of building a system to support 24TB and at 16TB, no vendor has currently announced plans to support 32TB of memory.

Oh, by the way, Fujitsu, in the above referenced white paper, measured the memory throughput of a system with memory mirroring and found it to be 69% that of a performance optimized system.  Remember, HANA demands extreme memory throughput and benchmarks typically use the fastest memory, not necessarily the most reliable meaning that if sizings are based on benchmarks, they may require adjustment when more reliable memory options are utilized.  Would larger core counts then be required to drive the necessary memory bandwidth?

Clearly, until SAP writes new rules to accommodate this necessary technology or vendors run realistic benchmarks showing just how much cpu and memory capacity is needed to support a properly mirrored memory subsystem on an x86 box, customers will be on their own to figure out what to do.

That guess work will be removed once HANA on Power GAs as it already includes the mission critical level of memory protection required for SoH and does so without any performance penalty.

Many thanks to Dan Henderson, IBM RAS expert extraordinaire, from whom I liberally borrowed some of the more technically accurate sentences in this post from his latest POWER8 RAS whitepaper¹¹ and who reviewed this post to make sure that I properly represented both IBM and non-IBM RAS options.

¹ http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
² https://cookbook.experiencesaphana.com/bw/operating-bw-on-hana/hana-database-administration/monitoring-landscape/memory-usage/
³ http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CD0QFjAA&url=http%3A%2F%2Fdocs.ts.fujitsu.com%2Fdl.aspx%3Fid%3D8ff6579c-966c-4bce-8be0-fc7a541b4a02&ei=t9VsVIP6GYW7yQTGwIGICQ&usg=AFQjCNHS1fOnd_QAnVV6JjRju9iPlAZkQg&bvm=bv.80120444,d.aWw
¹¹ http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=WH&infotype=SA&appname=STGE_PO_PO_USEN&htmlfid=POW03133USEN&attachment=POW03133USEN.PDF#loaded.

November 19, 2014 Posted by | Uncategorized | , , , , , , , , | 2 Comments