SAPonPower

An ongoing discussion about SAP infrastructure

Large scale-up transactional HANA systems – part 2

Part 1 of this subject detailed the challenges of sizing large scale-up transactional HANA environments.  This part dives into the details and the methodology by which customers may select a vendor in the absence of an independent transactional HANA benchmark.

Past history with large transactional workloads

Before I start down this path, it would be useful to understand why it is relevant.  HANA transaction processing utilizes many of the same techniques as a conventional database.  It accesses rows and, although each column is physically separate, the transaction does not know this and gets all of the data together in one place prior to presenting the results to the dialog calling it.  Likewise, a write must follow ACID properties, including the rule that only one update against a piece of data can occur at any time, which requires that cache coherency mechanisms be employed to enforce it.  And a write must go to a log in addition to the memory location of the data to be changed or updated.  That sounds an awful lot like a conventional DB, which is why past history handling these sorts of transactional workloads makes plenty of sense as a selection criterion.
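To make the write path concrete, here is a minimal sketch of the write-ahead pattern described above: the change is appended to a durable log before the in-memory copy is updated.  This is purely illustrative Python of my own, not HANA's actual implementation.

```python
import json

log = []                         # stands in for the persistent redo log
table = {"row1": {"col_a": 10}}  # stands in for in-memory column data

def update(row: str, col: str, value: int) -> None:
    # The log entry is written first; only then is the in-memory copy changed.
    log.append(json.dumps({"row": row, "col": col, "new": value}))
    table[row][col] = value

update("row1", "col_a", 42)
print(log)    # ['{"row": "row1", "col": "col_a", "new": 42}']
print(table)  # {'row1': {'col_a': 42}}
```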

HPE has a long history with large-scale transactional workloads on Superdome systems, but this was primarily based on Integrity Superdome systems using Itanium processors and HP-UX, not on Intel x86 systems and Linux.  Among the Fortune 100, approximately 20 customers utilized HPE’s systems for their SAP database workloads, almost entirely based on Oracle with HP-UX.  Not bad, and good for second place behind IBM Power Systems, with approximately 40 of the Fortune 100 customers that use SAP.  SGI has exactly 0 of those customers.  Intel x86 systems represent 8 of that customer set, with 2 being on Exadata, which, with its Oracle RAC and highly proprietary storage environment, is not even close to a standard x86 implementation.  Three of the remaining x86 systems are utilized by vendors whose very existence depends on x86, so running on anything else would be contradictory to their mission, and these customers must make the solution work no matter what the expense and complexity might be.  That leaves 3 customers, none of which utilize Superdome X technology for their database systems.  To summarize: IBM Power has a robust set of high-end current SAP transactional customers; HPE has a smaller set, entirely based on a different chip and OS than is offered with Superdome X; SGI has no experience in this space whatsoever; and x86 in general has limited experience, confined to designs that have nothing in common with today’s high-end x86 technology.

Industry Standard Benchmarks

A bit of background.  Benchmarks are lab experiments, open to optimization and exploitation by experts in the area, and have little resemblance to reality.  Unfortunately, they are the only third-party metrics by which systems can be compared.  Benchmarks fall into two general categories: those that are horrible and those that are not horrible (note I did not say good).  Horrible ones sometimes test nothing but the speed of CPUs by placing the entire running code in instruction cache and the entire read-only dataset upon which the code executes in data cache, meaning no network or disk I/O, much less any memory traffic or cache coherency activity.  SPEC benchmarks such as SPECint2006 and SPECint_rate2006 fall into this category.  They are uniquely suited to ccNUMA systems, as there is absolutely no communication between sockets, meaning they represent the best-case scenario for a ccNUMA system.
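Here is a minimal, purely illustrative sketch of the share-nothing pattern such benchmarks reward; this is not SPEC code, and the worker function and sizes are my own assumptions.  Each copy runs against its own tiny, cache-resident dataset, so nothing ever crosses the ccNUMA fabric.

```python
from multiprocessing import Pool

def worker(seed: int) -> int:
    # Tiny, read-only dataset that fits entirely in cache; pure CPU work
    # with nothing shared, so no cache lines migrate between sockets.
    data = list(range(1000))
    return sum((x * seed) % 7 for x in data)

if __name__ == "__main__":
    with Pool(8) as pool:  # one independent copy per core, as SPECint_rate does
        print(sum(pool.map(worker, range(8))))
```

A real transactional database is the opposite of this pattern: shared data, coordinated updates and constant coherency traffic.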

It is therefore revealing that SGI, with 32 sockets and 288 cores, was only able to achieve 11,400 on this ideal-for-ccNUMA benchmark, slightly beating HPE Superdome X’s result of 11,100, also with 288 cores.  By comparison, the IBM Power Systems E880, with only 192 cores, i.e. 2/3 as many, achieved 14,400, i.e. 26% better performance.

In descending order from horrible to not as bad, there are other benchmarks which can be used to compare systems.  The list includes SAP SD 2-tier, SAP BW-EML, TPC-C and SAP SD 3-tier.  Of those, SD 2-tier has the most participation among vendors and includes real SAP code and a real database, but suffers from the database being a tiny percentage of the workload, approximately 6 to 8%, meaning that on ccNUMA systems multiple app servers can be placed on each system board, so that only database communication crosses the ccNUMA fabric, which amounts to a pretty darned fast network.  SGI is a no-show on this benchmark.  HPE did show up with Superdome X at 288 cores and achieved 545,780 SAPS (100,000 users, Ref# 2016002), still the world record.  IBM Power showed up with the E870, an 80-core system (28% of the HPE system’s core count), and achieved 436,100 SAPS (79,750 users, Ref# 2014034), i.e. 80% of the SAPS of the HPE system.  Imagine what IBM would have been able to achieve had they run this almost linearly scalable benchmark on the E880 with 192 cores: probably close to 436,100 * 192/80.  No vendor is allowed to publish the “results” of any extrapolation of an SAP benchmark, but no one can stop a customer from entering those numbers into a calculator.
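For illustration only, here is the back-of-the-envelope arithmetic a customer could run themselves; the near-linear-scaling assumption is mine, and this is emphatically not a published result:

```python
# Hypothetical linear extrapolation of certified SD 2-tier results.
e870_saps = 436_100   # IBM E870, 80 cores (Ref# 2014034)
e870_cores = 80
e880_cores = 192      # hypothetical E880 configuration

extrapolated = e870_saps * e880_cores / e870_cores
print(f"Extrapolated E880 SAPS: {extrapolated:,.0f}")  # ~1,046,640

hpe_saps = 545_780    # HPE Superdome X, 288 cores (Ref# 2016002)
print(f"Ratio vs. Superdome X: {extrapolated / hpe_saps:.2f}x")  # ~1.92x
```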

BW-EML was SAP’s first benchmark designed for HANA, although not restricted to it.  As the name implies, it is a BW benchmark, so it is difficult to derive any correlation to transaction processing, but at least it does show some aspect of performance with HANA, analytic if nothing else, and concurrent analytics is one of the core value propositions of HANA.  HPE was a frequent contributor to this benchmark, but always with something other than Superdome X.  It is important to note that Superdome X is the only Intel-based system to utilize RAS mode, i.e. Intel Lockstep, by default rather than as an option.  That mode has a memory throughput impact of 40% to 60% based on published numbers from a variety of vendors, but, to date, no published benchmarks of any sort have been run in this mode.  As a result, it is impossible to predict how well Superdome X might perform on this benchmark.  Still, kudos to HPE for their past participation; much better than SGI, which is, once again, a no-show on this benchmark.  IBM Power Systems, as you might predict, still holds the record for best performance on this benchmark, with the 40-core E870 system at 2 billion rows.
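To put that throughput penalty in perspective, here is a trivial, hypothetical calculation; the baseline bandwidth figure is an assumption of mine, not a Superdome X measurement, and only the 40% to 60% penalty range comes from the vendor-published numbers cited above:

```python
# Illustrative effect of a 40-60% memory-throughput penalty (the range
# cited above for lockstep/RAS mode) on an assumed baseline bandwidth.
baseline_gbps = 100.0  # hypothetical per-socket memory bandwidth
for penalty in (0.40, 0.60):
    effective = baseline_gbps * (1 - penalty)
    print(f"{penalty:.0%} penalty -> {effective:.0f} GB/s effective")
```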

TPC-C was a transaction processing benchmark that, at least for some time, had good participation, including from HP Superdome.  That is, until IBM embarrassed HPE by delivering 50% more performance with half the number of cores.  After that, HPE never published another Superdome result … and that was back in the 2007/2008 time frame.  TPC-C was certainly not a perfect benchmark, but it did have real transactions with real updates, and about 10% of the benchmark involved remote accesses.  Still, SGI was a no-show, and HPE stopped publishing on this class of system in 2007, while IBM continued publishing through 2010, until there was no one left to challenge their results.  A benchmark is only interesting when multiple vendors are vying for the top spot.

Last, but certainly not least, is the SAP SD 3-tier benchmark.  In this one, the database was kept on a totally separate server, and there was almost no way to optimize away the ccNUMA effects.  Only IBM had the guts to participate in this benchmark at large scale, with a 64-core POWER7+ system (the generation prior to POWER8).  There was no submission from HPE that came even remotely close and, once again, SGI was MIA.

Architecture

Where IBM Power Systems utilizes a “glueless” interconnect up to 16 sockets, meaning all processor chips connect to each other directly, without the use of specialized hub chips or switches, Intel systems beyond 8 sockets utilize a “glued” architecture.  Currently, only HPE and SGI offer solutions beyond 8 sockets.  HPE is using a very old architecture in the Superdome X, first deployed for PA-RISC (remember those?) in the Superdome introduced in 2000.  Back then, they were using a cell controller (a.k.a. hub chip) on each system board.  When they introduced the Itanium processor in 2002, they replaced this hub chip with a new one called the SX1000, basically an ASIC that connected the various components on a system board together and to the central switch through which it communicates with other system boards.  Since 2002, HPE has moved through three generations of ASICs and is now using the SX3000, which features considerably faster speeds, better reliability, some ccNUMA enhancements and connectivity to multiple interconnect switches.  Yes, you read that correctly: where Intel has delivered a new generation of x86 chips just about every year over the last 14 years, HPE has delivered 3 generations of hub chips.  Pace of innovation is clearly tied to volume, and Superdome has never achieved sufficient volume of its own, nor adoption by other vendors, to increase the speed of innovation.  This means that while HPE may have delivered a major step forward at a particular point in time, it suffers from a long lag and diminishing returns as time and Intel chip generations progress.  The important thing to understand is that every remote access, from either of the two Intel EX chips on each system board, to cache, memory or I/O connected to another system board, must traverse a minimum of 8 hops: from the calling socket, to the local SX3000, to the central switch, to the remote SX3000, to the remote socket, plus the same trip in return, and that assumes the data was resident in an on-board cache.
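A minimal sketch, assuming exactly the hop sequence described above (the segment labels are mine, and no latency figures are claimed), of how the round-trip hop count compares between the two designs:

```python
# Sketch of the remote-access path described above for a switched
# ("glued") topology such as Superdome X. Segment names follow the
# text; this counts hops only and asserts nothing about latency.
one_way = [
    "calling socket -> local SX3000",
    "local SX3000 -> central switch",
    "central switch -> remote SX3000",
    "remote SX3000 -> remote socket",
]
print(f"Switched round trip: {2 * len(one_way)} hops")  # 8

# In a glueless design, sockets connect directly, so the same remote
# access is socket -> remote socket and back: 2 hops.
print("Glueless round trip: 2 hops")
```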

SGI, the other player in the beyond-8-socket space, is using a totally different approach, derived from their experience in the HPC space.  They are also using a hub chip, called the HARP ASIC, but rather than connecting through one or more central switches, in the up-to-32-socket UV 300H each system board, featuring 4 Intel EX chips and a proprietary ASIC per memory riser, includes two hub chips which are linked directly to each of the other hub chips in the system.  This mesh is hand-wired, with a separate physical cable for every single connection.  Again, you read that correctly: hand-wired.  This means that not only is a physical connection made for every hub-chip-to-hub-chip link, with the inherent potential for an insertion or contact problem on each end of that wire, but as implementation size increases, say from 8 sockets/2 boards to 16 sockets/4 boards or to 32 sockets/8 boards, the number of physical, hand-wired connections grows quadratically, since an all-to-all mesh of n hubs requires n(n-1)/2 links.  OK, assuming that does not make you just a little bit apprehensive, consider this: where HPE uses a memory protection technology called Double Device Data Correction + 1 (DDDC+1) in their Superdome X system, basically the ability to handle not just a single memory chip failure but at least 2 (not at the same time), SGI utilizes SDDC, i.e. Single Device Data Correction.  This means that after detection of the first failure, customers must rapidly decide whether to shut down the system and replace the failing memory component (assuming it has been accurately identified), or hope their software-based page deallocation technology works fast enough to avert a catastrophic system failure due to a subsequent memory failure.  Even with that software, if a memory fault occurs in a different page, the SGI system would still be exposed.  My personal opinion is that memory protection is so important in any system, but especially in large scale-up HANA systems, that anything short of true enterprise memory protection of at least DDDC does nothing but increase customer risk.
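As an illustration of that quadratic growth (assuming two hub chips per 4-socket board, per the description above; the link counts follow from the all-to-all topology itself, not from SGI documentation):

```python
# Point-to-point links in an all-to-all hub mesh: n hubs need n*(n-1)/2
# cables. Assumes two hub chips per 4-socket board, as described above.
for boards in (2, 4, 8):
    hubs = boards * 2
    links = hubs * (hubs - 1) // 2
    print(f"{boards * 4:2d} sockets / {boards} boards: "
          f"{hubs:2d} hubs, {links:3d} hand-wired links")
```

Going from 2 boards to 8 boards takes the cable count from 6 to 120, a 20x increase for a 4x increase in sockets.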

Summary

SGI is asking customers to accept their assertion that SAP’s certification of the SGI UV 300H at 20TB implies it can scale better than any other platform and perform well at that level, but they are providing no evidence in support of that claim.  SAP does not publish the criteria with which it certifies a solution, so it is possible that SGI has been able to “prove” addressability at 20TB, the ability to initialize a HANA system, and maybe even to handle a moderate number of transactions.  Lacking any sort of independent, auditable proof via a benchmark; lacking any reasonable body of customers (even one would be nice) driving high transaction volumes with HANA or a conventional database; and offering nothing but a 4-bit wide, hand-wired ccNUMA nest that would seem prone to low throughput and high error rates, especially with substandard memory protection, it is hard to imagine why anyone would find this solution appealing.

HPE, by comparison, does have some history with transactional systems at high transaction volumes, but with a completely different CPU, OS and memory architecture, and nothing with Superdome X.  HPE has a few benchmark results, however poor: once again on systems from long ago, plus mediocre results with the current generation and an architecture that requires a minimum of 8 hops round trip for every remote access.  On the positive side, at least HPE gets it regarding proper memory protection, but does not address how much performance degradation results from that protection.  Once again, SAP’s certification of Superdome X at 16TB must be taken with the same grain of salt as SGI’s.

IBM Power Systems has an outstanding history with transactional systems at very high transaction volumes, using current-generation POWER8 systems.  Power also dominates the benchmark space, having continued to deliver better and better results until no competitor dared risk the fight.  Lastly, POWER8 is the latest generation of a chip designed from the ground up with ccNUMA optimization in mind and with reliability as its cornerstone, i.e. the published results already include any overhead necessary to support this level of RAS.  Yes, POWER8 is only supported at 9TB today for SAP SoH and S/4HANA, but lest we forget, it is the new competitor in the HANA market, and the other guys only achieved their higher supported numbers after extensive customer and internal benchmark testing, both of which are underway with Power.

July 7, 2016 | Uncategorized
