SAPonPower

An ongoing discussion about SAP infrastructure

POWER10 – Memory Sharing and how HANA customers will benefit

As an in-memory database, SAP HANA is obviously limited by access to memory.  Having massive CPU throughput with a small amount of memory could be useful for an HPC application that needs to crunch through trillions of operations on a small amount of data.  A HANA system, by comparison, typically scales up CPU and memory together.

Intel attempted to solve this problem through the use of large-capacity persistent DIMMs.  Unfortunately, they delivered a completely unbalanced solution with their Cascade Lake processors, which paired a small incremental performance increase with Optane DIMMs that are 3 to 5 times slower than DDR4 DIMMs (at best).  By the way, the new “Barlow Pass” Optane DIMMs, which will be available with next-gen Cooper Lake and Ice Lake systems, will reportedly deliver only a 15% bandwidth improvement over today’s “Apache Pass” DIMMs.[i]  Allow me to clap with one hand at that yawner of an improvement.  Their solution is somewhat analogous to a transportation problem where a road has 2 lanes and is out of capacity.  You can increase the horsepower of each vehicle a bit and pack far more seats into each vehicle, but it will likely take longer to get all of the passengers in their various vehicles to their destinations, and they will most assuredly be much less comfortable.

IBM, by comparison, attacked this problem with POWER10 by addressing all aspects simultaneously.  As mentioned in part 1, POWER10 sockets have the potential of delivering 3 times the workload of POWER9 sockets.  So, not a small incremental improvement as with Cascade Lake, but a massive one.  Then they increased memory bandwidth by at least 4x, meaning they can keep the CPUs fed with data … and in case memory can’t keep up, they added support for DDR5 and its much faster speeds and throughput.  Then they increased socket-to-socket communication bandwidth by a factor of four, since transactional and analytic workloads like HANA tend to be spread across sockets and often need to access data from another socket.  And just in case the system runs out of DIMM sockets, they introduced a new capability, “memory clustering” or “memory inception”,[ii] which allows memory on another physical system to be accessed remotely (more on this later) with a 50 to 100ns latency hit[iii].  And just to make sure that I/O did not become the next bottleneck, they doubled down on their previous leadership as the first major vendor to support PCIe Gen4 by including support for PCIe Gen5, with the potential for twice the I/O throughput.

Using the previous analogy, IBM attacked the problem by tripling the horsepower of each vehicle, adding lots of extra doors and comfortable seats, quadrupling the number of lanes on the road, and enabling each vehicle to support tandem additions.  In other words, everybody can get to their destination much faster and in great comfort.

So, what is this memory clustering?  Put simply, it is an IBM-developed technology which enables a VM on one system to map memory allocated to it by PowerVM on another system as if it were locally attached.  In this way, a VM which requires more memory than is available on the system on which it is running can be provided with that memory from one or more other systems in the cluster.  It does this through the same PowerAXON technology (IBM’s SMP interconnect) that is used across sockets within each system.  As a result, the projected additional latency is likely to be only slightly higher than accessing memory across the NUMA fabric.
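To put that in perspective, here is a quick back-of-envelope estimate of what the cited 50 to 100ns penalty might do to average access latency.  The local DRAM latency figure and the remote-access fractions are my own illustrative assumptions, not IBM numbers:

```python
# Back-of-envelope: average memory latency when part of a working set is
# reached via memory clustering.  LOCAL_LATENCY_NS and the remote fractions
# are illustrative assumptions; only the 50-100ns penalty comes from the
# figure cited above.
LOCAL_LATENCY_NS = 100        # assumed local DRAM access latency
INCEPTION_PENALTY_NS = 75     # midpoint of the cited 50-100ns range

def avg_latency_ns(remote_fraction):
    remote = LOCAL_LATENCY_NS + INCEPTION_PENALTY_NS
    return (1 - remote_fraction) * LOCAL_LATENCY_NS + remote_fraction * remote

for f in (0.05, 0.10, 0.25):
    print(f"{f:.0%} of accesses remote -> ~{avg_latency_ns(f):.0f} ns average")
```

Even with a quarter of all accesses going over the fabric, this simplified model puts the increase in average latency under 20%, which is why the comparison to a NUMA hop strikes me as credible.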

IBM described multiple potential topologies, ranging from “Enterprise Class” clusters with extreme bandwidth, to hub-and-spoke designs mixing and matching CPU-heavy and memory-heavy nodes, to multi-hop, pod-level clustering of potentially thousands of nodes.  With POWER10 featuring 2 Petabytes of memory addressability, the possibilities are mind boggling.

For HANA workloads, I see a range of possibilities.  The idea that a customer could extend memory across systems is the utopia of “never having to say I’m sorry”.  By that, I mean that in the bad old days (current times, that is), if you purchased, for example, a 2TB system with all DIMM slots populated and your HANA instance needed just a tiny amount more memory than the system provides, you had three choices: 1) let HANA deal with insufficient memory and start moving columns in and out of memory, with all of the associated performance impact that implies, 2) move the workload to a larger system at a substantial cost and a loss of the existing investment (which always brings a smile and a hug from the CFO) or 3) if possible, shut down the instance and the system, rip out all existing DIMMs and replace them with larger ones (even more disruptive and still very expensive).

With memory clustering, you could harvest unused capacity in your cluster at no incremental cost.  Or, if all memory was in use, you could resize a less important workload, e.g. a HANA sandbox VM or a non-prod app server, and reallocate its memory to the production VM that needs more.  Or you could move a less important workload to a different server, potentially in a different data center or on a much smaller system, and reclaim its memory via clustering.  Or you could purchase a small, low-GHz system with only a handful of activated cores but plenty of available memory, and add it to the cluster for the various VMs to use.  The possibilities are endless, but you will notice that having to tell management “I blew it” was not one of the options.
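For anyone who wants to see that decision process spelled out, here is a purely hypothetical sketch: take spare capacity from the cluster first, and shrink the least important VMs only if free memory runs out.  The names, priorities, and data structures are invented for illustration and have nothing to do with PowerVM’s actual interfaces:

```python
# Hypothetical reallocation sketch: satisfy a production VM's extra memory need
# from spare capacity elsewhere in the cluster, shrinking the least important
# donors first.  All names, priorities and structures are invented examples.
from dataclasses import dataclass

@dataclass
class Donor:
    name: str
    priority: int    # lower number = less important, tapped first
    free_gb: int     # memory this donor can give up without harm

def plan_reallocation(need_gb: int, donors: list[Donor]) -> dict[str, int]:
    plan: dict[str, int] = {}
    for d in sorted(donors, key=lambda d: d.priority):
        if need_gb <= 0:
            break
        give = min(need_gb, d.free_gb)
        if give > 0:
            plan[d.name] = give
            need_gb -= give
    if need_gb > 0:
        raise RuntimeError(f"cluster still short by {need_gb} GB")
    return plan

# Example: production HANA needs 256GB more; a spare node and a sandbox donate.
print(plan_reallocation(256, [
    Donor("spare-node", 0, 128),
    Donor("hana-sandbox", 1, 512),
    Donor("np-appserver", 2, 96),
]))
```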

Does this take the place of “Storage Class Memory” (SCM), aka persistent memory?  Not at all.  In fact, POWER10 has explicit support for SCM DIMMs.  The question is more whether SCM technology is ready for HANA.  At 3 to 5 times worse latency than DRAM, Intel’s SCM, Optane, most certainly is not.  I call it highly irresponsible to promote a technology with barely a mention of the likely performance drawbacks, as has been done by Intel and their merry band of misinformation brethren, e.g. HPE, Cisco, Dell, etc.

I prefer IBM’s more measured approach of supporting technology options, encouraging openness and ecosystem innovation, and focusing on delivering real value with solutions that make sense now as opposed to others’ approaches that can lead customers down a path where they will inevitably have to apologize later when things don’t work as promised.  I am also looking forward to 2021 to see what sort of POWER10 systems and related infrastructure options IBM will announce.

 

 

[i] https://www.tomshardware.com/news/intel-barlow-pass-dimm-3200mts-support-15w-tdp
[ii] https://www.crn.com/news/components-peripherals/ibm-power10-cpu-s-memory-inception-is-industry-s-holy-grail-
https://www.servethehome.com/ibm-power10-searching-for-the-holy-grail-of-compute/hot-chips-32-ibm-power10-memory-clustering-enterprise-scale-memory-sharing/
[iii] https://www.hpcwire.com/2020/08/17/ibm-debuts-power10-touts-new-memory-scheme-security-and-inferencing/

 

 

August 19, 2020

POWER10 – Opening the door for new features and radical TCO reduction to SAP HANA customers

As predicted, IBM continued its cycle of introducing a new POWER chip every four years (after a set of incremental or “plus” enhancements, usually after 2 years).[i]  POWER10, revealed today at Hot Chips, has already been described in copious detail by the major chip and technology publications.[ii]  I really want to talk about how I believe this new chip may affect SAP HANA workloads, but please bear with me for a paragraph or so while I totally geek out over the technology.

POWER10, manufactured by Samsung using 7nm lithography, will feature up to 15 cores per chip, up from the current POWER9 maximum of 12 cores (although most systems have been shipping with chips with 11 cores or fewer enabled).  IBM found, as did Intel, that manufacturing maximum-GHz, maximum-core-count chips at ever smaller lithographies is extremely difficult.  The result is an expensive process with inevitable microscopic defects in a core or two, so IBM intentionally designed a 16-core chip that will never ship with more than 15 cores enabled.  This means they can tolerate a reasonable number of manufacturing defects without a substantial impact to chip quality, simply by turning off whichever core does not pass their very stringent quality tests.  This will take the maximum number of cores in all systems up proportionately, based on the number of sockets in each system.  If that were the only improvement, it would be nice, but not earth shattering, as every other chip vendor also usually introduces more or faster cores with each iteration.
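To illustrate why that one spare core matters, here is a toy yield calculation.  The per-core pass probability is a number I invented purely for illustration; IBM has not published yield data:

```python
# Toy yield arithmetic: building a 16-core die but selling at most 15 good
# cores lets a single defective core be disabled rather than scrapping the
# chip.  The per-core pass probability is an invented, illustrative number.
p = 0.97                                        # assumed chance a core is defect-free
all_16_good = p ** 16                           # die would need every core perfect
at_most_one_bad = all_16_good + 16 * p**15 * (1 - p)

print(f"dies with all 16 cores good:   {all_16_good:.1%}")
print(f"dies usable as 15-core parts:  {at_most_one_bad:.1%}")
```

Even in this toy model, the share of usable dies jumps from roughly 61% to about 92%, which is exactly the kind of manufacturing headroom that 16th core provides.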

Of course, POWER10 will have much more than that.  Relative to POWER9, each core will feature 1.5 times the amount of L1 cache, 4x the L2 cache, 4x the TLB, and 2x general and 4x matrix SIMD.  (Sorry, not going to explain each as I am already getting a lot more geeky than my audience typically prefers.)  Each chip will feature over 4 times the memory bandwidth and 4x the interconnect bandwidth.  Of course, if IBM only focused on CPU performance and interconnect and memory bandwidths, then the bottleneck would naturally just be pushed to the next logical points, i.e. memory and I/O.  So, per a long-standing philosophy at IBM, they didn’t ignore these areas and instead built in support for both existing PCIe Gen 4 and emerging PCIe Gen 5, and for both existing DDR4 and emerging DDR5 memory DIMMs, not to mention native support for future GDDR DIMMs.  And if this was not enough, POWER10 includes all sorts of core microarchitecture improvements which I will most definitely not get into here, as most of my friends in the SAP world are probably already shaking their heads trying to understand the implications of the above improvements.

You might think that with all of these enhancements this chip would run so hot that you could heat an Olympic swimming pool with just one system, but thanks to a variety of process improvements, it actually delivers 3x the power efficiency of POWER9.

Current in-memory workloads, like SAP HANA, rarely face a performance issue due to computational performance.  It is reasonable to ask whether this is because these workloads are designed around the current limitations of systems, avoiding more computationally intense operations so as not to deliver unsatisfying performance to users who are increasingly impatient with any delay.  HANA has more or less set the standard for rapid response, so delivering anything else would be seen as a failure.  Is HANA holding back on new but very costly (in computational terms) features that could be unleashed once a sufficiently fast CPU becomes available?  If so, then POWER10 could be the catalyst for unleashing some incredible new HANA capabilities, if SAP takes advantage of this opportunity.

A related issue is that perhaps existing cores can deliver better performance, but memory has not been able to keep pace.  Remember, 128GB DRAM DIMMs are still the maximum size accepted by SAP for HANA, even though 256GB DIMMs have been on the market for some time now.  As SAP has long used internal benchmarks to determine ratios of memory to CPU, could POWER10 enable the use of much denser memory to drive down the number of sockets required to support HANA workloads, thereby decreasing TCO or enabling more server consolidation?  Remember, Power Systems have always featured embedded, hardware/hypervisor-based virtualization, so adding workloads to a system is just a matter of harnessing any unused capacity.
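To make the socket arithmetic concrete, here is a simple memory-capacity-only sketch.  The DIMM slot count per socket is an assumption that varies by system model, and SAP’s CPU-to-memory sizing ratios are deliberately left out:

```python
# Memory-capacity-only sizing sketch: sockets needed just to hold a HANA
# footprint at different DIMM densities.  Slots-per-socket is an assumption
# that varies by system model; SAP's CPU-to-memory ratios are not modeled.
import math

DIMM_SLOTS_PER_SOCKET = 16    # assumed slot count per socket
HANA_SIZE_TB = 8              # example HANA instance footprint

def sockets_for(dimm_gb: int) -> int:
    per_socket_tb = DIMM_SLOTS_PER_SOCKET * dimm_gb / 1024
    return math.ceil(HANA_SIZE_TB / per_socket_tb)

for dimm_gb in (128, 256):
    print(f"{dimm_gb}GB DIMMs -> {sockets_for(dimm_gb)} sockets to hold {HANA_SIZE_TB}TB")
```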

IBM has not released any performance projections for HANA running on POWER10, but it has provided some unrelated number-crunching and AI projections.  Based on those, plus the raw and incredibly impressive improvements in the microarchitecture and number of cores, the dramatic cache and TLB increases, and the gigantic memory and interconnect bandwidth expansions, I predict that each socket will support 2 to 3 times the HANA workload that is possible today (assuming sufficient memory is installed).

In the short term, I expect that customers will be able to use TDI 5 relaxed sizing rules with much larger DIMMs and amounts of memory per POWER10 system to accomplish two goals.  Customers with relatively small HANA systems will be able to cut in half the number of sockets required compared to existing HANA systems.  Customers that currently have large numbers of systems will be able to consolidate them onto many fewer POWER10 systems.  Either way, customers will see dramatic reductions in TCO with POWER10.
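As a rough illustration of the consolidation case, assuming the conservative end of my 2 to 3 times per-socket projection holds, consider the following sketch; the fleet size and socket counts are made-up examples, not sizing guidance:

```python
# Rough consolidation arithmetic using the conservative end of the 2-3x
# per-socket projection above.  Fleet size and socket counts are invented
# examples, not sizing guidance for any real landscape.
import math

existing_systems = 12       # assumed current HANA fleet
sockets_per_system = 4      # assumed size of each current system
per_socket_uplift = 2.0     # conservative end of the 2-3x projection

old_sockets = existing_systems * sockets_per_system
new_sockets = math.ceil(old_sockets / per_socket_uplift)
new_systems = math.ceil(new_sockets / sockets_per_system)

print(f"{old_sockets} current sockets -> ~{new_sockets} POWER10 sockets "
      f"(~{new_systems} systems of the same socket count)")
```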

As to those customers that often run out of memory before they run out of CPU, stay tuned for part 2 of this blog post as we discuss perhaps the most exciting new innovation with POWER10, Memory Clustering.

 

[i] https://twitter.com/IBMPowerSystems/status/1295382644402917376
https://www.linkedin.com/posts/ibm-systems_power10-activity-6701148328098369536-jU9l
https://twitter.com/IBMNews/status/1295361283307581442
[ii] Forbes – https://www.forbes.com/sites/moorinsights/2020/08/17/ibm-discloses-new-power10-processer-at-hot-chips-conference/#12e7a5814b31
VentureBeat – https://venturebeat.com/2020/08/16/ibm-unveils-power10-processor-for-big-data-analytics-and-ai-workloads/
Reuters – https://www.reuters.com/article/us-ibm-samsung-elec/ibm-rolls-out-newest-processor-chip-taps-samsung-for-manufacturing-idUSKCN25D09L
IT Jungle – https://www.itjungle.com/2020/08/17/power-to-the-tenth-power/

 

August 17, 2020