SAPonPower

An ongoing discussion about SAP infrastructure

POWER10 – Memory Sharing and how HANA customers will benefit

As an in-memory database, SAP HANA is obviously limited by access to memory.  Having massive CPU throughput with a small amount of memory could be useful for a an HPC application that needs to crunch through trillions of operations on a small amount of data. By comparison, a HANA system typically scales up with both CPU and memory at the same time.

Intel attempted to solve this problem through the use of large scale persistent DIMMs.  Unfortunately, they delivered a completely unbalanced solution with their Cascade Lake processors that included a small incremental performance increase coupled with Optane DIMMs which are 3 to 5 times slower than DDR4 DIMMs (at best).  By the way, the new “Barlow Pass” Optane DIMMs, that will be available with next gen Copper Lake and Ice Lake systems, will reportedly only deliver 15% bandwidth improvement over today’s “Apache Pass” DIMMs.[i]  Allow me to clap with one hand at that yawner of an improvement.  Their solution is somewhat analogous to a transportation problem where a road has 2 lanes and is out of capacity.  You can increase the horsepower of each vehicle a bit and pack in far more seats in each vehicle, but it will likely to take longer to get all of the various passengers in different vehicles to their destination and they will most assuredly be much more uncomfortable.

IBM with POWER10, but comparison, attacked this problem by addressing all aspects simultaneously.  As mentioned in part 1, POWER10 sockets have the potential of delivering 3 times the workload of POWER9 sockets.  So, not a small incremental improvement as with Cascade Lake, but a massive one.  Then they increased the bandwidth to memory by at least 4x, meaning they can keep the CPUs fed with data … and in case memory can’t keep up, they added support for DDR5 and its much faster speeds and throughput.  Then they increased socket to socket communications bandwidth by a factor of four since transactional and analytic workloads, like HANA, tend to be spread across sockets and often need to access data from another socket.  And just in case the system runs out of DIMM sockets, they introduced a new capability, “memory clustering” or “memory inception”[ii] which allows memory on another physical system to be accessed remotely (more on this later) with a 50 to 100ns latency hit[iii].  And just to make sure that I/O did not become the next bottleneck, they have doubled down on their previous leadership with being the first major vendor to support PCIe Gen4 by including support for PCIe Gen5 with a potential for twice the I/O throughput.

Using the previous analogy, IBM attacked the problem by tripling the horsepower of each vehicle with lots of extra doors and comfortable seats, quadrupling the number of lanes on the road and enabled each vehicle to support tandem additions.  In other words, everybody can get to their destinations much faster and in great comfort.

So, what is this memory clustering?  Put simply, it is an IBM developed technology which enables a VM on one system to map memory allocated to it by PowerVM on another system as if it was locally attached.  In this way, a VM which requires more memory than is available on the system upon which it is running can be provided with that memory from one or more systems in the cluster.  It does this through the same PowerAXON (IBM’s SMP interconnect technology) as is used within each system across sockets.  As a result, the projected additional latency is likely to be only slightly higher than accessing memory across the NUMA fabric.

IBM described multiple different potential topologies, ranging from “Enterprise Class” at extreme bandwidth, to hub and spoke; mixing and matching CPU heavy with Memory heavy nodes to even multi-hop pod-level clustering of potentially thousands of nodes.  With POWER10 featuring a 2 Petabyte memory addressability, the possibilities are mind boggling.

For HANA workloads, I see a range of possibilities.  The idea that a customer could extend memory across systems is the utopia of “never having to say I’m sorry”.  By that, I mean that in the bad old days (current times that is), if you purchased, for example, a 2TB system with all DIMM slots used, and your HANA instance needed a tiny amount more memory than available on the system, you had three choices: 1) let HANA deal with insufficient memory and start moving columns in and out of memory with all of the associated performance impact implied, 2) move the workload to a larger system at a substantial cost and a loss of the existing investment (which always brings a smile and hug from the CFO) or 3) if possible, shut down the instance and the system, rip out all existing DIMMs and replace them with larger ones (even more disruptive and still very expensive).

With memory clustering, you could harvest unused capacity in your cluster at no incremental cost.  Or, if all memory was in use, you could resize a less important workload, e.g. a HANA sandbox VM or non-prod app server, and reallocate it to the production VM requiring more memory.  Or you could move a less important workload to a different server potentially in a different data center or perhaps a much smaller system and reuse the memory using clustering.  Or you could purchase a small, low GHz, small number of activated cores system to add to the cluster with plenty of available memory to be used by the various VMs in the cluster.  The possibilities are endless, but you will notice, having to say to management that “I blew it” was not one of the options.

Does this take the place of “Storage Class Memory” (SCM) aka persistent memory?  Not at all.  In fact, POWER10 has explicit support for SCM DIMMs.  The question is more of whether SCM technology is ready for HANA.  At 3 to 5 times worse latency than DRAM, Intel’s SCM, Optane, most certainly is not.  In fact, I call it highly irresponsible to promote a technology with barely a mention of the likely performance drawbacks as has been done by Intel and their merry band of misinformation brethren, e.g. HPE, Cisco, Dell, etc.

I prefer IBM’s more measured approach of supporting technology options, encouraging openness and ecosystem innovation, and focusing on delivering real value with solutions that make sense now as opposed to others’ approaches that can lead customers down a path where they will inevitably have to apologize later when things don’t work as promised.  I am also looking forward to 2021 to see what sort of POWER10 systems and related infrastructure options IBM will announce.

 

 

[i] https://www.tomshardware.com/news/intel-barlow-pass-dimm-3200mts-support-15w-tdp
[ii] https://www.crn.com/news/components-peripherals/ibm-power10-cpu-s-memory-inception-is-industry-s-holy-grail-
https://www.servethehome.com/ibm-power10-searching-for-the-holy-grail-of-compute/hot-chips-32-ibm-power10-memory-clustering-enterprise-scale-memory-sharing/
[iii] https://www.hpcwire.com/2020/08/17/ibm-debuts-power10-touts-new-memory-scheme-security-and-inferencing/

 

 

August 19, 2020 Posted by | Uncategorized | , , , , , , , , , , | Leave a comment

Persistent Memory for HANA @ SapphireNow Orlando 2018

Once again, Intel and the companies that utilize their processors were all abuzz at Sapphire about Intel Optane DC Persistent Memory (PMEM).  This is the second year in a row that they have been touting this future technology and its ability to fit into a DIMM form factor and take the place of some of the main memory currently supply by DRAM. I was intrigued until I saw Hasso Plattner, at SapphireNow 2018 Orlando, explain how HANA would utilize this technology.  He showed a chart where a 6TB HANA DB startup time of 50 minutes reduced to 4 min with a 50/50 mix of standard DRAM DIMMs and the new Intel PMEM DIMMs.  As he explained it, HANA column store would reside in PMEM, while working space and delta store tables would reside in DRAM.

50 minutes down 4 minutes sounds outstanding, but let’s see if we can pull back the veil a bit.

Who created the chart?  Dr. Plattner was vague about this. He suggested that it might be from an internal test.  When I asked multiple vendors, including Intel, if any parts were available for customer testing, I was told no and that it would require Cascade Lake, Intel’s next version of the current Skylake chips, to drive these new DIMMs.  I suspect that Dr. Plattner was referring to the 50 minutes as being from an internal test.  This means that the 4 minute projection with PMEM may have come from another source, e.g. Intel.

Why 50 minutes?  It might seem reasonable to assume that if this was an internal test, that SAP knows how to configure a system properly, so it was probably using best of breed SSD technology, e.g. Intel’s SSD 750.  50 minutes works out to roughly 122GB/min after HANA SW load.  IBM published a white paper in which Power Systems achieved approximately 172GB/min (30% faster) with a typical mid-range SSD subsystem and almost 1 TB/min with NVMe based SSDs, i.e. 740% faster.[i]  In other words, if 50 minutes for 6TB is longer than acceptable, Power Systems can already deliver radically faster startup time without using esoteric and untested memory concepts.

For the Intel world, getting down from 50 minutes to 4 minutes would be quite a feat, but how often is this sort of restart likely to happen?  Assuming an SAP client is not using one of the near-zero maintenance options, this depends on the frequency of such updates but most typically a couple of times per year.  More often, for Intel customers, predictive failure analysis on memory will call out memory DIMMs for replacement once or twice a month, so more frequent reboots may be required.  Of course, using a more reliable memory technology such as that offered in Power Systems could alleviate this requirement.  It is ironic that the use of unreliable DRAM memory options in x86 systems could be the very cause of why faster restarts are needed.

Speaking of reliability, is PMEM RAID protected like disk?  The answer, based on what has been published so far, appears to be no.  In other words, if a PMEM DIMM were to fail, not only could this cause the system to fail, but since this would result in an incomplete or corrupted memory image, reload from the storage subsystem would still be required.  Even more irony that the fast restart functionality of PMEM would be of no use when PMEM itself is the cause of the outage.  Also, this would be the first commercial use of this technology. Good thing that Version 1.0 of anything usually works perfectly!

Next, let us consider the effect of having column store in PMEM.  If SAP has said it once, they have said it 1,000 times, the “H” in “HANA” refers to “High Performance”.  If you slow down access to column store by a factor of 5x or 10x, you get a cascading effect on just about every possible KPI in the system.  Wait, did I say 5x or 10x?  I hate it when I have to resort to quoting the source: “Intel senior vice president Rob Crooke and Micron CEO Mark Durcan declared 3D Xpoint to be 1,000 times faster and 1,000 times more durable than NAND[ii] and third party reviews: “latencies should push down into the 1-3us range, splitting the difference between current generation DRAM (~80-100ns) and PCIe-based Optane parts (~10us)”[iii] or “As an NVDIMM, 3D XPoint memory would have approximately 20% of the speed of standard volatile DRAM.”[iv]

I just get a chuckle out of Intel’s official comment: “Unlike traditional DRAM, Intel Optane DC persistent memory will offer the unprecedented combination of high-capacity, affordability and persistence,” Lisa Spelman. Notice, Lisa does not say High Performance … good thing that is not a goal of HANA!

Ok, enough of the facts and other analysts comments, we want speculation!  Got it. Let us speculate on what happens when memory is 5x slower (10x is just twice as bad).  Let us also assume that the 5x or 10x slower is true and that Linux does not utilize a pseudo memory mapped file system (which it does with additional overhead).

In a highly optimized and largely hypothetical world, we will only have analytics using HANA.  (Yes, I get it, that is BW or a data lake, not S/4HANA or C/4HANA but I get to determine the parameters of this made-up world.)  Let’s consider an overly simplistic query example, e.g.  select customer where revenue > 100000.  The first 64 byte block of data movement to L3 cache only takes 1us, which at current Skylake 2.5GHz speeds means a wait of 2,500 processor cycles.  For sake of argument, we are going to assume no additional latency getting into the processor, no ccNUMA effects or any other delays.  The good news is that modern architectures will predict the next access and start loading subsequent data blocks, while query processing of the first block is occurring.  Unfortunately, since the DIMM is already busy, this preload has to wait 2,500 processor cycles before it can start transferring the next block of data and 2,500 processor cycles is usually more than sufficient to start and finish any portion of the work against only 64 bytes of data.  It is really hard to imagine how queries speed will not be significantly affected by this additional latency.  Imaging taking a current HANA BW query that runs in 10 seconds and telling the users to now expect the same result in 50 or 100 seconds.  Can you imagine the revolt?

Or consider transaction processing. A typical transaction might require access to data with no-preload possible since these accesses are usually random.  So, this access gets a 5x or 10x delay which is radically faster than disk access but much slower than previously experienced with in-DRAM computing?  The trouble is that while the transaction incurs that penalty, the processor is just sitting there waiting since this is “main memory” which means that it does not issue a query and wait for an I/O interrupt to remind it to continue processing.  So, this access and every access behind it waits 2,500 cycles before continuing, assuming that everything that is required comes across in that first 64 byte chunk. Unfortunately, a transaction accesses data in rows, not in columns which means that each row transaction may involve dozens of individual column accesses, each of which will experience a 5x or 10x delay.  Now, extend that to intensely random operations such as delta merge where, instead of a sub-second interruption to individual transaction response time, there is now 5x or 10x increase on all related columnar memory writes.  I could continue to extrapolations in save points, batch and external interfaces, but you get the general idea.  At currently projected speeds, this sort of slow down in transactional performance could result in project failure.

One other point that must not be overlooked is Intel’s claim about density.  While the initial press suggested up to 10x greater density, the DIMM specifications that are currently circulating show up to 512GB per DIMM, a significant increase from the 128GB DIMM max size today (4x not 10x).  But can HANA take advantage of that increased density? Prior to Skylake, SAP certified appliances with 8-sockets could only support 8TB of memory despite many having configuration maximums of 12TB.  SAP certifications are dependent on meeting performance KPIs and there has always been a pretty direct correlation between numbers of sockets, performance per socket and amount of memory supported.  In other words, it takes more and faster cores to support more memory.  So, is it reasonable to expect that SAP will discard those KPIs and accept 5x or 10x slower speeds while also jamming 4 times as much memory per socket as is currently supported?

This is not to say that persistent memory has no place in the HANA world.  There are many places in which a 5x or 10x memory penalty is worthwhile.  Consider the case of a non-prod instance, e.g. test. If it takes 5x or 10x longer, there is little impact to the business operations of most companies, just an increase in the cost of IT and applications professionals.  This may be offset against the cost of memory and, in some cases, the math may work.  How about HA or DR? No, that does not work as HA and DR must operate like production in the case of a failure or disaster.  Certainly aged data that might otherwise reside on disk would see a radical improvement from PMEM or a radically lower cost when compared to DRAM memory in  BW extension nodes.

Also, consider that aggressive research is occurring in this field and that future technologies may reduce the penalty to only 2x the speed of DRAM.  Would that be close enough to make it worthwhile?

One final thought: The co-inventor of 3D Xpoint memory is Micron. Earlier this year, Micron and Intel decided to go their separate ways with Micron using this technology in their QuantX solutions.[v]  Micron is a member of the OpenPower Consortium.  Is it possible that they could use this technology to build their own PMEM solutions for Power Systems?  If that happens, it would certainly be fascinating to see IBM harvesting the value of PMEM without the marketing and research investment that Intel has put into this.

[i]https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=POS03155USEN

[ii]https://www.youtube.com/watch?v=O0JUCjd_t_0

[iii]https://www.pcper.com/news/Storage/Intel-Launches-Optane-DC-Persistent-Memory-DIMMs-Talks-20TB-QLC-SSDs

[iv]https://searchstorage.techtarget.com/feature/3D-XPoint-memory-stumbles-in-race-to-ditch-DRAM-RRAM-may-step-up

https://www.anandtech.com/bench/product/1967?vs=2067

[v]https://www.anandtech.com/show/12258/intel-and-micron-to-discontinue-flash-memory-partnership

https://www.micron.com/products/advanced-solutions/3d-xpoint-technology

June 22, 2018 Posted by | Uncategorized | , , , , , , , | 3 Comments