SAPonPower

An ongoing discussion about SAP infrastructure

Optane DC Persistent Memory – Proven, industrial strength or full of hype – Detail, part 3

In this final of a three part series, we will explore the two other major “benefits” of Optane DIMMs: fast restart and TCO.

Fast restart

HANA, as an in-memory database, must be loaded into memory to perform well.  Intel, for years and, apparently up to current times, has suffered with a major bottleneck in its I/O subsystem.  As a result, loading a single terabyte of data into memory could take 10 to 20 minutes in a best-case scenario.  Anecdotally, some customers have remarked that placing superfast, all flash subsystems, such as IBM’s FlashSystem 9100, behind an Intel HANA system resulted in little improvement in load times compared to mid-range SSD subsystems.  For customers attempting to bring up a 10TB storage/20TB memory HANA system, this could result in load times measured in hours.  As a result, a faster way of getting a HANA system up and running was sorely needed.

This did not appear to be a problem for customers using IBM’s Power Systems.  Not only has Power delivered roughly twice the I/O bandwidth of Intel systems for years, but with POWER9, IBM introduced PCIe Gen4, further extending their leadership in this area.  The bottleneck is actually in the storage subsystem and number of paths that it can drive, not in the processor.  To prove this, IBM ran a test with 10 NVMe cards in PCIe slots and was able to drive load speeds into HANA of almost 1TB/min.[I].  In other words, to improve restart times, Power Systems customers need only move to faster subsystems and/or add more or faster paths.

This suggests that Intel’s motivation for NVDIMMs may be to solve a problem of their own making.  But this also raises a question of their understanding of HANA.  If a customer is running a transactional workload such as Suite on HANA, S/4 or C/4, and is using HANA System Replication, wouldn’t at least one of the pair of nodes be available at all times?  SAP supports near zero upgrades[ii], so systems, firmware, OS or even HANA itself may be updated on one of the pair of nodes while the other continues to operate, followed by a synchronization of changed data and a controlled failover so that the first node might be updated.  In this way, cold restarts of HANA, where a fast restart option might make a big difference, may be driven down into a very rare occurrence.  In other words, wouldn’t this be a better option than causing poor performance to everything due to radically slower DIMMs compared to DRAM as has been discussed in gory detail on the previous two posts of this series?

HANA also offers a quick restart option whereby HANA can be started and the database made available within minutes even though all of the columns have not yet been loaded into memory. Yes, performance will be pretty bad until all columns are loaded into memory, but for non-production systems and non-mission critical systems, this might be an acceptable option.  Lastly, with HANA 2.0 SPS04, SAP now supports fast restart with conventional memory.[iii]  This only works when the OS stays up and running, i.e. can’t be used when the system, firmware or OS is being updated, but this can be used for the vast majority of required restarts, e.g. HANA upgrades, patches and restarts when a bounce of the HANA environment is needed.  Though this is not mentioned in the help documentation, it may even be possible to patch the Linux kernel while using the fast restart option if SUSE SLES is used with their “Live Patching” function.[iv]

TCO

Optane DIMMs are less expensive than DRAM DIMMs.  List prices appears to be about 40% cheaper when comparing same size DIMMs.  Effective prices, however, may have a much smaller delta since there exists competition for DRAM meaning discounts may be much deeper than for the NVDIMMs from Intel, currently the only source.  This assumes full utilization of those NVDIMMs which may prove to be a drastically bad assumption.  Sizing guidance from SAP[v]shows that the ratio of DRAM vs. PMEM (their term for NVDIMMs) capacity can be anything from 2:1 to 1:4, but it provides no guidance as to where a given workload might fall or what sort of performance impact might result.  This means that a customer might purchase NVDIMMs with a capacity ratio of 1:2, e.g. 1TB DRAM:2TB PMEM, but might end up only being able to utilize only 512GB or 1TB PMEM due to negative performance results.  In that case, the cost of effective NVDIMMs would have instantly doubled or quadrupled and would, effectively, be more expensive than DRAM DIMMs.

But let us assume the best rather than the worst.  Even if only a 2:1 ratio works relatively well, the cost of the NVDIMMs, if sized for that ratio, would be somewhat lower than the equivalent cost of DRAM DIMMs. The problem is that memory, while a significant portion of the cost of systems, is but one element in the overall TCO of a HANA landscape.  If reducing TCO is the goal, shouldn’t all options be considered?

Virtualization has been in heavy use by most customers for years helping to drive up system utilization resulting in the need for fewer systems, decreasing network and SAN ports, reducing floor space and power/cooling and, perhaps most importantly, reducing the cost of IT management.  Unfortunately, few high end customers, other than those using IBM Power Systems can take advantage of this technology in the HANA world due to the many reasons identified in the latest of many previous posts.[vi]  Put another way, if a customer utilizes an industrial strength and proven virtualization solution for HANA, i.e. IBM PowerVM, they may be able to reduce TCO considerably[vii]and potentially much more than the relatively small improvement due to NVDIMMs.

But if driving down memory costs is the only goal, there are a couple of ideas that are less radical than using NVDIMMs worth investigating.  Depending on RTO requirements, some workloads might need an HA option, but might not require it to be ready in minutes.  If this is the case, then a cold standby server running other workloads which could be killed in the event of a system outage could be utilized, e.g. QA, Dev, Test, Sandbox, Hadoop.  Since no incremental memory would be required, memory costs would be substantially lower than that required for System Replication, even if NVDIMMs are used. IBM offers a tool called VM Recovery Manager which can instrument and automate such a configuration.

Another option worth considering, only for non-production workloads, is a feature of IBM PowerVM called Memory Deduplication.  After different VMs are started using “a shared memory pool”, the hypervisor builds a logical memory map.  It then scans the pages of each VM looking for identical memory pages at which time it uses the logical memory map to point each VM to the same real memory page thereby freeing up the redundant memory pages for use by other workloads.  If a page is subsequently changed by one of the VMs, the hypervisor simply recreates a unique real memory page for that VM. The upshot of this feature is that the total quantity of DRAM memory may be reduced substantially for workloads that are relatively static and have large amounts of duplication between them. The reason that this should not be used for production is because when the VMs start, the hypervisor has not yet had the chance to deduplicate the memory pages and, if the sum of logical memory of all VMs is larger than the total memory, paging will occur.  This will subside over time and may be of little consequence to non-production workloads, but the risk to performance for production might be considered unacceptable and, besides, “Memory over-commitment must not be used” for production HANA according to SAP.

Summary

Faster restarts than may be possible with traditional Intel systems may be achieved by using near zero HANA upgrades with System Replication, HANA fast restart or by switching to a system with a radically faster I/O subsystem, e.g. IBM Power Systems. TCO may be reduced with tried and proven virtualization technologies as provided with IBM PowerVM, cold standby systems or memory deduplication rather than experimenting with version 1.0 of a new technology with no track record, unknown reliability, poor guidance on sizing and potentially huge impacts to performance.

 

[i]https://www.ibm.com/downloads/cas/WQDZWBYJ

[ii]https://launchpad.support.sap.com/#/notes/1984882

[iii]https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.04/en-US/ce158d28135147f099b761f8b1ee43fc.html

[iv]https://launchpad.support.sap.com/#/notes/1984787

[v]https://launchpad.support.sap.com/#/notes/2786237

[vi]https://saponpower.wordpress.com/2018/09/26/vmware-pushes-past-4tb-sap-hana-limit/

[vii]https://www.ibm.com/downloads/cas/M7X2YXZD

Advertisements

June 3, 2019 Posted by | Uncategorized | , , , , , , , , , , , , , , | 1 Comment

Persistent Memory for HANA @ SapphireNow Orlando 2018

Once again, Intel and the companies that utilize their processors were all abuzz at Sapphire about Intel Optane DC Persistent Memory (PMEM).  This is the second year in a row that they have been touting this future technology and its ability to fit into a DIMM form factor and take the place of some of the main memory currently supply by DRAM. I was intrigued until I saw Hasso Plattner, at SapphireNow 2018 Orlando, explain how HANA would utilize this technology.  He showed a chart where a 6TB HANA DB startup time of 50 minutes reduced to 4 min with a 50/50 mix of standard DRAM DIMMs and the new Intel PMEM DIMMs.  As he explained it, HANA column store would reside in PMEM, while working space and delta store tables would reside in DRAM.

50 minutes down 4 minutes sounds outstanding, but let’s see if we can pull back the veil a bit.

Who created the chart?  Dr. Plattner was vague about this. He suggested that it might be from an internal test.  When I asked multiple vendors, including Intel, if any parts were available for customer testing, I was told no and that it would require Cascade Lake, Intel’s next version of the current Skylake chips, to drive these new DIMMs.  I suspect that Dr. Plattner was referring to the 50 minutes as being from an internal test.  This means that the 4 minute projection with PMEM may have come from another source, e.g. Intel.

Why 50 minutes?  It might seem reasonable to assume that if this was an internal test, that SAP knows how to configure a system properly, so it was probably using best of breed SSD technology, e.g. Intel’s SSD 750.  50 minutes works out to roughly 122GB/min after HANA SW load.  IBM published a white paper in which Power Systems achieved approximately 172GB/min (30% faster) with a typical mid-range SSD subsystem and almost 1 TB/min with NVMe based SSDs, i.e. 740% faster.[i]  In other words, if 50 minutes for 6TB is longer than acceptable, Power Systems can already deliver radically faster startup time without using esoteric and untested memory concepts.

For the Intel world, getting down from 50 minutes to 4 minutes would be quite a feat, but how often is this sort of restart likely to happen?  Assuming an SAP client is not using one of the near-zero maintenance options, this depends on the frequency of such updates but most typically a couple of times per year.  More often, for Intel customers, predictive failure analysis on memory will call out memory DIMMs for replacement once or twice a month, so more frequent reboots may be required.  Of course, using a more reliable memory technology such as that offered in Power Systems could alleviate this requirement.  It is ironic that the use of unreliable DRAM memory options in x86 systems could be the very cause of why faster restarts are needed.

Speaking of reliability, is PMEM RAID protected like disk?  The answer, based on what has been published so far, appears to be no.  In other words, if a PMEM DIMM were to fail, not only could this cause the system to fail, but since this would result in an incomplete or corrupted memory image, reload from the storage subsystem would still be required.  Even more irony that the fast restart functionality of PMEM would be of no use when PMEM itself is the cause of the outage.  Also, this would be the first commercial use of this technology. Good thing that Version 1.0 of anything usually works perfectly!

Next, let us consider the effect of having column store in PMEM.  If SAP has said it once, they have said it 1,000 times, the “H” in “HANA” refers to “High Performance”.  If you slow down access to column store by a factor of 5x or 10x, you get a cascading effect on just about every possible KPI in the system.  Wait, did I say 5x or 10x?  I hate it when I have to resort to quoting the source: “Intel senior vice president Rob Crooke and Micron CEO Mark Durcan declared 3D Xpoint to be 1,000 times faster and 1,000 times more durable than NAND[ii] and third party reviews: “latencies should push down into the 1-3us range, splitting the difference between current generation DRAM (~80-100ns) and PCIe-based Optane parts (~10us)”[iii] or “As an NVDIMM, 3D XPoint memory would have approximately 20% of the speed of standard volatile DRAM.”[iv]

I just get a chuckle out of Intel’s official comment: “Unlike traditional DRAM, Intel Optane DC persistent memory will offer the unprecedented combination of high-capacity, affordability and persistence,” Lisa Spelman. Notice, Lisa does not say High Performance … good thing that is not a goal of HANA!

Ok, enough of the facts and other analysts comments, we want speculation!  Got it. Let us speculate on what happens when memory is 5x slower (10x is just twice as bad).  Let us also assume that the 5x or 10x slower is true and that Linux does not utilize a pseudo memory mapped file system (which it does with additional overhead).

In a highly optimized and largely hypothetical world, we will only have analytics using HANA.  (Yes, I get it, that is BW or a data lake, not S/4HANA or C/4HANA but I get to determine the parameters of this made-up world.)  Let’s consider an overly simplistic query example, e.g.  select customer where revenue > 100000.  The first 64 byte block of data movement to L3 cache only takes 1us, which at current Skylake 2.5GHz speeds means a wait of 2,500 processor cycles.  For sake of argument, we are going to assume no additional latency getting into the processor, no ccNUMA effects or any other delays.  The good news is that modern architectures will predict the next access and start loading subsequent data blocks, while query processing of the first block is occurring.  Unfortunately, since the DIMM is already busy, this preload has to wait 2,500 processor cycles before it can start transferring the next block of data and 2,500 processor cycles is usually more than sufficient to start and finish any portion of the work against only 64 bytes of data.  It is really hard to imagine how queries speed will not be significantly affected by this additional latency.  Imaging taking a current HANA BW query that runs in 10 seconds and telling the users to now expect the same result in 50 or 100 seconds.  Can you imagine the revolt?

Or consider transaction processing. A typical transaction might require access to data with no-preload possible since these accesses are usually random.  So, this access gets a 5x or 10x delay which is radically faster than disk access but much slower than previously experienced with in-DRAM computing?  The trouble is that while the transaction incurs that penalty, the processor is just sitting there waiting since this is “main memory” which means that it does not issue a query and wait for an I/O interrupt to remind it to continue processing.  So, this access and every access behind it waits 2,500 cycles before continuing, assuming that everything that is required comes across in that first 64 byte chunk. Unfortunately, a transaction accesses data in rows, not in columns which means that each row transaction may involve dozens of individual column accesses, each of which will experience a 5x or 10x delay.  Now, extend that to intensely random operations such as delta merge where, instead of a sub-second interruption to individual transaction response time, there is now 5x or 10x increase on all related columnar memory writes.  I could continue to extrapolations in save points, batch and external interfaces, but you get the general idea.  At currently projected speeds, this sort of slow down in transactional performance could result in project failure.

One other point that must not be overlooked is Intel’s claim about density.  While the initial press suggested up to 10x greater density, the DIMM specifications that are currently circulating show up to 512GB per DIMM, a significant increase from the 128GB DIMM max size today (4x not 10x).  But can HANA take advantage of that increased density? Prior to Skylake, SAP certified appliances with 8-sockets could only support 8TB of memory despite many having configuration maximums of 12TB.  SAP certifications are dependent on meeting performance KPIs and there has always been a pretty direct correlation between numbers of sockets, performance per socket and amount of memory supported.  In other words, it takes more and faster cores to support more memory.  So, is it reasonable to expect that SAP will discard those KPIs and accept 5x or 10x slower speeds while also jamming 4 times as much memory per socket as is currently supported?

This is not to say that persistent memory has no place in the HANA world.  There are many places in which a 5x or 10x memory penalty is worthwhile.  Consider the case of a non-prod instance, e.g. test. If it takes 5x or 10x longer, there is little impact to the business operations of most companies, just an increase in the cost of IT and applications professionals.  This may be offset against the cost of memory and, in some cases, the math may work.  How about HA or DR? No, that does not work as HA and DR must operate like production in the case of a failure or disaster.  Certainly aged data that might otherwise reside on disk would see a radical improvement from PMEM or a radically lower cost when compared to DRAM memory in  BW extension nodes.

Also, consider that aggressive research is occurring in this field and that future technologies may reduce the penalty to only 2x the speed of DRAM.  Would that be close enough to make it worthwhile?

One final thought: The co-inventor of 3D Xpoint memory is Micron. Earlier this year, Micron and Intel decided to go their separate ways with Micron using this technology in their QuantX solutions.[v]  Micron is a member of the OpenPower Consortium.  Is it possible that they could use this technology to build their own PMEM solutions for Power Systems?  If that happens, it would certainly be fascinating to see IBM harvesting the value of PMEM without the marketing and research investment that Intel has put into this.

[i]https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=POS03155USEN

[ii]https://www.youtube.com/watch?v=O0JUCjd_t_0

[iii]https://www.pcper.com/news/Storage/Intel-Launches-Optane-DC-Persistent-Memory-DIMMs-Talks-20TB-QLC-SSDs

[iv]https://searchstorage.techtarget.com/feature/3D-XPoint-memory-stumbles-in-race-to-ditch-DRAM-RRAM-may-step-up

https://www.anandtech.com/bench/product/1967?vs=2067

[v]https://www.anandtech.com/show/12258/intel-and-micron-to-discontinue-flash-memory-partnership

https://www.micron.com/products/advanced-solutions/3d-xpoint-technology

June 22, 2018 Posted by | Uncategorized | , , , , , , , | 3 Comments