SAPonPower

An ongoing discussion about SAP infrastructure

Optane DC Persistent Memory – Proven, industrial strength or full of hype – Detail, part 3

In this final post of a three-part series, we will explore the two other major “benefits” of Optane DIMMs: fast restart and TCO.

Fast restart

HANA, as an in-memory database, must be loaded into memory to perform well.  For years, and apparently up to current times, Intel has suffered from a major bottleneck in its I/O subsystem.  As a result, loading a single terabyte of data into memory could take 10 to 20 minutes in a best-case scenario.  Anecdotally, some customers have remarked that placing superfast, all-flash subsystems, such as IBM’s FlashSystem 9100, behind an Intel HANA system resulted in little improvement in load times compared to mid-range SSD subsystems.  For customers attempting to bring up a 10TB storage/20TB memory HANA system, this could result in load times measured in hours.  Clearly, a faster way of getting a HANA system up and running was sorely needed.

This did not appear to be a problem for customers using IBM’s Power Systems.  Not only has Power delivered roughly twice the I/O bandwidth of Intel systems for years, but with POWER9, IBM introduced PCIe Gen4, further extending its leadership in this area.  The bottleneck is actually in the storage subsystem and the number of paths it can drive, not in the processor.  To prove this, IBM ran a test with 10 NVMe cards in PCIe slots and was able to drive load speeds into HANA of almost 1TB/min.[i]  In other words, to improve restart times, Power Systems customers need only move to faster storage subsystems and/or add more or faster paths.
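To put those figures in perspective, here is a back-of-the-envelope calculation using the numbers quoted above (the per-TB paces are illustrative assumptions taken from this post, not measured values):

```python
# Rough restart load-time comparison; the per-TB pacing figures are the
# illustrative numbers quoted in this post, not measurements.

def load_time_minutes(data_tb: float, minutes_per_tb: float) -> float:
    """Time to load the column store into memory at a given pace."""
    return data_tb * minutes_per_tb

data_tb = 10  # roughly the data volume of the 10TB storage / 20TB memory example

scenarios = {
    "Intel best case (10 min/TB)": 10,
    "Intel worst case (20 min/TB)": 20,
    "POWER9 + 10 NVMe paths (~1 min/TB)": 1,
}

for label, pace in scenarios.items():
    minutes = load_time_minutes(data_tb, pace)
    print(f"{label}: {minutes:.0f} min ({minutes / 60:.1f} h)")
```

At the quoted paces, the same load that takes 1.7 to 3.3 hours on the Intel estimates finishes in roughly 10 minutes at 1TB/min, which is the gap the rest of this section is about.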

This suggests that Intel’s motivation for NVDIMMs may be to solve a problem of their own making.  But it also raises a question about their understanding of HANA.  If a customer is running a transactional workload such as Suite on HANA, S/4 or C/4 and is using HANA System Replication, wouldn’t at least one of the pair of nodes be available at all times?  SAP supports near-zero-downtime upgrades[ii], so systems, firmware, the OS or even HANA itself may be updated on one of the pair of nodes while the other continues to operate, followed by a synchronization of changed data and a controlled failover so that the first node can be updated.  In this way, cold restarts of HANA, where a fast restart option might make a big difference, may become a very rare occurrence.  In other words, wouldn’t this be a better option than penalizing the performance of everything with radically slower DIMMs compared to DRAM, as discussed in gory detail in the previous two posts of this series?

HANA also offers a quick restart option whereby HANA can be started and the database made available within minutes even though all of the columns have not yet been loaded into memory. Yes, performance will be pretty bad until all columns are loaded into memory, but for non-production systems and non-mission critical systems, this might be an acceptable option.  Lastly, with HANA 2.0 SPS04, SAP now supports fast restart with conventional memory.[iii]  This only works when the OS stays up and running, i.e. can’t be used when the system, firmware or OS is being updated, but this can be used for the vast majority of required restarts, e.g. HANA upgrades, patches and restarts when a bounce of the HANA environment is needed.  Though this is not mentioned in the help documentation, it may even be possible to patch the Linux kernel while using the fast restart option if SUSE SLES is used with their “Live Patching” function.[iv]

TCO

Optane DIMMs are less expensive than DRAM DIMMs.  List prices appear to be about 40% cheaper when comparing same-size DIMMs.  Effective prices, however, may have a much smaller delta, since there is competition for DRAM, meaning discounts may be much deeper than for the NVDIMMs from Intel, currently the only source.  This also assumes full utilization of those NVDIMMs, which may prove to be a drastically bad assumption.  Sizing guidance from SAP[v] shows that the ratio of DRAM to PMEM (their term for NVDIMMs) capacity can be anything from 2:1 to 1:4, but it provides no guidance as to where a given workload might fall or what sort of performance impact might result.  This means that a customer might purchase NVDIMMs at a capacity ratio of 1:2, e.g. 1TB DRAM:2TB PMEM, but might end up being able to utilize only 512GB or 1TB of PMEM due to negative performance results.  In that case, the cost per usable GB of NVDIMM would instantly double or quadruple, making the NVDIMMs effectively more expensive than DRAM DIMMs.
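To make the utilization point concrete, the sketch below computes the effective cost per usable GB of PMEM under the scenarios described above; the prices are hypothetical placeholders that merely preserve the ~40% list-price delta, not vendor quotes:

```python
# Effective cost per *usable* GB of PMEM when performance limits utilization.
# Prices are hypothetical placeholders preserving only the ~40% list-price delta.

DRAM_LIST_PER_GB = 10.00
PMEM_LIST_PER_GB = 6.00   # ~40% below DRAM at list

def effective_cost_per_gb(installed_gb: int, usable_gb: int) -> float:
    """Spread the full purchase price over only the GB that can actually be used."""
    return installed_gb * PMEM_LIST_PER_GB / usable_gb

installed = 2048  # 2TB PMEM purchased alongside 1TB DRAM (a 1:2 ratio)
for usable in (2048, 1024, 512):
    cost = effective_cost_per_gb(installed, usable)
    print(f"usable {usable:>4} GB -> ${cost:5.2f}/GB (vs. DRAM list ${DRAM_LIST_PER_GB:.2f}/GB)")
```

With these placeholder prices, a fully used 2TB of PMEM costs $6/GB, but if only half or a quarter of it can be used, the effective price jumps to $12 or $24 per usable GB, well past the DRAM list price.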

But let us assume the best rather than the worst.  Even if only a 2:1 ratio works relatively well, the cost of the NVDIMMs, if sized for that ratio, would be somewhat lower than the equivalent cost of DRAM DIMMs. The problem is that memory, while a significant portion of the cost of systems, is but one element in the overall TCO of a HANA landscape.  If reducing TCO is the goal, shouldn’t all options be considered?

Virtualization has been in heavy use by most customers for years, helping to drive up system utilization, resulting in the need for fewer systems, decreasing network and SAN ports, reducing floor space and power/cooling and, perhaps most importantly, reducing the cost of IT management.  Unfortunately, few high-end customers, other than those using IBM Power Systems, can take advantage of this technology in the HANA world due to the many reasons identified in the latest of many previous posts.[vi]  Put another way, if a customer utilizes an industrial-strength and proven virtualization solution for HANA, i.e. IBM PowerVM, they may be able to reduce TCO considerably[vii] and potentially much more than the relatively small improvement due to NVDIMMs.

But if driving down memory costs is the only goal, there are a couple of ideas worth investigating that are less radical than using NVDIMMs.  Depending on RTO requirements, some workloads might need an HA option but might not require it to be ready in minutes.  If this is the case, then a cold standby server could be utilized that runs other workloads, e.g. QA, Dev, Test, Sandbox, Hadoop, which could be killed in the event of a system outage.  Since no incremental memory would be required, memory costs would be substantially lower than those required for System Replication, even if NVDIMMs are used.  IBM offers a tool called VM Recovery Manager which can instrument and automate such a configuration.

Another option worth considering, only for non-production workloads, is a feature of IBM PowerVM called Memory Deduplication.  After different VMs are started using a shared memory pool, the hypervisor builds a logical memory map.  It then scans the pages of each VM looking for identical memory pages, at which time it uses the logical memory map to point each VM to the same real memory page, thereby freeing up the redundant memory pages for use by other workloads.  If a page is subsequently changed by one of the VMs, the hypervisor simply creates a unique real memory page for that VM.  The upshot of this feature is that the total quantity of DRAM may be reduced substantially for workloads that are relatively static and have large amounts of duplication between them.  The reason this should not be used for production is that when the VMs start, the hypervisor has not yet had the chance to deduplicate the memory pages and, if the sum of the logical memory of all VMs is larger than the total real memory, paging will occur.  This will subside over time and may be of little consequence to non-production workloads, but the risk to performance for production might be considered unacceptable and, besides, “Memory over-commitment must not be used” for production HANA according to SAP.
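The toy sketch below illustrates the general idea of page deduplication with copy-on-write.  It is purely conceptual: the data structures and timing are invented for illustration and do not represent how PowerVM actually implements the feature (in particular, real deduplication happens by scanning pages after the VMs are already running, which is exactly why the startup window carries the paging risk described above):

```python
# Conceptual model of memory page deduplication with copy-on-write.
# This is a toy illustration, not PowerVM's actual implementation; real
# deduplication happens by scanning pages after the VMs are already running.

class DedupHypervisor:
    def __init__(self):
        self.shared = {}       # page content -> real page id
        self.page_table = {}   # (vm, logical page) -> real page id
        self.next_id = 0

    def _alloc(self) -> int:
        self.next_id += 1
        return self.next_id - 1

    def dedup_map(self, vm: str, logical_page: int, content: str) -> None:
        """Point identical logical pages from different VMs at one real page."""
        if content not in self.shared:
            self.shared[content] = self._alloc()
        self.page_table[(vm, logical_page)] = self.shared[content]

    def write(self, vm: str, logical_page: int) -> None:
        """Copy-on-write: a changed page gets its own private real page again."""
        self.page_table[(vm, logical_page)] = self._alloc()

    def real_pages_in_use(self) -> int:
        return len(set(self.page_table.values()))


hyp = DedupHypervisor()
static_contents = ["os kernel", "hana binaries", "shared libs"]  # largely identical VMs
for vm in ("QA", "Dev", "Sandbox"):
    for page_no, content in enumerate(static_contents):
        hyp.dedup_map(vm, page_no, content)

print(hyp.real_pages_in_use())   # 3 real pages backing 9 logical pages
hyp.write("QA", 2)               # QA patches its libraries
print(hyp.real_pages_in_use())   # 4: only the changed page needed a private copy
```

The more static and similar the VMs, the more logical pages collapse onto the same real pages, which is why the savings are greatest for non-production landscapes with heavy duplication.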

Summary

Faster restarts than may be possible with traditional Intel systems may be achieved by using near-zero-downtime HANA upgrades with System Replication, HANA fast restart or by switching to a system with a radically faster I/O subsystem, e.g. IBM Power Systems.  TCO may be reduced with tried and proven virtualization technologies as provided with IBM PowerVM, cold standby systems or memory deduplication, rather than by experimenting with version 1.0 of a new technology with no track record, unknown reliability, poor sizing guidance and potentially huge performance impacts.

 

[i] https://www.ibm.com/downloads/cas/WQDZWBYJ

[ii] https://launchpad.support.sap.com/#/notes/1984882

[iii] https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.04/en-US/ce158d28135147f099b761f8b1ee43fc.html

[iv] https://launchpad.support.sap.com/#/notes/1984787

[v] https://launchpad.support.sap.com/#/notes/2786237

[vi] https://saponpower.wordpress.com/2018/09/26/vmware-pushes-past-4tb-sap-hana-limit/

[vii] https://www.ibm.com/downloads/cas/M7X2YXZD

June 3, 2019

Optane DC Persistent Memory – Proven, industrial strength or full of hype – Detail, part 1

Several non-Intel sites suggest that Intel’s storage class memory (Lenovo abbreviates these as DCPMM, while many others refer to them with the more generic term NVDIMM) delivers read latency roughly 5 times higher than DRAM, e.g. 350 nanoseconds for NVDIMM vs. 70 nanoseconds for DRAM.[i]  A much better analysis comes from Lenovo, which examined a variety of load conditions and published their results in a white paper.[ii]  Here are some of the results:

  • A socket fully populated with 6 DCPMMs could deliver up to 40GB/s read throughput and 15GB/s write throughput
  • Each additional pair of DCPMMs delivered a proportional increase in throughput
  • Random reads had a load-to-use latency roughly 50% higher than that of sequential reads
  • Random reads had a maximum per-socket (6x DCPMM) throughput of between 10 and 13GB/s, compared to 40 to 45GB/s for sequential reads

The most interesting quote from this section was: “Overall, workloads that are more read intensive and sequential in nature will see the best performance.”  This echoes the quote from SAP’s NVRAM white paper: “From the perspective (of) read accesses, sequential scans fare better in NVRAM than point reads: the cache line pre-fetch is expected to mitigate the higher latency.”[iii]

The next section is even more interesting.  Some of its results comparing the performance differences of DRAM to DCPMM were:

  • Almost 3x better max sequential read bandwidth
  • Over 5x better max random read bandwidth
  • Over 5x better max sequential 2:1 R/W bandwidth
  • Over 8x better max random 2:1 R/W bandwidth
  • Latencies for DCPMM in the random 2:1 R/W test hit a severe knee of the curve and showed max latencies over 8x that of DRAM at very light bandwidth loads
  • DRAM, by comparison, continued to deliver significantly increasing bandwidth with only a small amount of latency degradation until it hit a knee of the curve at over 10x of the max DCPMM bandwidth

Unfortunately, this is not a direct indication of how an application like HANA might perform.  For that, we have to look at available benchmarks.  To date, none of the SD benchmarks have utilized NVDIMMs.  Lenovo published a couple of BWH results, one with and one without NVDIMMs, but used different numbers of records, so they are not directly comparable.  HPE, on the other hand, published a couple of BWH results using the exact same systems and numbers of records.[iv]  Remarkably, only a small, 6% performance degradation occurred in the parallel query execution phase of the benchmark when going from an all-DRAM 3TB configuration to a mixed 768GB DRAM/3TB NVDIMM configuration.  The exact configuration is not shown on the public web site, but we can assume something about the config based on SAP Note 2700084 – FAQ: SAP HANA Persistent Memory: “To achieve highest memory performance, all DIMM slots have to be used in pairs of DRAM DIMMs and persistent memory DIMMs, i.e. the system must be equipped with one DRAM DIMM and one NVDIMM in each memory channel.”  Vendors submitting benchmark results do not have to follow these guidelines, but if HPE did, then they used 24@32GB DRAM DIMMs and 24@128GB NVDIMMs.  Also, following other guidelines in the same SAP Note and the SAP HANA Administration Guide, HPE most likely placed the column store on NVDIMMs, with row store, caches, intermediate and final results calculations on DRAM DIMMs.
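A quick capacity check shows how that inferred DIMM mix lines up with the published 768GB/3TB figures (the 24-slot counts follow the SAP pairing guidance quoted above and are an inference, not a configuration HPE has disclosed):

```python
# Sanity check of the inferred DIMM configuration; the slot counts are an
# inference based on the SAP pairing guidance, not published by HPE.
dram_dimms, dram_gb = 24, 32
pmem_dimms, pmem_gb = 24, 128

print(f"DRAM: {dram_dimms * dram_gb} GB")             # 768 GB
print(f"PMEM: {pmem_dimms * pmem_gb / 1024:.0f} TB")  # 3 TB
```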

BWH is a benchmark composed of 1.3 billion records, which can easily be loaded into a 1TB system with room to spare.  To achieve larger configurations, vendors can load the same 1.3B records a second, third or more times, which HPE did a total of 5 times to get to 6.5B records.  The column compression dictionary tables only grow with unique data, i.e. they do not grow when the same data set is repeated, regardless of the number of times it is added.
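A minimal sketch of dictionary encoding, the generic technique behind that behavior (an illustration of the concept, not HANA’s actual column store implementation), shows why reloading identical records leaves the dictionaries unchanged:

```python
# Minimal dictionary encoding: a column keeps one dictionary of unique values
# plus an integer value ID per row. Reloading the same records adds value IDs
# but no new dictionary entries. Illustrative only; not HANA's implementation.

def dictionary_encode(column_values):
    dictionary, value_ids, position = [], [], {}
    for value in column_values:
        if value not in position:
            position[value] = len(dictionary)
            dictionary.append(value)
        value_ids.append(position[value])
    return dictionary, value_ids

base_rows = ["DE", "US", "FR", "US", "DE"]            # "1.3B records" in miniature
dict_1x, ids_1x = dictionary_encode(base_rows)
dict_5x, ids_5x = dictionary_encode(base_rows * 5)    # load the same set 5 times

print(len(dict_1x), len(ids_1x))   # 3 unique values, 5 value IDs
print(len(dict_5x), len(ids_5x))   # still 3 unique values, 25 value IDs
```

Only the value-ID vectors grow with the repeated loads; the dictionaries, and therefore the truly unique data, stay the same size.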

BWH includes 3 phases: a load phase which represents data ingestion from ERP, a parallel query phase and a sequential, single-user complex query phase.  Some have focused on the ingestion and complex query phases because they show the most degradation in performance vs. DRAM.  While that is tempting, I believe the parallel query phase is of the most relevance.  During this phase, 385 queries of low, medium and high complexity (no clue as to how SAP defines those complexities, what their SQL looks like or how many of each type are included) are run, in parallel and randomly.  After an hour, the total count of queries completed is reported.  In theory, the larger the database, the fewer queries could be run per hour, as each query would have more data to traverse.  However, that is not what we see in these results.

Lenovo, once again, provides the best insights here.  With Skylake processors, they reported two results: on the first, they loaded 1.3B records; on the second, 5.2B records, or 4 times the number of rows with only twice the memory.  One might predict that queries per hour would be 4 times or more worse considering the non-proportionate increase in memory.  The results, however, show only a little over a 2x decrease in Query/hr.  Dell reported a similar set of results, this time with Cascade Lake, also with only real memory and also with only around a 2x decrease in Query/hr for a 4x larger number of records.

What does that tell us? It is impossible to say for sure. From the SAP NVRAM white paper referenced earlier, “One can observe that some of the queries are more sensitive to the latency of the persistent memory than others. This can be explained by multiple factors:

  1. Does the query exhibit a memory access pattern that can easily be prefetched by the hardware prefetchers?
  2. Is the working set of queries small enough to fit in CPU cache and hence agnostic to persistent memory latency?
  3. Is processing of the query compute or latency bound?”

SAP stores results in the “static result cache”: “The static result cache is particularly helpful in the following scenario:  Complex query based on a view; Rather small result set; Limited amount of changes in the underlying tables.  The static result cache can provide the following advantages: Reduction of CPU consumption; Reduction of SAP HANA thread utilization; Performance improvements.”[v]

Other areas, like delta storage, caches, intermediate result sets and row store, remain solely in dynamic RAM (DRAM), not NVDIMMs.[vi]

The data in BWH is completely static.  Some queries are complex and presumably based on views.   Since the same queries execute over and over again, prefetchers may become especially effective.  It may be possible that some or many of the 385 queries in BWH are hitting the results cache in DRAM.  In other words, after the first set of queries runs, a decent percentage of accesses may be hitting only the DRAM portion of memory, masking much of the latency and bandwidth issues of NVRAM.  Put another way, this benchmark may actually be testing CPU power against a set of results cached in working memory more than actual query speed against the column store.

So, let us now consider the HPE benchmark with NVDIMMs.  On the surface, a 6% degradation with NVDIMMs vs. all DRAM seems improbable considering NVDIMMs’ higher latency and lower bandwidth.  But after considering the caching above, the repetitive data and the repeating query set, it should not be much of a shock that this sort of benchmark could be masking the real performance effects.  Add to that the quote from Lenovo’s white paper above, which said that read-intensive, sequential workloads see the best performance on NVDIMMs.

Taken together, while not definitive, these observations suggest that a real workload with more varied and random reads, against a non-repeating set of records, might see substantially different query throughput than demonstrated by this benchmark.

Believe it or not, there is even more detail on this subject, which will be the focus of a part 2 post.

 

[i] https://www.pcper.com/news/Storage/Intels-Optane-DC-Persistent-Memory-DIMMs-Push-Latency-Closer-DRAM

[ii] https://lenovopress.com/lp1083.pdf

[iii] http://www.vldb.org/pvldb/vol10/p1754-andrei.pdf

[iv] https://www.sap.com/dmc/exp/2018-benchmark-directory/#/bwh

[v] https://launchpad.support.sap.com/#/notes/2336344

[vi] https://launchpad.support.sap.com/#/notes/2700084

May 20, 2019

Optane DC Persistent Memory – Proven, industrial strength or full of hype?

“Intel® Optane™ DC persistent memory represents a groundbreaking technology innovation,” says the press release from Intel.  They go on to say that it “represents an entirely new way of managing data for demanding workloads like the SAP HANA platform. It is non-volatile, meaning data does not need to be re-loaded from persistent storage to memory after a shutdown. Meanwhile, it runs at near-DRAM speeds, keeping up with the performance needs and expectations of complex SAP HANA environments, and their users.” and that “Total cost of ownership for memory for an SAP HANA environment can be reduced by replacing expensive DRAM modules with non-volatile persistent memory.”  In other words, they are saying that it performs well, lowers cost and improves restart speeds dramatically.  Let’s take a look at each of these potential benefits, starting with performance, examine their veracity and evaluate other options to achieve these same goals.

I know that some readers appreciate the long and detailed posts that I typically write.  Others might find them overwhelming.  So, I am going to start with my conclusions and then provide the reasoning behind them in separate posts.

Conclusions

Performance

Storage class memory is an emerging type of memory with great potential, but its current form, Intel DC Persistent Memory, is unproven.  It could have a moderate performance impact on highly predictable, low-complexity workloads; it will likely have a much higher impact on more complex workloads, and it could cause a significant performance degradation on OLTP workloads that might make meeting performance SLAs impossible.

Some data, e.g. aged data in the form of extension nodes, data aging objects, HANA native storage extensions, data tiering or archives, could be placed on this type of storage to improve speed of access.  On the other hand, if the SLAs for access to aged data do not require near in-memory speeds, then the additional cost of persistent memory over old, and very cheap, spinning disk may not be justified.

Highly predictable, simple, read-only query environments, such as canned reporting from a BW system, may derive some value from this class of memory; however, data load speeds will need to be carefully examined to ensure that data ingestion throughput to encrypted persistent storage allows for daily updates within the allowed outage window.

Restart Speeds

Intel’s storage class memory is clearly orders of magnitude faster than external storage, whether SSD or other types of media.  If this were the only issue customers were facing, and there were no performance or reliability implications and no other way to address restart times, then this might be a valuable technology.  However, as SAP has announced DRAM-based HANA Fast Restart with HANA 2.0 SPS04, and most customers use HANA System Replication when they have high uptime requirements, the need for rapid restarts may be significantly diminished.  Also, this may be a solution to a problem of Intel’s own making, as IBM Power Systems customers rarely share this concern, perhaps because IBM invested heavily in fast I/O processing in their processor chips.

TCO

On a GB-to-GB comparison, Optane is indeed less expensive than DRAM … assuming you are able to use all of it.  Several vendors’ and SAP’s guidance suggests populating the same number of slots with NVDIMMs as are used for DRAM DIMMs.  SAP recommends using NVDIMMs only for columnar storage, and historic memory/slot limitations are largely based on performance.  This means that some of this new storage may go unused, which means the cost per used GB may not be as low as the cost per installed GB.

And if saving TCO is the goal, there are dozens of other ways in which TCO can be minimized, not just lowering the cost of DIMMs.  For customers that are really focused on reducing TCO, effective virtualization, different HA/DR methodologies, optimized storage and other associated IT cost optimizations may have as much or more impact on TCO as may be possible with the use of storage class memory.  In addition, the cost of downtime should be included in any TCO analysis, and since this type of memory is unproven in widespread and/or large-memory installations, and the available memory protection is less than what is available for DRAM-based DIMMs, this potential cost to the enterprise may dwarf the savings from using this technology currently.

May 13, 2019