SAPonPower

An ongoing discussion about SAP infrastructure

Optane DC Persistent Memory – Proven, industrial strength or full of hype – Detail, part 2

If the performance considerations from part 1 were the only issues, a reasonable case could be made for the potential value of doing a PoC with this technology.  But, of course, those are not the only issues.  One of the reasons that NVDIMMs have longer latencies than DRAM is due to their persistence and therefore the need to encrypt data placed on these components.  Encryption and decryption take a lot of computational power and can have a substantial impact on latency and bandwidth.  The funny thing is that encryption of these NVDIMMs can be turned off if desired, presumably with a resulting improvement to performance.  But what kind of customer would be willing to turn off this vital security technology?

Another desirable trait of modern, in-memory platforms is advanced memory protection which allows a system to continue to operate in the event of a DIMM failure.  This often starts with basic ECC, but then progresses to SDDC, DDDC (Chipkill or Lockstep), ADDDC (Skylake and beyond only) and IBM’s unique Chipkill + chip sparing technology.  ADDDC is not available for NVDIMMs, but DDDC is.  The downside of DDDC is that it comes with a significant performance penalty. No performance numbers have been provided for NVDIMMs configured with DDDC, but previous generations saw 20% to 40% degradation when using this mode.[i][ii]

What kind of customer would be willing to disable key security features or run critical systems without the best available reliability technologies?  I would certainly advise customers to use encryption and advanced reliability technologies in most circumstances.  Only those customers that can scramble business critical, PII and/or HIPAA data should ever consider disabling persistent memory encryption.  I searched, using every option that I could imagine, and failed to find a single web site that recommended ever disabling NVDIMM encryption.

SAP Benchmarks results posted on the external web site do not show the details of how security and reliability configuration parameters have been set.  It is therefore impossible to say whether HPE enabled or disabled these protection features.  In my many years of experience and extensive discussion with benchmarking experts, I can share that every single one, at every vendor, used every tool or technology that did not violate official rules to enhance results.  It would not be too much of a leap to project that HPE, and other vendors posting results with NVDIMMs, have likely disabled anything that might cause their results to diminish in any way.  (HPE, if you would like to share your configuration details, I would be happy to post them and if I have mischaracterized how you ran these benchmarks, will also post a retraction.) As a result, these BWH results may not only have relevance to only a small subset of the potential workloads but may also represent an unacceptable exposure to any company that has high single system availability requirements or has one of those unreasonable security departments which thinks that data protection is actually worthwhile.

And then, there are OLTP customers.  Based on the lack of benchmark testing of Suite on HANA, S/4HANA or C/4HANA combined with the above data from Lenovo about the massive reduction of bandwidth and associated huge increase in latency for OLTP, it would be MOST unwise to place any of these types of environments on systems with NVDIMMs without extensive testing of real customer workloads to ensure that internal performance SLAs can be met.

Certain types of workloads may perform decently with NVDIMMs.  BW environments where the primary use is for predictable and repeatable queries and reports may see only moderate performance degradation compared to DRAM based systems, but still orders of magnitude better performance that AnyDB systems which merely cache recently used data in memory and keep most data on external storage.  BW Extension nodes, S/4 Data aging objects and other types of archival systems that take older, less frequently used data and place them on other tiers of storage or systems, could certainly benefit from NVDIMMs.  Non-prod workloads which are not in the critical path to production, e.g. dev, test, sandbox, might make sense to place on systems with NVDIMMs.  All of these depend on an acceptance of potential performance issues and hardware/firmware/software fixes that inevitably come once customers start playing with version 1.0 of any new technology.

Based on likely performance issues, inferior RAS technology and the above mentioned “fix” dilemma, I would strongly advise that critical systems like production, QA, pre-prod, HA and DR should stay on DRAM based systems until bleeding edge customers prove the value of NVDIMMs and are willing to publicly share their journey.

The question then becomes whether the benefit to a subset of the environments are so substantial that it makes sense to select a vendor for HANA systems based on their ability to utilize NVDIMMs even when this technology might not be used for the most critical of the workloads and their associated critical path and HA/DR systems. This gets into the subjects of cost reduction and restart speeds which will be covered in part 3 of this series.

[i]https://lenovopress.com/lp0048.pdf

[ii]https://sp.ts.fujitsu.com/dmsp/Publications/public/wp-broadwell-ex-memory-performance-ww-en.pdf

May 27, 2019 Posted by | Uncategorized | , , , , , , , , , , , | Leave a comment

Optane DC Persistent Memory – Proven, industrial strength or full of hype – Detail, part 1

Several non-Intel sites suggest that Intel’s storage class memory (Lenovo abbreviates these as DCPMM, while many others refer to them with the more generic term NVDIMM) delivers a read latency of roughly 5 times slower than DRAM, e.g. 350 nanoseconds for NVDIMM vs. 70 nanoseconds for DRAM.[i]  A much better analysis comes from Lenovo which examined a variety of load conditions and published their results in a white paper.[ii]  Here are some of the results:

  • A fully populated 6x DCPMM socket could deliver up to 40GB/s read throughput, 15GB/s write
  • Each additional pair of DCPMMs delivered proportional increases in throughput
  • Random reads had a load to use latency that was roughly 50% higher than sequential reads
  • Random reads had a max per socket (6x DCPMM) throughput that was between 10 and 13GB/s compared to 40 to 45GB/s for sequential reads

The most interesting quote from this section was: “Overall, workloads that are more read intensive and sequential in nature will see the best performance.”  This echoes the quote from SAP’s NVRAM white paper: “From the perspective (of) read accesses, sequential scans fare better in NVRAM than point reads: the cache line pre-fetch is expected to mitigate the higher latency.[iii]

The next section is even more interesting.  Some of its results comparing the performance differences of DRAM to DCPMM were:

  • Almost 3x better max sequential read bandwidth
  • Over 5x better max random read bandwidth
  • Over 5x better max sequential 2:1 R/W bandwidth
  • Over 8x better max random 2:1 R/W bandwidth
  • Latencies for DCPMM in the random 2:1 R/W test hit a severe knee of the curve and showed max latencies over 8x that of DRAM at very light bandwidth loads
  • DRAM, by comparison, continued to deliver significantly increasing bandwidth with only a small amount of latency degradation until it hit a knee of the curve at over 10x of the max DCPMM bandwidth

Unfortunately, this is not a direct indication of how an application like HANA might perform.  For that, we have to look at available benchmarks. To date, none of the SD benchmarks have utilized NVDIMMs.  Lenovo published a couple of BWH results, one with and one without NVDIMMs, but used different numbers of records, so they are not directly comparable.  HPE, on the other hand, published a couple of BWH results using the exact same systems and numbers of records.[iv]  Remarkably, only a small, 6% performance degradation, going from an all DRAM 3TB configuration to a mixed 768GB/3TB NVDIMM configuration occurred in the parallel query execution phase of the benchmark.  The exact configuration is not shown on the public web site, but we can assume something about the config based on SAP Note: 2700084 – FAQ: SAP HANA Persistent Memory: To achieve highest memory performance, all DIMM slots have to be used in pairs of DRAM DIMMs and persistent memory DIMMs, i.e. the system must be equipped with one DRAM DIMM and one NVDIMM in each memory channel.”  Vendors submitting benchmark results do not have to follow these guidelines, but if HPE did, then they used 24@32GB DRAM DIMMs and 24@128TB NVDIMMs.  Also, following other guidelines in the same SAP Note and the SAP HANA Administration Guide, HPE most likely placed the column store on NVDIMMS with row store, caches, intermediate and final results calculations on DRAM DIMMs.

BWH is a benchmark composed of 1.3 billion records which can easily be loaded into a 1TB system with room to spare.  To achieve larger configurations, vendors can load the same 1.3B records a second, third or more times, which HPE did a total of 5 times to get to 6.5B records.  The column compression dictionary tables, only grow with unique data, i.e. do not grow when you repeat the same data set regardless of the number of times it is added.

BWH includes 3 phases, a load phase which represents data ingestion from ERP, a parallel query phase and a sequential, single user complex query phase.  Some have focused on the ingestion and complex query phases, because they show the most degradation in performance vs. DRAM.  While that is tempting, I believe the parallel query phase is of the most relevance.  During this phase, 385 queries of low, medium and high complexity (no clue as to how SAP defines those complexities, what their SQL looks like or how many of each type are included) are run, in parallel and randomly.  After an hour, the total count of queries completed is reported. In theory, the larger the database, the fewer the queries that could be run per hour as each query would have more data to traverse.  However, that is not what we see in these results.

Lenovo, once again, provides the best insights here.  With Skylake processors, they reported two results.  On the first, they loaded 1.3B records, on the second 5.2B records or 4 times the number of rows with only twice the memory.  One might predict that queries per hour would be 4 times or more worse considering the non-proportionate increase in memory.  The results, however, show only a little over 2x decrease in Query/hr. Dell reported a similar set of results, this time with Cascade Lake, also with only real memory and also only around 2x decrease in Query/hr for 4X larger number of records.

What does that tell us? It is impossible to say for sure. From the SAP NVRAM white paper referenced earlier, “One can observe that some of the queries are more sensitive to the latency of the persistent memory than others. This can be explained by multiple factors:

  1. Does the query exhibit a memory access pattern that can easily prefetch by the hardware
  2. prefetchers? Is the working set of queries small enough to fit in CPU
  3. cache and hence agnostic to persistent memory latency? Is processing of the query compute or latency bound?”

SAP stores results in the “Static Cache”. “The static result cache is particularly helpful in the following scenario:  Complex query based on a view; Rather small result set; Limited amount of changes in the underlying tables.  The static result cache can provide the following advantages: Reduction of CPU consumption; Reduction of SAP HANA thread utilization; Performance improvements[v]

Other areas like delta storage, caches, intermediate result sets or row store remain solely in dynamic RAM (DRAM) is usually stored in DRAM, not NVDIMMs.[vi]

The data in BWH is completely static.  Some queries are complex and presumably based on views.   Since the same queries execute over and over again, prefetchers may become especially effective.  It may be possible that some or many of the 385 queries in BWH may be hitting the results cache in DRAM.  In other words, after the first set of queries run, a decent percentage of accesses may be hitting only the DRAM portion of memory, masking much of the latency and bandwidth issues of NVRAM.  In other words, this benchmark may actually be testing CPU power against a set of results cached in working memory more than actual query speed against column store.

So, let us now consider the HPE benchmark with NVDIMMs.  On the surface, 6% degradation with NVDIMMs vs. all DRAM seems improbable considering NVDIMM higher latency/lower bandwidth.  But after considering the above caching, repetitive data and repeating query set, it should not be much of a shock that this sort of benchmark could be masking the real performance effects.  Then we should consider the quote from Lenovo’s white paper above which said that NVDIMMs are a great technology for read intensive, sequential workloads.

Taken together, while not definitive, we can deduce that a real workload using more varied and random reads, against a non-repeating set of records might see a substantially different query throughput than demonstrated by this benchmark.

Believe it or not, there is even more detail on this subject, which will be the focus of a part 2 post.

 

[i]https://www.pcper.com/news/Storage/Intels-Optane-DC-Persistent-Memory-DIMMs-Push-Latency-Closer-DRAM

[ii]https://lenovopress.com/lp1083.pdf

[iii]http://www.vldb.org/pvldb/vol10/p1754-andrei.pdf

[iv]https://www.sap.com/dmc/exp/2018-benchmark-directory/#/bwh

[v]https://launchpad.support.sap.com/#/notes/2336344

[vi]https://launchpad.support.sap.com/#/notes/2700084

May 20, 2019 Posted by | Uncategorized | , , , , , , , , , , , | Leave a comment