SAPonPower

An ongoing discussion about SAP infrastructure

Optane DC Persistent Memory – Proven, industrial strength or full of hype – Detail, part 1

Several non-Intel sites suggest that Intel’s storage class memory (Lenovo abbreviates these as DCPMM, while many others refer to them with the more generic term NVDIMM) delivers a read latency roughly 5 times higher than DRAM’s, e.g. 350 nanoseconds for NVDIMM vs. 70 nanoseconds for DRAM.[i]  A much better analysis comes from Lenovo, which examined a variety of load conditions and published the results in a white paper.[ii]  Here are some of the results:

  • A socket fully populated with 6 DCPMMs could deliver up to 40GB/s read throughput and 15GB/s write throughput
  • Each additional pair of DCPMMs delivered proportional increases in throughput
  • Random reads had a load-to-use latency roughly 50% higher than that of sequential reads (see the sketch after this list)
  • Random reads had a max per socket (6x DCPMM) throughput that was between 10 and 13GB/s compared to 40 to 45GB/s for sequential reads
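
For intuition on why access pattern matters this much, here is a rough Python sketch (mine, not from Lenovo’s paper; the array size and iteration counts are arbitrary).  It times the same summation with sequential vs. shuffled indices; the absolute numbers are dominated by interpreter overhead, but the gap largely reflects how well hardware prefetchers handle the sequential pass:

    # Rough illustration only: time the same summation with sequential vs.
    # random access order. Absolute figures reflect Python overhead, but the
    # sequential/random gap mirrors hardware prefetcher behavior.
    import random
    import time
    from array import array

    N = 10_000_000                  # ~80 MB of 8-byte ints, far larger than CPU cache
    data = array("q", range(N))

    rand_idx = list(range(N))
    random.shuffle(rand_idx)        # same indices, random order

    def timed_sum(indices):
        start = time.perf_counter()
        total = 0
        for i in indices:
            total += data[i]        # one read per index; the pattern is all that differs
        return time.perf_counter() - start

    t_seq = timed_sum(range(N))     # sequential pass
    t_rand = timed_sum(rand_idx)    # random pass
    print(f"sequential: {t_seq:.2f}s  random: {t_rand:.2f}s  "
          f"ratio: {t_rand / t_seq:.1f}x")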

The most interesting quote from this section was: “Overall, workloads that are more read intensive and sequential in nature will see the best performance.”  This echoes the quote from SAP’s NVRAM white paper: “From the perspective (of) read accesses, sequential scans fare better in NVRAM than point reads: the cache line pre-fetch is expected to mitigate the higher latency.”[iii]

The next section is even more interesting.  Some of its results comparing the performance of DRAM to DCPMM (DRAM’s advantage in each case) were:

  • Almost 3x better max sequential read bandwidth
  • Over 5x better max random read bandwidth
  • Over 5x better max sequential 2:1 R/W bandwidth
  • Over 8x better max random 2:1 R/W bandwidth
  • Latencies for DCPMM in the random 2:1 R/W test hit a severe knee of the curve, showing max latencies over 8x those of DRAM at very light bandwidth loads
  • DRAM, by comparison, continued to deliver significantly increasing bandwidth with only a small amount of latency degradation until it hit a knee of the curve at over 10x the max DCPMM bandwidth

Unfortunately, this is not a direct indication of how an application like HANA might perform.  For that, we have to look at available benchmarks.  To date, none of the SD benchmarks have utilized NVDIMMs.  Lenovo published a couple of BWH results, one with and one without NVDIMMs, but used different numbers of records, so they are not directly comparable.  HPE, on the other hand, published a couple of BWH results using the exact same systems and numbers of records.[iv]  Remarkably, only a small 6% performance degradation occurred in the parallel query execution phase of the benchmark when going from an all-DRAM 3TB configuration to a mixed 768GB DRAM/3TB NVDIMM configuration.

The exact configuration is not shown on the public web site, but we can assume something about the config based on SAP Note 2700084 – FAQ: SAP HANA Persistent Memory: “To achieve highest memory performance, all DIMM slots have to be used in pairs of DRAM DIMMs and persistent memory DIMMs, i.e. the system must be equipped with one DRAM DIMM and one NVDIMM in each memory channel.”  Vendors submitting benchmark results do not have to follow these guidelines, but if HPE did, then they used 24@32GB DRAM DIMMs and 24@128GB NVDIMMs.  Also, following other guidelines in the same SAP Note and the SAP HANA Administration Guide, HPE most likely placed the column store on NVDIMMs, with row store, caches, and intermediate and final results calculations on DRAM DIMMs.
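
As a quick sanity check on that assumed configuration (the DIMM counts are my inference from the SAP Note, not a published spec), the capacities work out to the benchmark’s stated totals:

    # Hypothetical HPE configuration inferred from SAP Note 2700084's pairing
    # guideline: one DRAM DIMM plus one NVDIMM in each memory channel.
    dram_dimms, dram_gb = 24, 32
    nv_dimms, nv_gb = 24, 128

    print(f"DRAM:   {dram_dimms} x {dram_gb}GB  = {dram_dimms * dram_gb}GB")   # 768GB
    print(f"NVDIMM: {nv_dimms} x {nv_gb}GB = {nv_dimms * nv_gb // 1024}TB")    # 3TB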

BWH is a benchmark composed of 1.3 billion records, which can easily be loaded into a 1TB system with room to spare.  To achieve larger configurations, vendors can load the same 1.3B records a second, third or more times, which HPE did a total of 5 times to get to 6.5B records.  The column compression dictionary tables only grow with unique data, i.e. they do not grow when you repeat the same data set, regardless of how many times it is added.
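
A toy Python sketch of dictionary encoding (illustrative only, not HANA’s actual implementation) shows why: reloading the same records grows the encoded column but leaves the dictionary untouched:

    # Dictionary encoding in miniature: values map to integer codes and the
    # encoded column stores only codes. Repeating the data set adds rows but
    # no new dictionary entries, mirroring the BWH data-multiplication trick.
    records = [f"MATERIAL-{i % 1000}" for i in range(10_000)]  # 1,000 unique values

    dictionary = {}   # value -> integer code
    encoded = []      # the compressed column
    for load in range(5):             # load the same set five times, as HPE did
        for value in records:
            code = dictionary.setdefault(value, len(dictionary))
            encoded.append(code)
        print(f"load {load + 1}: rows={len(encoded):,}  dictionary={len(dictionary):,}")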

BWH includes 3 phases: a load phase, which represents data ingestion from ERP; a parallel query phase; and a sequential, single-user complex query phase.  Some have focused on the ingestion and complex query phases because they show the most degradation in performance vs. DRAM.  While that is tempting, I believe the parallel query phase is the most relevant.  During this phase, 385 queries of low, medium and high complexity (no clue as to how SAP defines those complexities, what their SQL looks like or how many of each type are included) are run in parallel, in random order.  After an hour, the total count of queries completed is reported.  In theory, the larger the database, the fewer the queries that could be run per hour, as each query would have more data to traverse.  However, that is not what we see in these results.

Lenovo, once again, provides the best insights here.  With Skylake processors, they reported two results: on the first, they loaded 1.3B records; on the second, 5.2B records, i.e. 4 times the number of rows with only twice the memory.  One might predict that queries per hour would be 4 times or more worse considering the non-proportionate increase in memory.  The results, however, show only a little over a 2x decrease in Query/hr.  Dell reported a similar set of results, this time with Cascade Lake, also with DRAM only, and also with only around a 2x decrease in Query/hr for a 4x larger number of records.
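
The back-of-the-envelope comparison looks roughly like this (the 2x figure is my reading of the published ratios, not an exact number):

    # Naive expectation: query throughput drops at least in proportion to the
    # data volume. The published Lenovo and Dell pairs show roughly half that.
    small_run, large_run = 1.3e9, 5.2e9
    data_growth = large_run / small_run     # 4x the records
    observed_drop = 2                       # "a little over 2x" fewer Query/hr

    print(f"data grew {data_growth:.0f}x; a naive model predicts >= {data_growth:.0f}x fewer queries/hr")
    print(f"reported: only ~{observed_drop}x fewer, which hints at caching effects")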

What does that tell us? It is impossible to say for sure. From the SAP NVRAM white paper referenced earlier, “One can observe that some of the queries are more sensitive to the latency of the persistent memory than others. This can be explained by multiple factors:

  1. Does the query exhibit a memory access pattern that can easily be prefetched by the hardware prefetchers?
  2. Is the working set of queries small enough to fit in CPU cache and hence agnostic to persistent memory latency?
  3. Is processing of the query compute or latency bound?”

SAP stores results in the “Static Cache”.  “The static result cache is particularly helpful in the following scenario:  Complex query based on a view; Rather small result set; Limited amount of changes in the underlying tables.  The static result cache can provide the following advantages: Reduction of CPU consumption; Reduction of SAP HANA thread utilization; Performance improvements.”[v]

Other areas like delta storage, caches, intermediate result sets and row store remain solely in dynamic RAM (DRAM), not NVDIMMs.[vi]

The data in BWH is completely static.  Some queries are complex and presumably based on views.  Since the same queries execute over and over again, prefetchers may become especially effective.  It is also possible that some or many of the 385 queries in BWH are hitting the results cache in DRAM.  In other words, after the first set of queries run, a decent percentage of accesses may be hitting only the DRAM portion of memory, masking much of the latency and bandwidth issues of NVRAM.  Put another way, this benchmark may actually be testing CPU power against a set of results cached in working memory more than actual query speed against the column store.
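
A minimal memoization sketch (illustrative only; HANA’s static result cache is configured via query hints and parameters, not application code) shows how quickly a fixed query mix degenerates into cache hits:

    # With 385 distinct queries run over and over, only the first pass touches
    # the column store; later executions are served from DRAM-resident results.
    result_cache = {}
    scans = 0

    def execute_against_column_store(sql):
        global scans
        scans += 1                  # stands in for an NVDIMM-resident scan
        return f"result of {sql}"

    def run_query(sql):
        if sql not in result_cache:
            result_cache[sql] = execute_against_column_store(sql)
        return result_cache[sql]

    workload = [f"Q{i % 385}" for i in range(3_850)]   # a fixed mix, repeated 10x
    for q in workload:
        run_query(q)

    print(f"{len(workload)} executions, only {scans} column-store scans")  # 3850 vs. 385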

So, let us now consider the HPE benchmark with NVDIMMs.  On the surface, a mere 6% degradation with NVDIMMs vs. all DRAM seems improbable considering NVDIMMs’ higher latency and lower bandwidth.  But after considering the above caching, the repetitive data and the repeating query set, it should not be much of a shock that this sort of benchmark could be masking the real performance effects.  Add to that the quote from Lenovo’s white paper above, which said that read-intensive, sequential workloads see the best performance on this technology.

Taken together, while not definitive, these results suggest that a real workload using more varied and random reads against a non-repeating set of records might see substantially different query throughput than demonstrated by this benchmark.

Believe it or not, there is even more detail on this subject, which will be the focus of a part 2 post.

 

[i] https://www.pcper.com/news/Storage/Intels-Optane-DC-Persistent-Memory-DIMMs-Push-Latency-Closer-DRAM

[ii] https://lenovopress.com/lp1083.pdf

[iii] http://www.vldb.org/pvldb/vol10/p1754-andrei.pdf

[iv] https://www.sap.com/dmc/exp/2018-benchmark-directory/#/bwh

[v] https://launchpad.support.sap.com/#/notes/2336344

[vi] https://launchpad.support.sap.com/#/notes/2700084

May 20, 2019 - Posted by | Uncategorized
