optane « SAPonPower

POWER10 – Memory Sharing and how HANA customers will benefit

As an in-memory database, SAP HANA is obviously limited by access to memory. Having massive CPU throughput with a small amount of memory could be useful for a an HPC application that needs to crunch through trillions of operations on a small amount of data. By comparison, a HANA system typically scales up with both CPU and memory at the same time.

Intel attempted to solve this problem through the use of large scale persistent DIMMs. Unfortunately, they delivered a completely unbalanced solution with their Cascade Lake processors that included a small incremental performance increase coupled with Optane DIMMs which are 3 to 5 times slower than DDR4 DIMMs (at best). By the way, the new “Barlow Pass” Optane DIMMs, that will be available with next gen Copper Lake and Ice Lake systems, will reportedly only deliver 15% bandwidth improvement over today’s “Apache Pass” DIMMs.[i] Allow me to clap with one hand at that yawner of an improvement. Their solution is somewhat analogous to a transportation problem where a road has 2 lanes and is out of capacity. You can increase the horsepower of each vehicle a bit and pack in far more seats in each vehicle, but it will likely to take longer to get all of the various passengers in different vehicles to their destination and they will most assuredly be much more uncomfortable.

IBM with POWER10, but comparison, attacked this problem by addressing all aspects simultaneously. As mentioned in part 1, POWER10 sockets have the potential of delivering 3 times the workload of POWER9 sockets. So, not a small incremental improvement as with Cascade Lake, but a massive one. Then they increased the bandwidth to memory by at least 4x, meaning they can keep the CPUs fed with data … and in case memory can’t keep up, they added support for DDR5 and its much faster speeds and throughput. Then they increased socket to socket communications bandwidth by a factor of four since transactional and analytic workloads, like HANA, tend to be spread across sockets and often need to access data from another socket. And just in case the system runs out of DIMM sockets, they introduced a new capability, “memory clustering” or “memory inception”[ii] which allows memory on another physical system to be accessed remotely (more on this later) with a 50 to 100ns latency hit[iii]. And just to make sure that I/O did not become the next bottleneck, they have doubled down on their previous leadership with being the first major vendor to support PCIe Gen4 by including support for PCIe Gen5 with a potential for twice the I/O throughput.

Using the previous analogy, IBM attacked the problem by tripling the horsepower of each vehicle with lots of extra doors and comfortable seats, quadrupling the number of lanes on the road and enabled each vehicle to support tandem additions. In other words, everybody can get to their destinations much faster and in great comfort.

So, what is this memory clustering? Put simply, it is an IBM developed technology which enables a VM on one system to map memory allocated to it by PowerVM on another system as if it was locally attached. In this way, a VM which requires more memory than is available on the system upon which it is running can be provided with that memory from one or more systems in the cluster. It does this through the same PowerAXON (IBM’s SMP interconnect technology) as is used within each system across sockets. As a result, the projected additional latency is likely to be only slightly higher than accessing memory across the NUMA fabric.

IBM described multiple different potential topologies, ranging from “Enterprise Class” at extreme bandwidth, to hub and spoke; mixing and matching CPU heavy with Memory heavy nodes to even multi-hop pod-level clustering of potentially thousands of nodes. With POWER10 featuring a 2 Petabyte memory addressability, the possibilities are mind boggling.

For HANA workloads, I see a range of possibilities. The idea that a customer could extend memory across systems is the utopia of “never having to say I’m sorry”. By that, I mean that in the bad old days (current times that is), if you purchased, for example, a 2TB system with all DIMM slots used, and your HANA instance needed a tiny amount more memory than available on the system, you had three choices: 1) let HANA deal with insufficient memory and start moving columns in and out of memory with all of the associated performance impact implied, 2) move the workload to a larger system at a substantial cost and a loss of the existing investment (which always brings a smile and hug from the CFO) or 3) if possible, shut down the instance and the system, rip out all existing DIMMs and replace them with larger ones (even more disruptive and still very expensive).

With memory clustering, you could harvest unused capacity in your cluster at no incremental cost. Or, if all memory was in use, you could resize a less important workload, e.g. a HANA sandbox VM or non-prod app server, and reallocate it to the production VM requiring more memory. Or you could move a less important workload to a different server potentially in a different data center or perhaps a much smaller system and reuse the memory using clustering. Or you could purchase a small, low GHz, small number of activated cores system to add to the cluster with plenty of available memory to be used by the various VMs in the cluster. The possibilities are endless, but you will notice, having to say to management that “I blew it” was not one of the options.

Does this take the place of “Storage Class Memory” (SCM) aka persistent memory? Not at all. In fact, POWER10 has explicit support for SCM DIMMs. The question is more of whether SCM technology is ready for HANA. At 3 to 5 times worse latency than DRAM, Intel’s SCM, Optane, most certainly is not. In fact, I call it highly irresponsible to promote a technology with barely a mention of the likely performance drawbacks as has been done by Intel and their merry band of misinformation brethren, e.g. HPE, Cisco, Dell, etc.

I prefer IBM’s more measured approach of supporting technology options, encouraging openness and ecosystem innovation, and focusing on delivering real value with solutions that make sense now as opposed to others’ approaches that can lead customers down a path where they will inevitably have to apologize later when things don’t work as promised. I am also looking forward to 2021 to see what sort of POWER10 systems and related infrastructure options IBM will announce.

[i] https://www.tomshardware.com/news/intel-barlow-pass-dimm-3200mts-support-15w-tdp

[ii] https://www.crn.com/news/components-peripherals/ibm-power10-cpu-s-memory-inception-is-industry-s-holy-grail-

https://www.servethehome.com/ibm-power10-searching-for-the-holy-grail-of-compute/hot-chips-32-ibm-power10-memory-clustering-enterprise-scale-memory-sharing/

[iii] https://www.hpcwire.com/2020/08/17/ibm-debuts-power10-touts-new-memory-scheme-security-and-inferencing/

August 19, 2020 Posted by Alfred Freudenberger | Uncategorized | @ibmchampions, HANA, IBM, memory inception, memory sharing, optane, persistent memory, Power Systems, POWER10, SAP, SCM | Leave a comment

Does Intel’s Optane DC Persistent Memory decrease TCO for SAP?

When this new type of persistent memory DIMM (PMEM) was announced by Intel about a year ago, improving restart times was the most important factor cited by Intel and vendors of systems that utilize Intel Cascade Lake processors. Some of my previous blog posts have discussed the performance issues of PMEM and despite numerous searches, I can find no data presented by Intel or any other vendor to suggest that any improvement has occurred since this technology was made generally available. Over time, and perhaps as more customers realized that faster restarts at the cost of slower operational performance might not be very compelling, the message started to morph into saving money.

Regarding TCO specifically for SAP Suite on HANA (SoH) and S/4HANA, let’s start with the basic assertion, i.e. PMEM is less expensive than DRAM. This is documented by pricing which shows a 128GB PMEM DIMM costs approximately 60% of the cost of a 128 DRAM DIMM[i] on one site and 40%[ii] on another site. This discrepancy may result when one vendor shows effective prices and another list prices with the list price example showing a higher cost savings with PMEM.

I was interested to see what would happen with actual SAP instances. For comparison, let us start with a conventional DRAM memory system and assume that after using appropriate sizing tools, we have determined that an SoH or S/4HANA system requires a total of 6TB of memory to support 3TB of data with 3TB dedicated to system and HANA working memory. I chose 6TB because this fits perfects on most Intel systems using 4 processors and 48@128GB memory DIMMs. This config also has the added bonus of no waste at all and maximized performance since parallelism is optimized when every memory channel is used.

By comparison, we need to figure out how much memory is required if we utilize PMEM. The SAP note on persistent memory[iii] describes ratios of DRAM to PMEM ranging from 2:1 to 1:4. For SoH and S/4HANA, the advice given is to run QuickSizer, /SDF/HDB_SIZING or ZNEWHDB_SIZE depending on where you are starting from. I asked 3 different customers, one small, one medium and one very large, to provide me with the output of their sizing reports based on existing ECC systems. I have included two key sections for the midsized customer:

Screen Shot 2020-02-04 at 11.57.33 AM

The Persistent Memory FAQ[iv] says: “Persistent memory can be used for the main storage of column store table that is typically the dominating factor of data space consumption in SAP HANA environments. Other areas like delta storage, caches, intermediate result sets or row store remain solely in dynamic RAM (DRAM). Disk LOBs (SAP Note 2220627) are also not part of the persistent memory.” If you add up the numbers above using this rule, you may notice that this means that when using persistent memory, the amount of data housed in PMEM vs. DRAM does not fit with any of the ratios mentioned earlier. Looking at the sizing reports that I obtained, the amount of PMEM vs. DRAM was more in the range of 1:1.5[v].

Now, let’s apply the very best ratio of the three reports, i.e. the very large customer, to our 6TB example above and we see that we need 6TB x .433 = 2.53TB PMEM and 6TB x .567 = 3.47TB DRAM. Assuming 128GB DIMMs, this translates to 20.2 PMEM DIMMs and 27.8 DRAM DIMMs which rounded up comes to 21 and 28 DIMMs, i.e. 49 DIMMs total. Clearly, this is one more than the max number (48) in a 4-socket system. In addition, SAP note 2786237 states that a configuration must have: “Homogeneous symmetrical assembly of DRAM and PMEM DIMMs with maximum utilization of all memory channels per processor”, so the minimum configuration would be 28 of each type of DIMM for a total of 56 DIMM slots.

To the best of my knowledge, no Cascade Lake system supports this number of DIMM slots. Several vendors support 64 DIMM slots on a 6 or 8-socket system. Those that do not would require a 96 DIMM slots configuration. At 64 DIMM slots, this configuration would waste the difference between the HANA memory requirement and the system configuration requirement, i.e. 540GB of DRAM and 1,508GB of PMEM would be wasted. At 96 DIMM slots, the waste would be 2,588GB and 3,556GB respectively. With either a 64 DIMM slot or 96 DIMM slot configuration, instead of a relatively affordable 4-socket system, a significantly more expensive 6 or 8-socket system would be required.

I chose to use the best pricing that I could find for DIMM prices assuming that other vendors would be able to match these prices. I then applied that pricing to those vendors that can utilize 64 DIMM slots on a 6 or 8-socket configuration. After a simple calculation[vi], the cost of just the memory of the DRAM+PMEM system came out $7,648 higher than the DRAM only system. And remember, this is before adding in any additional costs for more processors and a system which can support more processors.

Of course, 256GB DRAM DIMMs could be used, reducing the DRAM DIMM count to 15, but this raises a thorny issue; No appliance has been certified by SAP[vii] with 256GB DRAM DIMMs. Even if we ignored that issue and went out on a limb using TDI V5 relaxed rules, the significantly higher cost of 256GB DRAM DIMMs over 128GB DRAM DIMMs[viii] plus the need to round up to 24 DIMM slots would result in a configuration that was still substantially higher cost than the DRAM only configuration.

Any way that you cut it, the use of PMEM in a realistic SoH or S/4HANA configuration results in a higher cost of acquisition than a DRAM only configuration. In other words, as shown in the previous blog posts, performance takes a major hit when using PMEM for HANA, it does not save any money and actually costs more and the only potential gain comes from faster restarts.

[i] https://www.dell.com/en-us/work/shop/cty/pdp/spd/poweredge-r940/pe_r940_12229_vi_vp?configurationid=0163c707-0003-46a0-808a-3b55c864ba70

[ii] https://dcsc.lenovo.com/#/configuration/cto/7X13CTO1WW?hardwareType=server

[iii] https://launchpad.support.sap.com/#/notes/2786237

[iv] https://launchpad.support.sap.com/#/notes/2700084

[v] actual range was 41.5% to 43.3% for PMEM versus 58.5% to 56.7% for DRAM based on the small to very large reports

[vi] 48 x $2,670 = 128,160 (DRAM only), 32 x $1,574 + 32 x $2,670 = $135,808 (DRAM + PMEM)

[vii] https://www.sap.com/dmc/exp/2014-09-02-hana-hardware/enEN/appliances.html#viewcount=100

[viii] https://dcsc.lenovo.com/#/configuration/cto/7X13CTO1WW?hardwareType=server

February 4, 2020 Posted by Alfred Freudenberger | Uncategorized | @ibmchampions, dc, HANA, intel, memory, nvram, optane, persistent, pmem, S/4HANA, SAP, SoH | 2 Comments

Optane DC Persistent Memory – Proven, industrial strength or full of hype – Detail, part 3

In this final of a three part series, we will explore the two other major “benefits” of Optane DIMMs: fast restart and TCO.

Fast restart

HANA, as an in-memory database, must be loaded into memory to perform well. Intel, for years and, apparently up to current times, has suffered with a major bottleneck in its I/O subsystem. As a result, loading a single terabyte of data into memory could take 10 to 20 minutes in a best-case scenario. Anecdotally, some customers have remarked that placing superfast, all flash subsystems, such as IBM’s FlashSystem 9100, behind an Intel HANA system resulted in little improvement in load times compared to mid-range SSD subsystems. For customers attempting to bring up a 10TB storage/20TB memory HANA system, this could result in load times measured in hours. As a result, a faster way of getting a HANA system up and running was sorely needed.

This did not appear to be a problem for customers using IBM’s Power Systems. Not only has Power delivered roughly twice the I/O bandwidth of Intel systems for years, but with POWER9, IBM introduced PCIe Gen4, further extending their leadership in this area. The bottleneck is actually in the storage subsystem and number of paths that it can drive, not in the processor. To prove this, IBM ran a test with 10 NVMe cards in PCIe slots and was able to drive load speeds into HANA of almost 1TB/min.[I]. In other words, to improve restart times, Power Systems customers need only move to faster subsystems and/or add more or faster paths.

This suggests that Intel’s motivation for NVDIMMs may be to solve a problem of their own making. But this also raises a question of their understanding of HANA. If a customer is running a transactional workload such as Suite on HANA, S/4 or C/4, and is using HANA System Replication, wouldn’t at least one of the pair of nodes be available at all times? SAP supports near zero upgrades[ii], so systems, firmware, OS or even HANA itself may be updated on one of the pair of nodes while the other continues to operate, followed by a synchronization of changed data and a controlled failover so that the first node might be updated. In this way, cold restarts of HANA, where a fast restart option might make a big difference, may be driven down into a very rare occurrence. In other words, wouldn’t this be a better option than causing poor performance to everything due to radically slower DIMMs compared to DRAM as has been discussed in gory detail on the previous two posts of this series?

HANA also offers a quick restart option whereby HANA can be started and the database made available within minutes even though all of the columns have not yet been loaded into memory. Yes, performance will be pretty bad until all columns are loaded into memory, but for non-production systems and non-mission critical systems, this might be an acceptable option. Lastly, with HANA 2.0 SPS04, SAP now supports fast restart with conventional memory.[iii] This only works when the OS stays up and running, i.e. can’t be used when the system, firmware or OS is being updated, but this can be used for the vast majority of required restarts, e.g. HANA upgrades, patches and restarts when a bounce of the HANA environment is needed. Though this is not mentioned in the help documentation, it may even be possible to patch the Linux kernel while using the fast restart option if SUSE SLES is used with their “Live Patching” function.[iv]

TCO

Optane DIMMs are less expensive than DRAM DIMMs. List prices appears to be about 40% cheaper when comparing same size DIMMs. Effective prices, however, may have a much smaller delta since there exists competition for DRAM meaning discounts may be much deeper than for the NVDIMMs from Intel, currently the only source. This assumes full utilization of those NVDIMMs which may prove to be a drastically bad assumption. Sizing guidance from SAP[v]shows that the ratio of DRAM vs. PMEM (their term for NVDIMMs) capacity can be anything from 2:1 to 1:4, but it provides no guidance as to where a given workload might fall or what sort of performance impact might result. This means that a customer might purchase NVDIMMs with a capacity ratio of 1:2, e.g. 1TB DRAM:2TB PMEM, but might end up only being able to utilize only 512GB or 1TB PMEM due to negative performance results. In that case, the cost of effective NVDIMMs would have instantly doubled or quadrupled and would, effectively, be more expensive than DRAM DIMMs.

But let us assume the best rather than the worst. Even if only a 2:1 ratio works relatively well, the cost of the NVDIMMs, if sized for that ratio, would be somewhat lower than the equivalent cost of DRAM DIMMs. The problem is that memory, while a significant portion of the cost of systems, is but one element in the overall TCO of a HANA landscape. If reducing TCO is the goal, shouldn’t all options be considered?

Virtualization has been in heavy use by most customers for years helping to drive up system utilization resulting in the need for fewer systems, decreasing network and SAN ports, reducing floor space and power/cooling and, perhaps most importantly, reducing the cost of IT management. Unfortunately, few high end customers, other than those using IBM Power Systems can take advantage of this technology in the HANA world due to the many reasons identified in the latest of many previous posts.[vi] Put another way, if a customer utilizes an industrial strength and proven virtualization solution for HANA, i.e. IBM PowerVM, they may be able to reduce TCO considerably[vii]and potentially much more than the relatively small improvement due to NVDIMMs.

But if driving down memory costs is the only goal, there are a couple of ideas that are less radical than using NVDIMMs worth investigating. Depending on RTO requirements, some workloads might need an HA option, but might not require it to be ready in minutes. If this is the case, then a cold standby server running other workloads which could be killed in the event of a system outage could be utilized, e.g. QA, Dev, Test, Sandbox, Hadoop. Since no incremental memory would be required, memory costs would be substantially lower than that required for System Replication, even if NVDIMMs are used. IBM offers a tool called VM Recovery Manager which can instrument and automate such a configuration.

Another option worth considering, only for non-production workloads, is a feature of IBM PowerVM called Memory Deduplication. After different VMs are started using “a shared memory pool”, the hypervisor builds a logical memory map. It then scans the pages of each VM looking for identical memory pages at which time it uses the logical memory map to point each VM to the same real memory page thereby freeing up the redundant memory pages for use by other workloads. If a page is subsequently changed by one of the VMs, the hypervisor simply recreates a unique real memory page for that VM. The upshot of this feature is that the total quantity of DRAM memory may be reduced substantially for workloads that are relatively static and have large amounts of duplication between them. The reason that this should not be used for production is because when the VMs start, the hypervisor has not yet had the chance to deduplicate the memory pages and, if the sum of logical memory of all VMs is larger than the total memory, paging will occur. This will subside over time and may be of little consequence to non-production workloads, but the risk to performance for production might be considered unacceptable and, besides, “Memory over-commitment must not be used” for production HANA according to SAP.

Summary

Faster restarts than may be possible with traditional Intel systems may be achieved by using near zero HANA upgrades with System Replication, HANA fast restart or by switching to a system with a radically faster I/O subsystem, e.g. IBM Power Systems. TCO may be reduced with tried and proven virtualization technologies as provided with IBM PowerVM, cold standby systems or memory deduplication rather than experimenting with version 1.0 of a new technology with no track record, unknown reliability, poor guidance on sizing and potentially huge impacts to performance.

[i]https://www.ibm.com/downloads/cas/WQDZWBYJ

[ii]https://launchpad.support.sap.com/#/notes/1984882

[iii]https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.04/en-US/ce158d28135147f099b761f8b1ee43fc.html

[iv]https://launchpad.support.sap.com/#/notes/1984787

[v]https://launchpad.support.sap.com/#/notes/2786237

[vi]https://saponpower.wordpress.com/2018/09/26/vmware-pushes-past-4tb-sap-hana-limit/

[vii]https://www.ibm.com/downloads/cas/M7X2YXZD

June 3, 2019 Posted by Alfred Freudenberger | Uncategorized | deduplication, dimms, fast, HANA, IBM, intel, memory, NVDIMM, optane, persistent, pmem, Power Systems, PowerVM, restart, TCO | 1 Comment

Optane DC Persistent Memory – Proven, industrial strength or full of hype – Detail, part 2

If the performance considerations from part 1 were the only issues, a reasonable case could be made for the potential value of doing a PoC with this technology. But, of course, those are not the only issues. One of the reasons that NVDIMMs have longer latencies than DRAM is due to their persistence and therefore the need to encrypt data placed on these components. Encryption and decryption take a lot of computational power and can have a substantial impact on latency and bandwidth. The funny thing is that encryption of these NVDIMMs can be turned off if desired, presumably with a resulting improvement to performance. But what kind of customer would be willing to turn off this vital security technology?

Another desirable trait of modern, in-memory platforms is advanced memory protection which allows a system to continue to operate in the event of a DIMM failure. This often starts with basic ECC, but then progresses to SDDC, DDDC (Chipkill or Lockstep), ADDDC (Skylake and beyond only) and IBM’s unique Chipkill + chip sparing technology. ADDDC is not available for NVDIMMs, but DDDC is. The downside of DDDC is that it comes with a significant performance penalty. No performance numbers have been provided for NVDIMMs configured with DDDC, but previous generations saw 20% to 40% degradation when using this mode.[i][ii]

What kind of customer would be willing to disable key security features or run critical systems without the best available reliability technologies? I would certainly advise customers to use encryption and advanced reliability technologies in most circumstances. Only those customers that can scramble business critical, PII and/or HIPAA data should ever consider disabling persistent memory encryption. I searched, using every option that I could imagine, and failed to find a single web site that recommended ever disabling NVDIMM encryption.

SAP Benchmarks results posted on the external web site do not show the details of how security and reliability configuration parameters have been set. It is therefore impossible to say whether HPE enabled or disabled these protection features. In my many years of experience and extensive discussion with benchmarking experts, I can share that every single one, at every vendor, used every tool or technology that did not violate official rules to enhance results. It would not be too much of a leap to project that HPE, and other vendors posting results with NVDIMMs, have likely disabled anything that might cause their results to diminish in any way. (HPE, if you would like to share your configuration details, I would be happy to post them and if I have mischaracterized how you ran these benchmarks, will also post a retraction.) As a result, these BWH results may not only have relevance to only a small subset of the potential workloads but may also represent an unacceptable exposure to any company that has high single system availability requirements or has one of those unreasonable security departments which thinks that data protection is actually worthwhile.

And then, there are OLTP customers. Based on the lack of benchmark testing of Suite on HANA, S/4HANA or C/4HANA combined with the above data from Lenovo about the massive reduction of bandwidth and associated huge increase in latency for OLTP, it would be MOST unwise to place any of these types of environments on systems with NVDIMMs without extensive testing of real customer workloads to ensure that internal performance SLAs can be met.

Certain types of workloads may perform decently with NVDIMMs. BW environments where the primary use is for predictable and repeatable queries and reports may see only moderate performance degradation compared to DRAM based systems, but still orders of magnitude better performance that AnyDB systems which merely cache recently used data in memory and keep most data on external storage. BW Extension nodes, S/4 Data aging objects and other types of archival systems that take older, less frequently used data and place them on other tiers of storage or systems, could certainly benefit from NVDIMMs. Non-prod workloads which are not in the critical path to production, e.g. dev, test, sandbox, might make sense to place on systems with NVDIMMs. All of these depend on an acceptance of potential performance issues and hardware/firmware/software fixes that inevitably come once customers start playing with version 1.0 of any new technology.

Based on likely performance issues, inferior RAS technology and the above mentioned “fix” dilemma, I would strongly advise that critical systems like production, QA, pre-prod, HA and DR should stay on DRAM based systems until bleeding edge customers prove the value of NVDIMMs and are willing to publicly share their journey.

The question then becomes whether the benefit to a subset of the environments are so substantial that it makes sense to select a vendor for HANA systems based on their ability to utilize NVDIMMs even when this technology might not be used for the most critical of the workloads and their associated critical path and HA/DR systems. This gets into the subjects of cost reduction and restart speeds which will be covered in part 3 of this series.

[i]https://lenovopress.com/lp0048.pdf

[ii]https://sp.ts.fujitsu.com/dmsp/Publications/public/wp-broadwell-ex-memory-performance-ww-en.pdf

May 27, 2019 Posted by Alfred Freudenberger | Uncategorized | bwh, DCPMM, dram, HANA, hpe, intel, lenovo, memory, nvdimms, optane, persistent, SAP | Leave a comment

Optane DC Persistent Memory – Proven, industrial strength or full of hype?

“Intel® Optane™ DC persistent memory represents a groundbreaking technology innovation” says the press release from Intel. They go on to say that it “represents an entirely new way of managing data for demanding workloads like the SAP HANA platform. It is non-volatile, meaning data does not need to be re-loaded from persistent storage to memory after a shutdown. Meanwhile, it runs at near-DRAM speeds, keeping up with the performance needs and expectations of complex SAP HANA environments, and their users.” and “Total cost of ownership for memory for an SAP HANA environment can be reduced by replacing expensive DRAM modules with non-volatile persistent memory.” In other words, they are saying that it performs well, lowers cost and improves restart speeds dramatically. Let’s take a look at each of these potential benefits, starting with Performance, examine their veracity and evaluate other options to achieve these same goals.

I know that some readers appreciate the long and detailed posts that I typically write. Others might find them overwhelming. So, I am going to start with my conclusions and then provide the reasoning behind them in a separate posts.

Conclusions

Performance

Storage class memory is an emerging type of memory that has great potential but in its current form, Intel DC Persistent Memory, is currently unproven, could have a moderate performance impact to highly predictable, low complexity workloads; will likely have a much higher impact to more complex workloads and potentially a significant performance degradation to OLTP workloads that could make meeting performance SLAs impossible.

Some workloads, e.g. aged data in the form of extension nodes, data aging objects, HANA native storage extensions, data tiering or archives could be placed on this type of storage to improve speed of access. On the other hand, if the SLAs for access to aged data do not require near in-memory speeds, then the additional cost of persistent memory over old, and very cheap, spinning disk may not be justified.

Highly predictable, simple, read-only query environments, such as canned reporting from a BW systems may derive some value from this class of memory however data load speeds will need to be carefully examined to ensure data ingestion throughput to encrypted persistent storage allow for daily updates within the allowed outage window.

Restart Speeds

Intel’s Storage Class memory is clearly orders of magnitude faster than external storage, whether SSD or other types of media. Assuming this was the only issue that customers were facing, there were no performance or reliability implications and no other way to address restart times, then this might be a valuable technology. As SAP has announced DRAM based HANA Fast Restart with HANA 2.0 SPS04 and most customers use HANA System Replication when they have high uptime requirements, the need for rapid restarts may be significantly diminished. Also, this may be a solution to a problem of Intel’s own making as IBM Power Systems customers rarely share this concern, perhaps because IBM invested heavily in fast I/O processing in their processor chips.

TCO

On a GB to GB comparison, Optane is indeed less expensive than DRAM … assuming you are able to use all of it. Several vendors’ and SAP’s guidance suggest you populate the same number of slots with NVDIMMs as are used for DRAM DIMMs. SAP recommends only using NVDIMMs for columnar storage and historic memory/slot limitations are largely based on performance. This means that some of this new storage may go unused which means the cost per used GB may not be as low as the cost per installed GB.

And if saving TCO is the goal, there are dozens of other ways in which TCO can be minimized, not just lowering the cost of DIMMs. For customers that are really focused on reducing TCO, effective virtualization, different HA/DR methodologies, optimized storage and other associated IT cost optimization may have as much or more impact on TCO as may be possible with the use of storage class memory. In addition, the cost of downtime should be included in any TCO analysis and since this type of memory is unproven in wide spread and/or large memory installations, plus the available memory protection is less than is available for DRAM based DIMMs, this potential cost to the enterprise may dwarf the savings from using this technology currently.

May 13, 2019 Posted by Alfred Freudenberger | Uncategorized | dc, DCPMM, dimms, dram, HANA, in-memory, memory, non-volatile, NVDIMM, optane, performance, persistent, SAP, TCO | 1 Comment

Persistent Memory for HANA @ SapphireNow Orlando 2018

Once again, Intel and the companies that utilize their processors were all abuzz at Sapphire about Intel Optane DC Persistent Memory (PMEM). This is the second year in a row that they have been touting this future technology and its ability to fit into a DIMM form factor and take the place of some of the main memory currently supply by DRAM. I was intrigued until I saw Hasso Plattner, at SapphireNow 2018 Orlando, explain how HANA would utilize this technology. He showed a chart where a 6TB HANA DB startup time of 50 minutes reduced to 4 min with a 50/50 mix of standard DRAM DIMMs and the new Intel PMEM DIMMs. As he explained it, HANA column store would reside in PMEM, while working space and delta store tables would reside in DRAM.

50 minutes down 4 minutes sounds outstanding, but let’s see if we can pull back the veil a bit.

Who created the chart? Dr. Plattner was vague about this. He suggested that it might be from an internal test. When I asked multiple vendors, including Intel, if any parts were available for customer testing, I was told no and that it would require Cascade Lake, Intel’s next version of the current Skylake chips, to drive these new DIMMs. I suspect that Dr. Plattner was referring to the 50 minutes as being from an internal test. This means that the 4 minute projection with PMEM may have come from another source, e.g. Intel.

Why 50 minutes? It might seem reasonable to assume that if this was an internal test, that SAP knows how to configure a system properly, so it was probably using best of breed SSD technology, e.g. Intel’s SSD 750. 50 minutes works out to roughly 122GB/min after HANA SW load. IBM published a white paper in which Power Systems achieved approximately 172GB/min (30% faster) with a typical mid-range SSD subsystem and almost 1 TB/min with NVMe based SSDs, i.e. 740% faster.[i] In other words, if 50 minutes for 6TB is longer than acceptable, Power Systems can already deliver radically faster startup time without using esoteric and untested memory concepts.

For the Intel world, getting down from 50 minutes to 4 minutes would be quite a feat, but how often is this sort of restart likely to happen? Assuming an SAP client is not using one of the near-zero maintenance options, this depends on the frequency of such updates but most typically a couple of times per year. More often, for Intel customers, predictive failure analysis on memory will call out memory DIMMs for replacement once or twice a month, so more frequent reboots may be required. Of course, using a more reliable memory technology such as that offered in Power Systems could alleviate this requirement. It is ironic that the use of unreliable DRAM memory options in x86 systems could be the very cause of why faster restarts are needed.

Speaking of reliability, is PMEM RAID protected like disk? The answer, based on what has been published so far, appears to be no. In other words, if a PMEM DIMM were to fail, not only could this cause the system to fail, but since this would result in an incomplete or corrupted memory image, reload from the storage subsystem would still be required. Even more irony that the fast restart functionality of PMEM would be of no use when PMEM itself is the cause of the outage. Also, this would be the first commercial use of this technology. Good thing that Version 1.0 of anything usually works perfectly!

Next, let us consider the effect of having column store in PMEM. If SAP has said it once, they have said it 1,000 times, the “H” in “HANA” refers to “High Performance”. If you slow down access to column store by a factor of 5x or 10x, you get a cascading effect on just about every possible KPI in the system. Wait, did I say 5x or 10x? I hate it when I have to resort to quoting the source: “Intel senior vice president Rob Crooke and Micron CEO Mark Durcan declared 3D Xpoint to be 1,000 times faster and 1,000 times more durable than NAND”[ii] and third party reviews: “latencies should push down into the 1-3us range, splitting the difference between current generation DRAM (~80-100ns) and PCIe-based Optane parts (~10us)”[iii] or “As an NVDIMM, 3D XPoint memory would have approximately 20% of the speed of standard volatile DRAM.”[iv]

I just get a chuckle out of Intel’s official comment: “Unlike traditional DRAM, Intel Optane DC persistent memory will offer the unprecedented combination of high-capacity, affordability and persistence,” Lisa Spelman. Notice, Lisa does not say High Performance … good thing that is not a goal of HANA!

Ok, enough of the facts and other analysts comments, we want speculation! Got it. Let us speculate on what happens when memory is 5x slower (10x is just twice as bad). Let us also assume that the 5x or 10x slower is true and that Linux does not utilize a pseudo memory mapped file system (which it does with additional overhead).

In a highly optimized and largely hypothetical world, we will only have analytics using HANA. (Yes, I get it, that is BW or a data lake, not S/4HANA or C/4HANA but I get to determine the parameters of this made-up world.) Let’s consider an overly simplistic query example, e.g. select customer where revenue > 100000. The first 64 byte block of data movement to L3 cache only takes 1us, which at current Skylake 2.5GHz speeds means a wait of 2,500 processor cycles. For sake of argument, we are going to assume no additional latency getting into the processor, no ccNUMA effects or any other delays. The good news is that modern architectures will predict the next access and start loading subsequent data blocks, while query processing of the first block is occurring. Unfortunately, since the DIMM is already busy, this preload has to wait 2,500 processor cycles before it can start transferring the next block of data and 2,500 processor cycles is usually more than sufficient to start and finish any portion of the work against only 64 bytes of data. It is really hard to imagine how queries speed will not be significantly affected by this additional latency. Imaging taking a current HANA BW query that runs in 10 seconds and telling the users to now expect the same result in 50 or 100 seconds. Can you imagine the revolt?

Or consider transaction processing. A typical transaction might require access to data with no-preload possible since these accesses are usually random. So, this access gets a 5x or 10x delay which is radically faster than disk access but much slower than previously experienced with in-DRAM computing? The trouble is that while the transaction incurs that penalty, the processor is just sitting there waiting since this is “main memory” which means that it does not issue a query and wait for an I/O interrupt to remind it to continue processing. So, this access and every access behind it waits 2,500 cycles before continuing, assuming that everything that is required comes across in that first 64 byte chunk. Unfortunately, a transaction accesses data in rows, not in columns which means that each row transaction may involve dozens of individual column accesses, each of which will experience a 5x or 10x delay. Now, extend that to intensely random operations such as delta merge where, instead of a sub-second interruption to individual transaction response time, there is now 5x or 10x increase on all related columnar memory writes. I could continue to extrapolations in save points, batch and external interfaces, but you get the general idea. At currently projected speeds, this sort of slow down in transactional performance could result in project failure.

One other point that must not be overlooked is Intel’s claim about density. While the initial press suggested up to 10x greater density, the DIMM specifications that are currently circulating show up to 512GB per DIMM, a significant increase from the 128GB DIMM max size today (4x not 10x). But can HANA take advantage of that increased density? Prior to Skylake, SAP certified appliances with 8-sockets could only support 8TB of memory despite many having configuration maximums of 12TB. SAP certifications are dependent on meeting performance KPIs and there has always been a pretty direct correlation between numbers of sockets, performance per socket and amount of memory supported. In other words, it takes more and faster cores to support more memory. So, is it reasonable to expect that SAP will discard those KPIs and accept 5x or 10x slower speeds while also jamming 4 times as much memory per socket as is currently supported?

This is not to say that persistent memory has no place in the HANA world. There are many places in which a 5x or 10x memory penalty is worthwhile. Consider the case of a non-prod instance, e.g. test. If it takes 5x or 10x longer, there is little impact to the business operations of most companies, just an increase in the cost of IT and applications professionals. This may be offset against the cost of memory and, in some cases, the math may work. How about HA or DR? No, that does not work as HA and DR must operate like production in the case of a failure or disaster. Certainly aged data that might otherwise reside on disk would see a radical improvement from PMEM or a radically lower cost when compared to DRAM memory in BW extension nodes.

Also, consider that aggressive research is occurring in this field and that future technologies may reduce the penalty to only 2x the speed of DRAM. Would that be close enough to make it worthwhile?

One final thought: The co-inventor of 3D Xpoint memory is Micron. Earlier this year, Micron and Intel decided to go their separate ways with Micron using this technology in their QuantX solutions.[v] Micron is a member of the OpenPower Consortium. Is it possible that they could use this technology to build their own PMEM solutions for Power Systems? If that happens, it would certainly be fascinating to see IBM harvesting the value of PMEM without the marketing and research investment that Intel has put into this.

[i]https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=POS03155USEN

[ii]https://www.youtube.com/watch?v=O0JUCjd_t_0

[iii]https://www.pcper.com/news/Storage/Intel-Launches-Optane-DC-Persistent-Memory-DIMMs-Talks-20TB-QLC-SSDs

[iv]https://searchstorage.techtarget.com/feature/3D-XPoint-memory-stumbles-in-race-to-ditch-DRAM-RRAM-may-step-up

https://www.anandtech.com/bench/product/1967?vs=2067

[v]https://www.anandtech.com/show/12258/intel-and-micron-to-discontinue-flash-memory-partnership

https://www.micron.com/products/advanced-solutions/3d-xpoint-technology

June 22, 2018 Posted by Alfred Freudenberger | Uncategorized | HANA, intel, memory, optane, persistent memory, pmem, Power Systems, SAP | 3 Comments

3D XPoint Memory – The best thing for SAP HANA since HANA was invented?

At #SapphireNow, the Intel booth was all atwitter about the new “game changer”, “revolutionary”, “future of computing”, “best thing since the wheel” (ok, I made that last one up). Yes, they were thrilled with 3D XPoint Optane memory.[i] It is being positioned as persistent memory, like SSD but much faster and which can take the place of real, a.k.a. DRAM, memory … eventually. Paraphrasing them, “You will be able to replace conventional memory with 3D XPoint memory at almost the same speed but which gives you the ability to restart your system after failure in a matter of seconds, not minutes or hours, because the entire HANA image will be stored in persistent memory, not on disk or SSDs.”

This sounds fantastic as long as we completely ignore reality. Let’s dissect the above sentence.

“almost the same speed” – current speculation is that 3DXPoint memory will be about 10 times slower than conventional memory. That is WAY better than external SSD storage, which is around 1000 times slower, but for memory resident applications, like HANA, 10 times slower will result in at least a 10x performance reduction for HANA. Remember, we have no idea how this might affect an application which expects very fast access to memory.

“restart your system after failure:” – silly me, I thought the idea was to prevent failure in the first place. I am curious how often system failure is caused by memory errors or any other cause for which diagnostics might be required to evaluate the underlying problem as well a repair action to fix that problem. Then the question is in which scenario is a customer willing to circumvent diagnostics and return the system to productive use. This also assumes that customers are willing to run mission critical systems without any sort of HA solution such as HANA System Replication or HANA Host Auto-Failover. The use of an HA solution would fail-over production to a secondary system which means that any memory image on the primary system would be out of date almost instantly.

“restart … in seconds” – So, your system has failed for unknown reasons and you are willing to forgo any sort of evaluation of the underlying cause. So far so good. So, Linux is capable of restarting and keeping the memory image as it was before hand and utilizing persistent main memory? Not entirely, but with RHEL 7.3 (not supported for HANA yet), using special device drivers applications may be rewritten to utilize “pmem” for pseudo storage devices.[ii] And HANA is capable of restarting as well from whatever point it was in at the time of failure. Also, did not know HANA could do this and am surprised that SAP prioritized fast restart ahead of the long laundry list of customer provided requirements … which I doubt they did. And HANA can figure out what transactions were in flight at the time of failure, which ones had made some changes to memory, but not all, e.g. started to insert data into a delta table but perhaps had not completed this action at time of failure? Totally wicked!! … and total fantasy, at least for now.

You can easily imagine a variety of other conditions where columns are being updated, e.g. during a delta merge, but have not finished in which some columns contain updated elements and others do not. I am not saying these are insurmountable problems, but considering that you can’t even make a change to the size of a HANA system without restarting HANA currently, it is a massive stretch to imagine how SAP has or is willing to invest the time and effort to make this work for a highly questionable benefit with likely severe performance degradation.

So, 3D XPoint memory as a replacement for conventional memory is clearly all hype, but don’t expect anyone from Intel or their proponents to tell you this. How about as a technology for much faster SSDs? Now we are talking! I doubt there is any reason why this will not be quickly adopted by disk subsystem vendors and available from multiple sources.

As to whether HANA workloads will benefit, that is a different story. Remember, HANA is a read-once workload. Once a column is loaded into memory, it is never read again until unloaded and this should only occur if the memory subsystem is undersized or the system is restarted after maintenance. So, fast storage is useful for restarts, but super-fast storage is only needed when a system must return to full operation after maintenance very quickly and without any performance degradation, i.e. every column loaded into memory, in 10 minutes or so. Just as a point of comparison, IBM ran a test with 10 NVMe cards and delivered about 1TB per minute when restarting HANA. To the best of my knowledge, few customers have expressed more than a passing interest in this capability. I could imagine a scenario in which customers are willing to put a somewhat recent tier of data, e.g. 1 to 2 year old data, on persistent main memory, with perhaps external, and orders of magnitude slower, storage used for older data. Once again, this is a nice concept but until SAP writes or adopts code to enable this, it is just a theory.

As to writes, most enterprise storage subsystems can deliver response times that are twice as fast as SAP requires. IBM SVC (SAN Volume Controller) connected to an IBM Power System has been tested in real customer installations and has delivered the fastest times of any storage subsystem in the industry with a peak latency of only 161us (microseconds) for 4K block size log writes as measured by HWCCT or over 6 times better latency than what SAP requires. SVC is part of a family of products including V7000, V9000 and Spectrum Virtualization Software which all utilize similar concepts and software.

In other words, you don’t have to wait for tomorrow to get fast restarts and minimized transactional log writes, you just need to select the write infrastructure partner, IBM.

[i] https://www.theregister.co.uk/2017/05/17/coming_xeon_sps_will_run_sap_hana_16_times_faster/

[ii] https://developers.redhat.com/blog/2016/12/05/configuring-and-using-persistent-memory-rhel-7-3/

June 12, 2017 Posted by Alfred Freudenberger | Uncategorized | 3d xpoint, HANA, intel, optane, persistent, SAP | 4 Comments

Email Subscription

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Email Address:

Join 435 other subscribers
Search for:
Recent Posts
Blog Stats
- 116,135 hits

Follow me on Twitter
My Tweets

SAPonPower

An ongoing discussion about SAP infrastructure