SAPonPower

An ongoing discussion about SAP infrastructure

Optane DC Persistent Memory – Proven, industrial strength or full of hype – Detail, part 3

In this final installment of a three-part series, we will explore the two other major “benefits” of Optane DIMMs: fast restart and TCO.

Fast restart

HANA, as an in-memory database, must be loaded into memory to perform well. Intel systems have, for years and apparently up to the present, suffered from a major bottleneck in their I/O subsystem. As a result, loading a single terabyte of data into memory could take 10 to 20 minutes in a best-case scenario. Anecdotally, some customers have remarked that placing superfast, all-flash subsystems, such as IBM’s FlashSystem 9100, behind an Intel HANA system resulted in little improvement in load times compared to mid-range SSD subsystems. For customers attempting to bring up a 10TB storage/20TB memory HANA system, this could result in load times measured in hours. As a result, a faster way of getting a HANA system up and running was sorely needed.

This did not appear to be a problem for customers using IBM’s Power Systems. Not only has Power delivered roughly twice the I/O bandwidth of Intel systems for years, but with POWER9, IBM introduced PCIe Gen4, further extending its leadership in this area. On Power, the bottleneck is actually in the storage subsystem and the number of paths that it can drive, not in the processor. To prove this, IBM ran a test with 10 NVMe cards in PCIe slots and was able to drive load speeds into HANA of almost 1TB/min.[i] In other words, to improve restart times, Power Systems customers need only move to faster subsystems and/or add more or faster paths.
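To put the two sets of figures side by side, here is a rough back-of-the-envelope sketch. It assumes load time scales linearly with data size and that the full 10TB on storage is what gets loaded; both are simplifications, and the "almost 1TB/min" result is rounded to exactly one minute per terabyte.

```python
def load_time_minutes(data_tb: float, minutes_per_tb: float) -> float:
    """Time to load data_tb terabytes, assuming linear scaling."""
    return data_tb * minutes_per_tb


DATA_TB = 10  # the 10TB storage footprint from the example above

# Intel best case quoted above: 10 to 20 minutes per terabyte.
intel_fast = load_time_minutes(DATA_TB, 10)
intel_slow = load_time_minutes(DATA_TB, 20)

# IBM NVMe test: almost 1TB/min, rounded here to 1 minute per terabyte.
power = load_time_minutes(DATA_TB, 1)

print(f"Intel: {intel_fast / 60:.1f} to {intel_slow / 60:.1f} hours")
print(f"Power: about {power:.0f} minutes")
```

Under those assumptions, the same 10TB load drops from roughly 1.7 to 3.3 hours to something on the order of 10 minutes.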

This suggests that Intel’s motivation for NVDIMMs may be to solve a problem of their own making. But this also raises a question about their understanding of HANA. If a customer is running a transactional workload such as Suite on HANA, S/4 or C/4, and is using HANA System Replication, wouldn’t at least one of the pair of nodes be available at all times? SAP supports near zero downtime upgrades[ii], so systems, firmware, OS or even HANA itself may be updated on one of the pair of nodes while the other continues to operate, followed by a synchronization of changed data and a controlled failover so that the first node might be updated. In this way, cold restarts of HANA, where a fast restart option might make a big difference, may be reduced to a very rare occurrence. In other words, wouldn’t this be a better option than degrading the performance of everything by using DIMMs that are radically slower than DRAM, as has been discussed in gory detail in the previous two posts of this series?

HANA also offers a quick restart option whereby HANA can be started and the database made available within minutes even though not all of the columns have yet been loaded into memory. Yes, performance will be pretty bad until all columns are loaded into memory, but for non-production systems and non-mission critical systems, this might be an acceptable option. Lastly, with HANA 2.0 SPS04, SAP now supports fast restart with conventional memory.[iii] This only works when the OS stays up and running, i.e. it can’t be used when the system, firmware or OS is being updated, but it can be used for the vast majority of required restarts, e.g. HANA upgrades, patches and restarts when a bounce of the HANA environment is needed. Though this is not mentioned in the help documentation, it may even be possible to patch the Linux kernel while using the fast restart option if SUSE SLES is used with its “Live Patching” function.[iv]

TCO

Optane DIMMs are less expensive than DRAM DIMMs. List prices appear to be about 40% lower when comparing same size DIMMs. Effective prices, however, may have a much smaller delta since there is competition for DRAM, meaning discounts may be much deeper than for the NVDIMMs from Intel, currently the only source. This comparison also assumes full utilization of those NVDIMMs, which may prove to be a drastically bad assumption. Sizing guidance from SAP[v] shows that the ratio of DRAM vs. PMEM (their term for NVDIMMs) capacity can be anything from 2:1 to 1:4, but it provides no guidance as to where a given workload might fall or what sort of performance impact might result. This means that a customer might purchase NVDIMMs with a capacity ratio of 1:2, e.g. 1TB DRAM:2TB PMEM, but might end up being able to utilize only 1TB or 512GB of PMEM due to negative performance results. In that case, the effective cost of the NVDIMMs would have instantly doubled or quadrupled and would, effectively, be more expensive than DRAM DIMMs.
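The arithmetic behind that claim can be sketched as follows. Prices are normalized so that DRAM list price is 1.0 per TB and PMEM is 0.6 (the "about 40%" list price difference); discounts are ignored, so this is an illustration at list price only, not a pricing model.

```python
DRAM_PRICE_PER_TB = 1.0  # normalized DRAM list price (assumption: list, no discounts)
PMEM_PRICE_PER_TB = 0.6  # "about 40%" cheaper at list, per the text


def effective_pmem_price(purchased_tb: float, usable_tb: float) -> float:
    """Price per usable TB of PMEM, relative to DRAM list price."""
    return PMEM_PRICE_PER_TB * purchased_tb / usable_tb


# 2TB of PMEM purchased; all, half, or a quarter of it turns out to be usable.
for usable_tb in (2.0, 1.0, 0.5):
    ratio = effective_pmem_price(2.0, usable_tb) / DRAM_PRICE_PER_TB
    print(f"{usable_tb}TB usable of 2TB: {ratio:.1f}x DRAM list price per TB")
```

At full utilization the ratio is 0.6x; at 1TB usable it is 1.2x and at 512GB usable it is 2.4x, i.e. more expensive per usable terabyte than DRAM even before any DRAM discount is applied.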

But let us assume the best rather than the worst.  Even if only a 2:1 ratio works relatively well, the cost of the NVDIMMs, if sized for that ratio, would be somewhat lower than the equivalent cost of DRAM DIMMs. The problem is that memory, while a significant portion of the cost of systems, is but one element in the overall TCO of a HANA landscape.  If reducing TCO is the goal, shouldn’t all options be considered?

Virtualization has been in heavy use by most customers for years, helping to drive up system utilization, resulting in the need for fewer systems, decreasing network and SAN ports, reducing floor space and power/cooling and, perhaps most importantly, reducing the cost of IT management. Unfortunately, few high end customers, other than those using IBM Power Systems, can take advantage of this technology in the HANA world due to the many reasons identified in the latest of many previous posts.[vi] Put another way, if a customer utilizes an industrial strength and proven virtualization solution for HANA, i.e. IBM PowerVM, they may be able to reduce TCO considerably[vii] and potentially by much more than the relatively small improvement due to NVDIMMs.

But if driving down memory costs is the only goal, there are a couple of ideas that are less radical than using NVDIMMs worth investigating.  Depending on RTO requirements, some workloads might need an HA option, but might not require it to be ready in minutes.  If this is the case, then a cold standby server running other workloads which could be killed in the event of a system outage could be utilized, e.g. QA, Dev, Test, Sandbox, Hadoop.  Since no incremental memory would be required, memory costs would be substantially lower than that required for System Replication, even if NVDIMMs are used. IBM offers a tool called VM Recovery Manager which can instrument and automate such a configuration.

Another option worth considering, only for non-production workloads, is a feature of IBM PowerVM called Memory Deduplication. After different VMs are started using “a shared memory pool”, the hypervisor builds a logical memory map. It then scans the pages of each VM looking for identical memory pages, at which time it uses the logical memory map to point each VM to the same real memory page, thereby freeing up the redundant memory pages for use by other workloads. If a page is subsequently changed by one of the VMs, the hypervisor simply recreates a unique real memory page for that VM. The upshot of this feature is that the total quantity of DRAM may be reduced substantially for workloads that are relatively static and have large amounts of duplication between them. The reason that this should not be used for production is that when the VMs start, the hypervisor has not yet had the chance to deduplicate the memory pages and, if the sum of logical memory of all VMs is larger than the total memory, paging will occur. This will subside over time and may be of little consequence to non-production workloads, but the risk to performance for production might be considered unacceptable and, besides, “Memory over-commitment must not be used” for production HANA according to SAP.
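The mechanism described above (a logical memory map, a scan for identical pages, and a private copy when a shared page is changed) can be illustrated with a toy model. This is only a sketch of the general technique, not PowerVM's implementation: the class, method names and page contents are all hypothetical, and a content hash stands in for whatever page comparison the hypervisor actually performs.

```python
import hashlib


class ToyHypervisor:
    """Toy model: each VM's logical pages map to (possibly shared) real pages."""

    def __init__(self) -> None:
        self.real_pages: dict[int, bytes] = {}           # real page id -> content
        self.page_map: dict[tuple[str, int], int] = {}   # (vm, logical page) -> real page id
        self._next_id = 0

    def _new_real_page(self, content: bytes) -> int:
        page_id = self._next_id
        self._next_id += 1
        self.real_pages[page_id] = content
        return page_id

    def load(self, vm: str, pages: list[bytes]) -> None:
        """At VM start, every logical page gets its own real page."""
        for n, content in enumerate(pages):
            self.page_map[(vm, n)] = self._new_real_page(content)

    def deduplicate(self) -> int:
        """Point identical pages at one real page; return the number freed."""
        first_seen: dict[bytes, int] = {}  # content digest -> real page kept
        for key, page_id in list(self.page_map.items()):
            digest = hashlib.sha256(self.real_pages[page_id]).digest()
            self.page_map[key] = first_seen.setdefault(digest, page_id)
        in_use = set(self.page_map.values())
        freed = [p for p in self.real_pages if p not in in_use]
        for p in freed:
            del self.real_pages[p]
        return len(freed)

    def write(self, vm: str, n: int, content: bytes) -> None:
        """A changed page gets a private real page again (copy-on-write)."""
        old = self.page_map[(vm, n)]
        self.page_map[(vm, n)] = self._new_real_page(content)
        if old not in self.page_map.values():
            del self.real_pages[old]


hv = ToyHypervisor()
hv.load("qa", [b"kernel", b"sap-binaries", b"qa-data"])
hv.load("dev", [b"kernel", b"sap-binaries", b"dev-data"])
print(len(hv.real_pages))   # 6 real pages before the scan
print(hv.deduplicate())     # 2 pages freed
print(len(hv.real_pages))   # 4 real pages after the scan
hv.write("dev", 0, b"patched-kernel")
print(len(hv.real_pages))   # 5: dev got a private page back
```

Note that all six pages are resident before the scan runs, which is exactly the window in which paging can occur if the VMs' logical memory exceeds real memory.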

Summary

Faster restarts than may be possible with traditional Intel systems may be achieved by using near zero HANA upgrades with System Replication, HANA fast restart or by switching to a system with a radically faster I/O subsystem, e.g. IBM Power Systems. TCO may be reduced with tried and proven virtualization technologies as provided with IBM PowerVM, cold standby systems or memory deduplication rather than experimenting with version 1.0 of a new technology with no track record, unknown reliability, poor guidance on sizing and potentially huge impacts to performance.

 

[i] https://www.ibm.com/downloads/cas/WQDZWBYJ

[ii] https://launchpad.support.sap.com/#/notes/1984882

[iii] https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.04/en-US/ce158d28135147f099b761f8b1ee43fc.html

[iv] https://launchpad.support.sap.com/#/notes/1984787

[v] https://launchpad.support.sap.com/#/notes/2786237

[vi] https://saponpower.wordpress.com/2018/09/26/vmware-pushes-past-4tb-sap-hana-limit/

[vii] https://www.ibm.com/downloads/cas/M7X2YXZD

June 3, 2019 | Uncategorized | 1 Comment

The top 3 things that SAP needs are memory, memory and I can’t remember the third. :-) A review of the IBM Power Systems announcements with a focus on the memory enhancements.

While this might not exactly be new news, it is worthwhile to consider the value of the latest Power Systems announcements for SAP workloads.  On October 12, 2011, IBM released a wide range of enhancements to the Power Systems family.  The ones that might have received the most publicity, not to mention new model numbers, were valuable but not the most important part of the announcement, from my point of view.  Yes, the new higher MHz Power 770 and 780 and the ability to order a 780 with 2 chips per socket thereby allowing the system to grow to 96 cores were certainly very welcome additions to the family.  Especially nice was that the 3.3 GHz processors in the new MMC model of the 770 came in at the same price as the 3.1 GHz processors in the previous MMB model.  So, 6.5% more performance at no additional cost.

For SAP, however, raw performance often plays second fiddle to memory. The old rule is that for SAP workloads, we run out of memory long before we run out of CPU. IBM started to address this issue in 2010 with the announcement of the Active Memory Expansion (AME) feature of POWER7 systems. This feature allows for dynamic compression/decompression of memory pages, thereby making memory appear to be larger than it really is. The administrator of a system can select the target “expansion” and the system will then build a “compressed” pool in memory into which pages are compressed and placed, starting with the less frequently accessed pages and moving toward the more frequently accessed ones. As pages are touched, they are uncompressed and moved into the regular memory pool from which they are accessed normally. Applications run unchanged as AIX performs all of the moves without any interaction or awareness required by the application. The point at which response time or throughput starts to degrade, or a large amount of CPU overhead starts to occur, is the “knee of the curve”, i.e. slightly higher than the point at which the expansion should be set. A tool, called AMEPAT, allows the administrator to “model” the workload prior to turning AME on, or for that matter on older hardware as long as the OS level is AIX 6.1 TL4 SP2 or later.

Some workloads will see more benefit than others. For instance, during internal tests run by IBM, the 2-tier SD benchmark showed outstanding opportunities for compression and hit 111% expansion, e.g. 10GB of real memory appears to be 21GB to the application, before response time or throughput showed any negative effect from the compression/decompression activity. During testing of a retail BW workload, 160% expansion was reached. Even database workloads tend to benefit from AME. DB2 databases, which already feature outstanding compression, have seen another 30% or 40% expansion. The reason for this difference comes from the different approaches to compression. In DB2, if 1,000 residences or businesses have an address on Main Street, Austin, Texas (had to pick a city so selected my own), DB2 replaces Main Street, Austin, Texas in each row with a pointer to another table that has a single row entitled Main Street, Austin, Texas. AME, by comparison, is more of an inline compression, e.g. if it sees a repeating pattern, it can replace that pattern with a symbol that represents the pattern and how often it repeats. Oracle recently announced that they would also support AME. The amount of expansion with AME will likely vary from something close to DB2, if Oracle Advanced Compression is used, to significantly higher if Advanced Compression is not used, since many more opportunities for compression will likely exist.
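The difference between the two approaches can be illustrated with a pair of toy encoders. These are generic textbook techniques (dictionary encoding and run-length encoding) chosen to mirror the description above; they are not the actual algorithms used by DB2 or AME, and the second street name is made up for the example.

```python
from itertools import groupby


def dictionary_encode(rows: list[str]) -> tuple[list[str], list[int]]:
    """Store each distinct value once and replace rows with pointers to it."""
    table: list[str] = []
    index: dict[str, int] = {}
    pointers: list[int] = []
    for value in rows:
        if value not in index:
            index[value] = len(table)
            table.append(value)
        pointers.append(index[value])
    return table, pointers


def run_length_encode(data: bytes) -> list[tuple[int, int]]:
    """Replace each run of a repeated byte with (byte value, run length)."""
    return [(byte, len(list(run))) for byte, run in groupby(data)]


# Row-level repetition: the same address appears in many rows.
rows = ["Main Street, Austin, Texas"] * 3 + ["Elm Street, Austin, Texas"]
print(dictionary_encode(rows))

# Inline repetition: the same byte repeats within a memory page.
print(run_length_encode(b"\x00\x00\x00\x00\xff\xff"))  # [(0, 4), (255, 2)]
```

Because the two techniques look for different kinds of redundancy, data that has already been dictionary-compressed by the database can still contain repeating patterns that an inline compressor can exploit, which is consistent with the additional expansion reported above.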

So, AME can help SAP workloads close the capacity gap between memory and CPU. Another way to view this is that this technology can decrease the cost of Power Systems by either allowing customers to purchase less memory or to place more workloads on the same system, thereby driving up utilization and decreasing the cost per workload. It is worthwhile to note that many x86 systems have also tried to address this gap, but as none offer anything even remotely close to AME, they have instead resorted to more DIMM slots. While this is a good solution, it should be noted that twice the number of DIMMs requires twice the amount of power and cooling and suffers from twice the failures, i.e. TANSTAAFL: there ain’t no such thing as a free lunch.

In the latest announcements, IBM introduced support for the new 32GB DIMMs. This effectively doubled the maximum memory on most models, from the 710 through the 795. Combined with AME, this decreases or eliminates the gap between memory capacity and CPU and makes these models even more cost effective since more workloads can share the same hardware. Two other systems received similar enhancements recently, but these were not part of the formal announcement. The two latest blades in the Power Systems portfolio, the PS703 and the PS704, were announced earlier this year with twice the number of cores but the same memory as the PS701 and PS702 respectively. Now, using 16GB DIMMs, the PS703/PS704 can support up to 256GB/512GB of memory, making these blades very respectable, especially for application server workloads. Add to that, with the Systems Director Management Console (SDMC), AME can be implemented for blades, allowing for even more effective memory per blade. Combined, these enhancements have closed the price difference even further compared to similar x86 blades.
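To see what an expansion percentage means in terms of effective memory, the conversion is simply real memory times one plus the expansion. The sketch below reproduces the SD benchmark figure quoted earlier and then applies the 30% to 40% range to the blade capacities purely as a hypothetical illustration; actual expansion depends entirely on the workload and should be modeled first.

```python
def effective_memory_gb(real_gb: float, expansion_pct: float) -> float:
    """Memory as seen by the application for a given AME expansion percentage."""
    return real_gb * (1 + expansion_pct / 100)


# SD benchmark example from earlier in the post: 10GB at 111% expansion.
print(f"{effective_memory_gb(10, 111):.1f} GB")   # 21.1 GB

# Hypothetical: blade maximums with an assumed 30% or 40% expansion.
print(f"{effective_memory_gb(256, 30):.1f} GB")   # PS703 at 30%
print(f"{effective_memory_gb(512, 40):.1f} GB")   # PS704 at 40%
```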

One last memory related announcement may have been largely overlooked by many because it involved an enhancement to the Active Memory Sharing (AMS) feature of PowerVM. AMS has historically been a technology that allowed for overcommitment of memory. While CPU overcommitment is now routine, memory overcommitment means that some percentage of memory pages will have to be paged out to solid state or other types of disk. The performance penalty is well understood, making this not appropriate for production workloads but potentially beneficial for many other non-prod, HA or DR workloads. That said, few SAP customers have implemented this technology due to the complexity and performance variability that can result. The new announcement introduces Active Memory™ Deduplication for AMS implementations. Using this new technology, PowerVM will scan partitions after they finish booting and locate identical pages within and across all partitions on the system. When identical pages are detected, all copies, except one, will be removed and all memory references will point to the same “first copy” of the page. Since PowerVM is doing this, even the OSs can be unaware of this action. Instead, as this post processing proceeds, the PowerVM free memory counter will increase until a steady state has been reached. Once enough memory is freed up in this manner, new partitions may be started. It is quite easy to imagine that a large number of pages are duplicates, e.g. each instance of an OS has many read only pages which are identical and multiple instances of an application, e.g. SAP app servers, will likewise have executable pages which are identical. The expectation is that another 30% to 40% effective memory expansion will occur for many workloads using this new technology.
One caveat, however: since the scan runs after a partition boots, operationally it will be important to have a phased booting schedule to allow the dedupe process to free up pages prior to starting more partitions, thereby avoiding the possibility of paging. Early testing suggests that the dedupe process should arrive at a steady state approximately 20 minutes after partitions are booted.
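A phased booting schedule of this kind can be sketched as a simple planning exercise. The function below is a hypothetical illustration, not an IBM tool: it assumes each boot wave is left to reach its dedupe steady state before the next wave starts, and that a fixed fraction of each wave's footprint (`dedup_savings`, an invented parameter) is reclaimed by then.

```python
def phased_boot_plan(
    physical_gb: float,
    partitions: list[tuple[str, float]],
    dedup_savings: float = 0.25,
) -> list[list[str]]:
    """Group partitions into boot waves that never exceed free real memory.

    partitions is a list of (name, logical memory in GB). Between waves,
    dedupe is assumed to reclaim dedup_savings of the previous wave's footprint.
    """
    waves: list[list[tuple[str, float]]] = []
    current: list[tuple[str, float]] = []
    free = physical_gb
    for name, logical_gb in partitions:
        if logical_gb > free and current:
            # Close the wave and wait for dedupe to free up pages.
            waves.append(current)
            free += sum(gb for _, gb in current) * dedup_savings
            current = []
        if logical_gb > free:
            raise ValueError(f"{name} does not fit even after deduplication")
        current.append((name, logical_gb))
        free -= logical_gb
    if current:
        waves.append(current)
    return [[name for name, _ in wave] for wave in waves]


plan = phased_boot_plan(
    100, [("app1", 40), ("app2", 40), ("app3", 30), ("app4", 10)]
)
print(plan)  # [['app1', 'app2'], ['app3', 'app4']]
```

With the roughly 20 minute steady state mentioned above, each wave in such a plan would be started about 20 minutes after the previous one.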

The bottom line is that with the larger DIMMs, AME and AMS Memory Deduplication, IBM Power Systems are in a great position to allow customers to fully exploit the CPU power of these systems by combining even more workloads together on fewer servers. This will effectively drive down the TCA for customers and remove what little difference there might be between Power Systems and systems from various x86 vendors.

November 29, 2011 | Uncategorized | 4 Comments