SAPonPower

An ongoing discussion about SAP infrastructure

Optane DC Persistent Memory – Proven, industrial strength or full of hype – Detail, part 3

In this final of a three part series, we will explore the two other major “benefits” of Optane DIMMs: fast restart and TCO.

Fast restart

HANA, as an in-memory database, must be loaded into memory to perform well.  Intel, for years and, apparently up to current times, has suffered with a major bottleneck in its I/O subsystem.  As a result, loading a single terabyte of data into memory could take 10 to 20 minutes in a best-case scenario.  Anecdotally, some customers have remarked that placing superfast, all flash subsystems, such as IBM’s FlashSystem 9100, behind an Intel HANA system resulted in little improvement in load times compared to mid-range SSD subsystems.  For customers attempting to bring up a 10TB storage/20TB memory HANA system, this could result in load times measured in hours.  As a result, a faster way of getting a HANA system up and running was sorely needed.

This did not appear to be a problem for customers using IBM’s Power Systems.  Not only has Power delivered roughly twice the I/O bandwidth of Intel systems for years, but with POWER9, IBM introduced PCIe Gen4, further extending their leadership in this area.  The bottleneck is actually in the storage subsystem and number of paths that it can drive, not in the processor.  To prove this, IBM ran a test with 10 NVMe cards in PCIe slots and was able to drive load speeds into HANA of almost 1TB/min.[I].  In other words, to improve restart times, Power Systems customers need only move to faster subsystems and/or add more or faster paths.

This suggests that Intel’s motivation for NVDIMMs may be to solve a problem of their own making.  But this also raises a question of their understanding of HANA.  If a customer is running a transactional workload such as Suite on HANA, S/4 or C/4, and is using HANA System Replication, wouldn’t at least one of the pair of nodes be available at all times?  SAP supports near zero upgrades[ii], so systems, firmware, OS or even HANA itself may be updated on one of the pair of nodes while the other continues to operate, followed by a synchronization of changed data and a controlled failover so that the first node might be updated.  In this way, cold restarts of HANA, where a fast restart option might make a big difference, may be driven down into a very rare occurrence.  In other words, wouldn’t this be a better option than causing poor performance to everything due to radically slower DIMMs compared to DRAM as has been discussed in gory detail on the previous two posts of this series?

HANA also offers a quick restart option whereby HANA can be started and the database made available within minutes even though all of the columns have not yet been loaded into memory. Yes, performance will be pretty bad until all columns are loaded into memory, but for non-production systems and non-mission critical systems, this might be an acceptable option.  Lastly, with HANA 2.0 SPS04, SAP now supports fast restart with conventional memory.[iii]  This only works when the OS stays up and running, i.e. can’t be used when the system, firmware or OS is being updated, but this can be used for the vast majority of required restarts, e.g. HANA upgrades, patches and restarts when a bounce of the HANA environment is needed.  Though this is not mentioned in the help documentation, it may even be possible to patch the Linux kernel while using the fast restart option if SUSE SLES is used with their “Live Patching” function.[iv]

TCO

Optane DIMMs are less expensive than DRAM DIMMs.  List prices appears to be about 40% cheaper when comparing same size DIMMs.  Effective prices, however, may have a much smaller delta since there exists competition for DRAM meaning discounts may be much deeper than for the NVDIMMs from Intel, currently the only source.  This assumes full utilization of those NVDIMMs which may prove to be a drastically bad assumption.  Sizing guidance from SAP[v]shows that the ratio of DRAM vs. PMEM (their term for NVDIMMs) capacity can be anything from 2:1 to 1:4, but it provides no guidance as to where a given workload might fall or what sort of performance impact might result.  This means that a customer might purchase NVDIMMs with a capacity ratio of 1:2, e.g. 1TB DRAM:2TB PMEM, but might end up only being able to utilize only 512GB or 1TB PMEM due to negative performance results.  In that case, the cost of effective NVDIMMs would have instantly doubled or quadrupled and would, effectively, be more expensive than DRAM DIMMs.

But let us assume the best rather than the worst.  Even if only a 2:1 ratio works relatively well, the cost of the NVDIMMs, if sized for that ratio, would be somewhat lower than the equivalent cost of DRAM DIMMs. The problem is that memory, while a significant portion of the cost of systems, is but one element in the overall TCO of a HANA landscape.  If reducing TCO is the goal, shouldn’t all options be considered?

Virtualization has been in heavy use by most customers for years helping to drive up system utilization resulting in the need for fewer systems, decreasing network and SAN ports, reducing floor space and power/cooling and, perhaps most importantly, reducing the cost of IT management.  Unfortunately, few high end customers, other than those using IBM Power Systems can take advantage of this technology in the HANA world due to the many reasons identified in the latest of many previous posts.[vi]  Put another way, if a customer utilizes an industrial strength and proven virtualization solution for HANA, i.e. IBM PowerVM, they may be able to reduce TCO considerably[vii]and potentially much more than the relatively small improvement due to NVDIMMs.

But if driving down memory costs is the only goal, there are a couple of ideas that are less radical than using NVDIMMs worth investigating.  Depending on RTO requirements, some workloads might need an HA option, but might not require it to be ready in minutes.  If this is the case, then a cold standby server running other workloads which could be killed in the event of a system outage could be utilized, e.g. QA, Dev, Test, Sandbox, Hadoop.  Since no incremental memory would be required, memory costs would be substantially lower than that required for System Replication, even if NVDIMMs are used. IBM offers a tool called VM Recovery Manager which can instrument and automate such a configuration.

Another option worth considering, only for non-production workloads, is a feature of IBM PowerVM called Memory Deduplication.  After different VMs are started using “a shared memory pool”, the hypervisor builds a logical memory map.  It then scans the pages of each VM looking for identical memory pages at which time it uses the logical memory map to point each VM to the same real memory page thereby freeing up the redundant memory pages for use by other workloads.  If a page is subsequently changed by one of the VMs, the hypervisor simply recreates a unique real memory page for that VM. The upshot of this feature is that the total quantity of DRAM memory may be reduced substantially for workloads that are relatively static and have large amounts of duplication between them. The reason that this should not be used for production is because when the VMs start, the hypervisor has not yet had the chance to deduplicate the memory pages and, if the sum of logical memory of all VMs is larger than the total memory, paging will occur.  This will subside over time and may be of little consequence to non-production workloads, but the risk to performance for production might be considered unacceptable and, besides, “Memory over-commitment must not be used” for production HANA according to SAP.

Summary

Faster restarts than may be possible with traditional Intel systems may be achieved by using near zero HANA upgrades with System Replication, HANA fast restart or by switching to a system with a radically faster I/O subsystem, e.g. IBM Power Systems. TCO may be reduced with tried and proven virtualization technologies as provided with IBM PowerVM, cold standby systems or memory deduplication rather than experimenting with version 1.0 of a new technology with no track record, unknown reliability, poor guidance on sizing and potentially huge impacts to performance.

 

[i]https://www.ibm.com/downloads/cas/WQDZWBYJ

[ii]https://launchpad.support.sap.com/#/notes/1984882

[iii]https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/2.0.04/en-US/ce158d28135147f099b761f8b1ee43fc.html

[iv]https://launchpad.support.sap.com/#/notes/1984787

[v]https://launchpad.support.sap.com/#/notes/2786237

[vi]https://saponpower.wordpress.com/2018/09/26/vmware-pushes-past-4tb-sap-hana-limit/

[vii]https://www.ibm.com/downloads/cas/M7X2YXZD

Advertisements

June 3, 2019 Posted by | Uncategorized | , , , , , , , , , , , , , , | 1 Comment

Optane DC Persistent Memory – Proven, industrial strength or full of hype – Detail, part 2

If the performance considerations from part 1 were the only issues, a reasonable case could be made for the potential value of doing a PoC with this technology.  But, of course, those are not the only issues.  One of the reasons that NVDIMMs have longer latencies than DRAM is due to their persistence and therefore the need to encrypt data placed on these components.  Encryption and decryption take a lot of computational power and can have a substantial impact on latency and bandwidth.  The funny thing is that encryption of these NVDIMMs can be turned off if desired, presumably with a resulting improvement to performance.  But what kind of customer would be willing to turn off this vital security technology?

Another desirable trait of modern, in-memory platforms is advanced memory protection which allows a system to continue to operate in the event of a DIMM failure.  This often starts with basic ECC, but then progresses to SDDC, DDDC (Chipkill or Lockstep), ADDDC (Skylake and beyond only) and IBM’s unique Chipkill + chip sparing technology.  ADDDC is not available for NVDIMMs, but DDDC is.  The downside of DDDC is that it comes with a significant performance penalty. No performance numbers have been provided for NVDIMMs configured with DDDC, but previous generations saw 20% to 40% degradation when using this mode.[i][ii]

What kind of customer would be willing to disable key security features or run critical systems without the best available reliability technologies?  I would certainly advise customers to use encryption and advanced reliability technologies in most circumstances.  Only those customers that can scramble business critical, PII and/or HIPAA data should ever consider disabling persistent memory encryption.  I searched, using every option that I could imagine, and failed to find a single web site that recommended ever disabling NVDIMM encryption.

SAP Benchmarks results posted on the external web site do not show the details of how security and reliability configuration parameters have been set.  It is therefore impossible to say whether HPE enabled or disabled these protection features.  In my many years of experience and extensive discussion with benchmarking experts, I can share that every single one, at every vendor, used every tool or technology that did not violate official rules to enhance results.  It would not be too much of a leap to project that HPE, and other vendors posting results with NVDIMMs, have likely disabled anything that might cause their results to diminish in any way.  (HPE, if you would like to share your configuration details, I would be happy to post them and if I have mischaracterized how you ran these benchmarks, will also post a retraction.) As a result, these BWH results may not only have relevance to only a small subset of the potential workloads but may also represent an unacceptable exposure to any company that has high single system availability requirements or has one of those unreasonable security departments which thinks that data protection is actually worthwhile.

And then, there are OLTP customers.  Based on the lack of benchmark testing of Suite on HANA, S/4HANA or C/4HANA combined with the above data from Lenovo about the massive reduction of bandwidth and associated huge increase in latency for OLTP, it would be MOST unwise to place any of these types of environments on systems with NVDIMMs without extensive testing of real customer workloads to ensure that internal performance SLAs can be met.

Certain types of workloads may perform decently with NVDIMMs.  BW environments where the primary use is for predictable and repeatable queries and reports may see only moderate performance degradation compared to DRAM based systems, but still orders of magnitude better performance that AnyDB systems which merely cache recently used data in memory and keep most data on external storage.  BW Extension nodes, S/4 Data aging objects and other types of archival systems that take older, less frequently used data and place them on other tiers of storage or systems, could certainly benefit from NVDIMMs.  Non-prod workloads which are not in the critical path to production, e.g. dev, test, sandbox, might make sense to place on systems with NVDIMMs.  All of these depend on an acceptance of potential performance issues and hardware/firmware/software fixes that inevitably come once customers start playing with version 1.0 of any new technology.

Based on likely performance issues, inferior RAS technology and the above mentioned “fix” dilemma, I would strongly advise that critical systems like production, QA, pre-prod, HA and DR should stay on DRAM based systems until bleeding edge customers prove the value of NVDIMMs and are willing to publicly share their journey.

The question then becomes whether the benefit to a subset of the environments are so substantial that it makes sense to select a vendor for HANA systems based on their ability to utilize NVDIMMs even when this technology might not be used for the most critical of the workloads and their associated critical path and HA/DR systems. This gets into the subjects of cost reduction and restart speeds which will be covered in part 3 of this series.

[i]https://lenovopress.com/lp0048.pdf

[ii]https://sp.ts.fujitsu.com/dmsp/Publications/public/wp-broadwell-ex-memory-performance-ww-en.pdf

May 27, 2019 Posted by | Uncategorized | , , , , , , , , , , , | Leave a comment

Optane DC Persistent Memory – Proven, industrial strength or full of hype – Detail, part 1

Several non-Intel sites suggest that Intel’s storage class memory (Lenovo abbreviates these as DCPMM, while many others refer to them with the more generic term NVDIMM) delivers a read latency of roughly 5 times slower than DRAM, e.g. 350 nanoseconds for NVDIMM vs. 70 nanoseconds for DRAM.[i]  A much better analysis comes from Lenovo which examined a variety of load conditions and published their results in a white paper.[ii]  Here are some of the results:

  • A fully populated 6x DCPMM socket could deliver up to 40GB/s read throughput, 15GB/s write
  • Each additional pair of DCPMMs delivered proportional increases in throughput
  • Random reads had a load to use latency that was roughly 50% higher than sequential reads
  • Random reads had a max per socket (6x DCPMM) throughput that was between 10 and 13GB/s compared to 40 to 45GB/s for sequential reads

The most interesting quote from this section was: “Overall, workloads that are more read intensive and sequential in nature will see the best performance.”  This echoes the quote from SAP’s NVRAM white paper: “From the perspective (of) read accesses, sequential scans fare better in NVRAM than point reads: the cache line pre-fetch is expected to mitigate the higher latency.[iii]

The next section is even more interesting.  Some of its results comparing the performance differences of DRAM to DCPMM were:

  • Almost 3x better max sequential read bandwidth
  • Over 5x better max random read bandwidth
  • Over 5x better max sequential 2:1 R/W bandwidth
  • Over 8x better max random 2:1 R/W bandwidth
  • Latencies for DCPMM in the random 2:1 R/W test hit a severe knee of the curve and showed max latencies over 8x that of DRAM at very light bandwidth loads
  • DRAM, by comparison, continued to deliver significantly increasing bandwidth with only a small amount of latency degradation until it hit a knee of the curve at over 10x of the max DCPMM bandwidth

Unfortunately, this is not a direct indication of how an application like HANA might perform.  For that, we have to look at available benchmarks. To date, none of the SD benchmarks have utilized NVDIMMs.  Lenovo published a couple of BWH results, one with and one without NVDIMMs, but used different numbers of records, so they are not directly comparable.  HPE, on the other hand, published a couple of BWH results using the exact same systems and numbers of records.[iv]  Remarkably, only a small, 6% performance degradation, going from an all DRAM 3TB configuration to a mixed 768GB/3TB NVDIMM configuration occurred in the parallel query execution phase of the benchmark.  The exact configuration is not shown on the public web site, but we can assume something about the config based on SAP Note: 2700084 – FAQ: SAP HANA Persistent Memory: To achieve highest memory performance, all DIMM slots have to be used in pairs of DRAM DIMMs and persistent memory DIMMs, i.e. the system must be equipped with one DRAM DIMM and one NVDIMM in each memory channel.”  Vendors submitting benchmark results do not have to follow these guidelines, but if HPE did, then they used 24@32GB DRAM DIMMs and 24@128TB NVDIMMs.  Also, following other guidelines in the same SAP Note and the SAP HANA Administration Guide, HPE most likely placed the column store on NVDIMMS with row store, caches, intermediate and final results calculations on DRAM DIMMs.

BWH is a benchmark composed of 1.3 billion records which can easily be loaded into a 1TB system with room to spare.  To achieve larger configurations, vendors can load the same 1.3B records a second, third or more times, which HPE did a total of 5 times to get to 6.5B records.  The column compression dictionary tables, only grow with unique data, i.e. do not grow when you repeat the same data set regardless of the number of times it is added.

BWH includes 3 phases, a load phase which represents data ingestion from ERP, a parallel query phase and a sequential, single user complex query phase.  Some have focused on the ingestion and complex query phases, because they show the most degradation in performance vs. DRAM.  While that is tempting, I believe the parallel query phase is of the most relevance.  During this phase, 385 queries of low, medium and high complexity (no clue as to how SAP defines those complexities, what their SQL looks like or how many of each type are included) are run, in parallel and randomly.  After an hour, the total count of queries completed is reported. In theory, the larger the database, the fewer the queries that could be run per hour as each query would have more data to traverse.  However, that is not what we see in these results.

Lenovo, once again, provides the best insights here.  With Skylake processors, they reported two results.  On the first, they loaded 1.3B records, on the second 5.2B records or 4 times the number of rows with only twice the memory.  One might predict that queries per hour would be 4 times or more worse considering the non-proportionate increase in memory.  The results, however, show only a little over 2x decrease in Query/hr. Dell reported a similar set of results, this time with Cascade Lake, also with only real memory and also only around 2x decrease in Query/hr for 4X larger number of records.

What does that tell us? It is impossible to say for sure. From the SAP NVRAM white paper referenced earlier, “One can observe that some of the queries are more sensitive to the latency of the persistent memory than others. This can be explained by multiple factors:

  1. Does the query exhibit a memory access pattern that can easily prefetch by the hardware
  2. prefetchers? Is the working set of queries small enough to fit in CPU
  3. cache and hence agnostic to persistent memory latency? Is processing of the query compute or latency bound?”

SAP stores results in the “Static Cache”. “The static result cache is particularly helpful in the following scenario:  Complex query based on a view; Rather small result set; Limited amount of changes in the underlying tables.  The static result cache can provide the following advantages: Reduction of CPU consumption; Reduction of SAP HANA thread utilization; Performance improvements[v]

Other areas like delta storage, caches, intermediate result sets or row store remain solely in dynamic RAM (DRAM) is usually stored in DRAM, not NVDIMMs.[vi]

The data in BWH is completely static.  Some queries are complex and presumably based on views.   Since the same queries execute over and over again, prefetchers may become especially effective.  It may be possible that some or many of the 385 queries in BWH may be hitting the results cache in DRAM.  In other words, after the first set of queries run, a decent percentage of accesses may be hitting only the DRAM portion of memory, masking much of the latency and bandwidth issues of NVRAM.  In other words, this benchmark may actually be testing CPU power against a set of results cached in working memory more than actual query speed against column store.

So, let us now consider the HPE benchmark with NVDIMMs.  On the surface, 6% degradation with NVDIMMs vs. all DRAM seems improbable considering NVDIMM higher latency/lower bandwidth.  But after considering the above caching, repetitive data and repeating query set, it should not be much of a shock that this sort of benchmark could be masking the real performance effects.  Then we should consider the quote from Lenovo’s white paper above which said that NVDIMMs are a great technology for read intensive, sequential workloads.

Taken together, while not definitive, we can deduce that a real workload using more varied and random reads, against a non-repeating set of records might see a substantially different query throughput than demonstrated by this benchmark.

Believe it or not, there is even more detail on this subject, which will be the focus of a part 2 post.

 

[i]https://www.pcper.com/news/Storage/Intels-Optane-DC-Persistent-Memory-DIMMs-Push-Latency-Closer-DRAM

[ii]https://lenovopress.com/lp1083.pdf

[iii]http://www.vldb.org/pvldb/vol10/p1754-andrei.pdf

[iv]https://www.sap.com/dmc/exp/2018-benchmark-directory/#/bwh

[v]https://launchpad.support.sap.com/#/notes/2336344

[vi]https://launchpad.support.sap.com/#/notes/2700084

May 20, 2019 Posted by | Uncategorized | , , , , , , , , , , , | Leave a comment

Persistent Memory for HANA @ SapphireNow Orlando 2018

Once again, Intel and the companies that utilize their processors were all abuzz at Sapphire about Intel Optane DC Persistent Memory (PMEM).  This is the second year in a row that they have been touting this future technology and its ability to fit into a DIMM form factor and take the place of some of the main memory currently supply by DRAM. I was intrigued until I saw Hasso Plattner, at SapphireNow 2018 Orlando, explain how HANA would utilize this technology.  He showed a chart where a 6TB HANA DB startup time of 50 minutes reduced to 4 min with a 50/50 mix of standard DRAM DIMMs and the new Intel PMEM DIMMs.  As he explained it, HANA column store would reside in PMEM, while working space and delta store tables would reside in DRAM.

50 minutes down 4 minutes sounds outstanding, but let’s see if we can pull back the veil a bit.

Who created the chart?  Dr. Plattner was vague about this. He suggested that it might be from an internal test.  When I asked multiple vendors, including Intel, if any parts were available for customer testing, I was told no and that it would require Cascade Lake, Intel’s next version of the current Skylake chips, to drive these new DIMMs.  I suspect that Dr. Plattner was referring to the 50 minutes as being from an internal test.  This means that the 4 minute projection with PMEM may have come from another source, e.g. Intel.

Why 50 minutes?  It might seem reasonable to assume that if this was an internal test, that SAP knows how to configure a system properly, so it was probably using best of breed SSD technology, e.g. Intel’s SSD 750.  50 minutes works out to roughly 122GB/min after HANA SW load.  IBM published a white paper in which Power Systems achieved approximately 172GB/min (30% faster) with a typical mid-range SSD subsystem and almost 1 TB/min with NVMe based SSDs, i.e. 740% faster.[i]  In other words, if 50 minutes for 6TB is longer than acceptable, Power Systems can already deliver radically faster startup time without using esoteric and untested memory concepts.

For the Intel world, getting down from 50 minutes to 4 minutes would be quite a feat, but how often is this sort of restart likely to happen?  Assuming an SAP client is not using one of the near-zero maintenance options, this depends on the frequency of such updates but most typically a couple of times per year.  More often, for Intel customers, predictive failure analysis on memory will call out memory DIMMs for replacement once or twice a month, so more frequent reboots may be required.  Of course, using a more reliable memory technology such as that offered in Power Systems could alleviate this requirement.  It is ironic that the use of unreliable DRAM memory options in x86 systems could be the very cause of why faster restarts are needed.

Speaking of reliability, is PMEM RAID protected like disk?  The answer, based on what has been published so far, appears to be no.  In other words, if a PMEM DIMM were to fail, not only could this cause the system to fail, but since this would result in an incomplete or corrupted memory image, reload from the storage subsystem would still be required.  Even more irony that the fast restart functionality of PMEM would be of no use when PMEM itself is the cause of the outage.  Also, this would be the first commercial use of this technology. Good thing that Version 1.0 of anything usually works perfectly!

Next, let us consider the effect of having column store in PMEM.  If SAP has said it once, they have said it 1,000 times, the “H” in “HANA” refers to “High Performance”.  If you slow down access to column store by a factor of 5x or 10x, you get a cascading effect on just about every possible KPI in the system.  Wait, did I say 5x or 10x?  I hate it when I have to resort to quoting the source: “Intel senior vice president Rob Crooke and Micron CEO Mark Durcan declared 3D Xpoint to be 1,000 times faster and 1,000 times more durable than NAND[ii] and third party reviews: “latencies should push down into the 1-3us range, splitting the difference between current generation DRAM (~80-100ns) and PCIe-based Optane parts (~10us)”[iii] or “As an NVDIMM, 3D XPoint memory would have approximately 20% of the speed of standard volatile DRAM.”[iv]

I just get a chuckle out of Intel’s official comment: “Unlike traditional DRAM, Intel Optane DC persistent memory will offer the unprecedented combination of high-capacity, affordability and persistence,” Lisa Spelman. Notice, Lisa does not say High Performance … good thing that is not a goal of HANA!

Ok, enough of the facts and other analysts comments, we want speculation!  Got it. Let us speculate on what happens when memory is 5x slower (10x is just twice as bad).  Let us also assume that the 5x or 10x slower is true and that Linux does not utilize a pseudo memory mapped file system (which it does with additional overhead).

In a highly optimized and largely hypothetical world, we will only have analytics using HANA.  (Yes, I get it, that is BW or a data lake, not S/4HANA or C/4HANA but I get to determine the parameters of this made-up world.)  Let’s consider an overly simplistic query example, e.g.  select customer where revenue > 100000.  The first 64 byte block of data movement to L3 cache only takes 1us, which at current Skylake 2.5GHz speeds means a wait of 2,500 processor cycles.  For sake of argument, we are going to assume no additional latency getting into the processor, no ccNUMA effects or any other delays.  The good news is that modern architectures will predict the next access and start loading subsequent data blocks, while query processing of the first block is occurring.  Unfortunately, since the DIMM is already busy, this preload has to wait 2,500 processor cycles before it can start transferring the next block of data and 2,500 processor cycles is usually more than sufficient to start and finish any portion of the work against only 64 bytes of data.  It is really hard to imagine how queries speed will not be significantly affected by this additional latency.  Imaging taking a current HANA BW query that runs in 10 seconds and telling the users to now expect the same result in 50 or 100 seconds.  Can you imagine the revolt?

Or consider transaction processing. A typical transaction might require access to data with no-preload possible since these accesses are usually random.  So, this access gets a 5x or 10x delay which is radically faster than disk access but much slower than previously experienced with in-DRAM computing?  The trouble is that while the transaction incurs that penalty, the processor is just sitting there waiting since this is “main memory” which means that it does not issue a query and wait for an I/O interrupt to remind it to continue processing.  So, this access and every access behind it waits 2,500 cycles before continuing, assuming that everything that is required comes across in that first 64 byte chunk. Unfortunately, a transaction accesses data in rows, not in columns which means that each row transaction may involve dozens of individual column accesses, each of which will experience a 5x or 10x delay.  Now, extend that to intensely random operations such as delta merge where, instead of a sub-second interruption to individual transaction response time, there is now 5x or 10x increase on all related columnar memory writes.  I could continue to extrapolations in save points, batch and external interfaces, but you get the general idea.  At currently projected speeds, this sort of slow down in transactional performance could result in project failure.

One other point that must not be overlooked is Intel’s claim about density.  While the initial press suggested up to 10x greater density, the DIMM specifications that are currently circulating show up to 512GB per DIMM, a significant increase from the 128GB DIMM max size today (4x not 10x).  But can HANA take advantage of that increased density? Prior to Skylake, SAP certified appliances with 8-sockets could only support 8TB of memory despite many having configuration maximums of 12TB.  SAP certifications are dependent on meeting performance KPIs and there has always been a pretty direct correlation between numbers of sockets, performance per socket and amount of memory supported.  In other words, it takes more and faster cores to support more memory.  So, is it reasonable to expect that SAP will discard those KPIs and accept 5x or 10x slower speeds while also jamming 4 times as much memory per socket as is currently supported?

This is not to say that persistent memory has no place in the HANA world.  There are many places in which a 5x or 10x memory penalty is worthwhile.  Consider the case of a non-prod instance, e.g. test. If it takes 5x or 10x longer, there is little impact to the business operations of most companies, just an increase in the cost of IT and applications professionals.  This may be offset against the cost of memory and, in some cases, the math may work.  How about HA or DR? No, that does not work as HA and DR must operate like production in the case of a failure or disaster.  Certainly aged data that might otherwise reside on disk would see a radical improvement from PMEM or a radically lower cost when compared to DRAM memory in  BW extension nodes.

Also, consider that aggressive research is occurring in this field and that future technologies may reduce the penalty to only 2x the speed of DRAM.  Would that be close enough to make it worthwhile?

One final thought: The co-inventor of 3D Xpoint memory is Micron. Earlier this year, Micron and Intel decided to go their separate ways with Micron using this technology in their QuantX solutions.[v]  Micron is a member of the OpenPower Consortium.  Is it possible that they could use this technology to build their own PMEM solutions for Power Systems?  If that happens, it would certainly be fascinating to see IBM harvesting the value of PMEM without the marketing and research investment that Intel has put into this.

[i]https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=POS03155USEN

[ii]https://www.youtube.com/watch?v=O0JUCjd_t_0

[iii]https://www.pcper.com/news/Storage/Intel-Launches-Optane-DC-Persistent-Memory-DIMMs-Talks-20TB-QLC-SSDs

[iv]https://searchstorage.techtarget.com/feature/3D-XPoint-memory-stumbles-in-race-to-ditch-DRAM-RRAM-may-step-up

https://www.anandtech.com/bench/product/1967?vs=2067

[v]https://www.anandtech.com/show/12258/intel-and-micron-to-discontinue-flash-memory-partnership

https://www.micron.com/products/advanced-solutions/3d-xpoint-technology

June 22, 2018 Posted by | Uncategorized | , , , , , , , | 3 Comments

Power Systems – Delivering best of breed scalability for SAP HANA

SAP quietly revised a SAP Note last week but it certainly made a loud sound for some.  Version 47 of https://launchpad.support.sap.com/#/notes/2188482 now says that OLTP workloads, such as Suite on HANA or S/4HANA are now supported on IBM Power Systems up to 24TB.  OLAP workloads, like BW HANA may be implemented on IBM Power Systems with up to 16TB for a single scale-up instance.  As noted in https://launchpad.support.sap.com/#/notes/2055470, scale-out BW is supported with up to 16 nodes bringing the maximum supported BW environment to a whopping 256TB.

As impressive as those stats are, it should also be noted that SAP also provided new core-to-memory (CTM) guidance with the 24TB OLTP system sized at 176-cores which results in 140GB/core, up from the previous 113.7GB/core at 16TB.  The 16TB OLAP system, sized at 192-cores, translates to 85.3GB/core, up from the previous 50GB/core for 4-socket and above systems.

By comparison, the maximum supported sizes for Intel Skylake systems are 6TB for OLAP and 12TB for OLTP which correlates to 27.4GB/core OLAP and 54.9GB/core OLTP.  In other words, SAP has published numbers which suggest Power Systems can handle workloads that are  2.7x (OLAP) and 2x (OLAP) the size of the maximum supported Skylake systems.  On the CTM side, this works out to a maximum of 3.1x (OLAP) and 2.6x (OLTP) better performance per core for Power Systems over Skylake.

Full disclosure, these numbers do not represent the highest scaling Intel systems.  In order to find them, you must look at the previous generation of systems.  Some may consider them obsolete, but for customers that must scale beyond 6TB/12TB (OLAP/OLTP) and are unwilling or unable to consider Power Systems, an immediate sunk investment may be their only choice.  (Note to customers in this undesirable predicament, if you really want to get an independent, third party verification of potential obsolesence, ask your favorite leasing companies, not associated or owned by the vendor, what residual value they would assume after 1 year for these systems vs. what they would assume for similar Skylake systems after 1 year.)

The previous “generation” of HPE Superdome, “X”, which as discussed in my last blog post shares 0% technology with Skylake based HPE Superdome “Flex”, was supported up to 8TB/16TB with 384 cores for both OLAP and OLTP, resulting in CTM of 21.3GB/42.7GB/core.  The SGI derived HPE MC990 X, which is the real predecessor to the new “Flex” system, was supported up to 4TB/20TB with 192 cores OLAP with 480 cores.

Strangely, “Flex” is only supported for HANA with 2 nodes or chassis where “MC990 X” was supported with up to 5 nodes.  It has been over 4 months since “Flex” was announced and at announcement date, HPE loudly proclaimed that “Flex” could support 48TB with 8 chassis/32 sockets https://news.hpe.com/hewlett-packard-enterprise-unveils-the-worlds-most-scalable-and-modular-in-memory-computing-platform/.  Since that time, some HPE reps have been telling customers that 32TB support with HANA was imminent.  One has to wonder what the hold up is.  First it took a couple of months just to get 128GB DIMM support. Now, it is taking even longer to get more than 2-node support for HANA.  If I were a potential HPE customer, I would be very curious and asking my rep about these delays (and I would have my BS detector set to high sensitivity).

Customers have now been presented with a stark contrast.  On one side, Power Systems has been on a roll; growing market share in HANA, regular increases in supported memory sizes, the ability to handle the largest single image HANA memory sizes of any vendor, outstanding mainframe derived reliability and radically better flexibility with built in virtualization and support for a maximum of 8 concurrent production HANA instances or 7 production with many dozens of non-prod HANA, application servers, non-HANA DBs and/or a wide variety of other applications supported in a shared pool, all at competitive price points.

On the other hand, Intel based HANA systems seem to be stuck in a rut with decreased maximum memory sizes (admittedly, this may be temporary), anemic increases in CTM, improved RAS but not yet to the league of Power Systems and a very questionable VMware based virtualization support filled with caveats, limitations, overhead and poor, at best, sharing of resources.

March 28, 2018 Posted by | Uncategorized | , , , , , , , , , , , , , , , | Leave a comment

Intel Skylake has been announced and the self-described HANA “market leader”, HPE, is curiously trailing the field

Intel announced general availability of their “Skylake” processor on the “Purely” platform last week.  Soon after, SAP posted certified HANA configurations for Lenovo and Fujitsu up to 8 sockets and 12TB memory for Suite on HANA (SoH) and S/4HANA (S4) and 6TB for BW on HANA (BWoH).  They also posted certified configurations for Dell and Cisco up to 4-socket systems with 6TB SoH/S4 and 3TB BWoH.  The certified configurations posted for HPE, which describes itself as the HANA market leader, only included up to 4-socket/3TB BWoH configurations, no configurations for SoH/S4 and nothing for any larger systems.

It is still early and more certified configurations will no doubt emerge over time, but these early results do beg the question, “what is going on with HPE?”  I checked the most recent press releases for HPE and they did not even mention the Skylake debut much less their certification with SAP HANA.  If you Google using the keywords, HPE, Skylake and HANA, you may find a few discussions about HPE’s acquisition of SGI and my previous blog posts with my speculation about Superdome’s demise and HPE’s misleading of customers about this impending event, but nothing from HPE.

So, I will share a little more speculation as to what this slow start for HPE in the Skylake space might portend.

Option 1 – HPE is not investing the funds necessary to certify all of their possible configurations and SoH/S4.  Anyone that has been involved with the HANA certification process will tell you that it is very time consuming and expensive.  As you can see from HPE’s primary Intel based competitors, they are all very eager to increase their market share and acted quickly.  Is HPE becoming complacent?  Are they having financial restrictions that have not been made public?

Option 2 – HPE’s technology limitations are becoming apparent.  The Converged System 500 is based on Proliant DL560/580 systems which support a maximum of 4 sockets.  These systems utilize Intel QPI and now UPI interconnect technologies, i.e. no custom ASICs or ccNUMA switches are required.  The CS900 based on the Superdome X and the MC990 X (SGI UV 300H) utilize custom ASICs and, in the case of Superdome X, a set of ccNUMA switches.  As I speculated previously, Superdome X is probably at end of life, so it may never see another certification on SAP’s HANA site.  As to the MC990 X, the crystal ball is a bit more hazy.  Perhaps HPE is trying to shoot for the moon and hit a number beyond the 20TB for SoH/S4 that is currently supported meaning a much longer and more complex set of certification tests.  Or perhaps they are running into technical challenges with the new ASICs required to support UPI.

Option 3 – MC990 X is going to officially become HPE’s only high end offering to support Skylake and subsequent processors and Superdome X is going to be announced at end of life.  If this were to happen, it would mean that anyone that had recently purchased such a system would have purchased a system that is immediately obsolete.

If Option 1 turns out to be true, one would have to concerned about HPE’s future in the HANA space.  If Option 2 turns out to be true, one would have to be really concerned about HPE’s future in the HANA space.  And if Option 3 turns out to be true, why would HPE be waiting?  The answer may be inventory.  If HPE has a substantial inventory of “old” Broadwell based blades and Superdome X chassis, they will undoubtedly want to unload these at the highest price possible and they know that the value of obsolete systems after such an announcement would drop into the below cost of manufacturing range.

So, you pick the most likely scenario.  Worst case for HPE is that they are just a little slow or shooting too high.  Worst case for customers is that they purchase a HANA system based on Superdome X and end up with a few hundred thousand dollar boat anchor.  If you work for a company considering the purchase of an HPE Superdome X solution, you may want to ask about its future and, if you find it is at end of life, select another solution for your SAP HANA requirements.

Inevitably, more systems will be published on SAP’s certification page, https://www.sap.com/dmc/exp/2014-09-02-hana-hardware/enEN/appliances.html#viewcount=100&categories=certified%2CIntel%20Skylake%20SP .  When that happens, especially if any of my predictions turn out to be true or if they are all wrong and another scenario emerges, I will post an update.

July 20, 2017 Posted by | Uncategorized | , , , , , , , , , , , , | Leave a comment

3D XPoint Memory – The best thing for SAP HANA since HANA was invented?

At #SapphireNow, the Intel booth was all atwitter about the new “game changer”, “revolutionary”, “future of computing”, “best thing since the wheel” (ok, I made that last one up).  Yes, they were thrilled with 3D XPoint Optane memory.[i]  It is being positioned as persistent memory, like SSD but much faster and which can take the place of real, a.k.a. DRAM, memory … eventually.  Paraphrasing them, “You will be able to replace conventional memory with 3D XPoint memory at almost the same speed but which gives you the ability to restart your system after failure in a matter of seconds, not minutes or hours, because the entire HANA image will be stored in persistent memory, not on disk or SSDs.”

This sounds fantastic as long as we completely ignore reality.  Let’s dissect the above sentence.

“almost the same speed” – current speculation is that 3DXPoint memory will be about 10 times slower than conventional memory.  That is WAY better than external SSD storage, which is around 1000 times slower, but for memory resident applications, like HANA, 10 times slower will result in at least a 10x performance reduction for HANA.  Remember, we have no idea how this might affect an application which expects very fast access to memory.

“restart your system after failure:” – silly me, I thought the idea was to prevent failure in the first place.  I am curious how often system failure is caused by memory errors or any other cause for which diagnostics might be required to evaluate the underlying problem as well a repair action to fix that problem.  Then the question is in which scenario is a customer willing to circumvent diagnostics and return the system to productive use.  This also assumes that customers are willing to run mission critical systems without any sort of HA solution such as HANA System Replication or HANA Host Auto-Failover.  The use of an HA solution would fail-over production to a secondary system which means that any memory image on the primary system would be out of date almost instantly.

“restart … in seconds” – So, your system has failed for unknown reasons and you are willing to forgo any sort of evaluation of the underlying cause.  So far so good.  So, Linux is capable of restarting and keeping the memory image as it was before hand and utilizing persistent main memory? Not entirely, but with RHEL 7.3 (not supported for HANA yet), using special device drivers applications may be rewritten to utilize “pmem” for pseudo storage devices.[ii]  And HANA is capable of restarting as well from whatever point it was in at the time of failure.  Also, did not know HANA could do this and am surprised that SAP prioritized fast restart ahead of the long laundry list of customer provided requirements … which I doubt they did.  And HANA can figure out what transactions were in flight at the time of failure, which ones had made some changes to memory, but not all, e.g. started to insert data into a delta table but perhaps had not completed this action at time of failure?  Totally wicked!! … and total fantasy, at least for now.

You can easily imagine a variety of other conditions where columns are being updated, e.g. during a delta merge, but have not finished in which some columns contain updated elements and others do not.  I am not saying these are insurmountable problems, but considering that you can’t even make a change to the size of a HANA system without restarting HANA currently, it is a massive stretch to imagine how SAP has or is willing to invest the time and effort to make this work for a highly questionable benefit with likely severe performance degradation.

So, 3D XPoint memory as a replacement for conventional memory is clearly all hype, but don’t expect anyone from Intel or their proponents to tell you this.  How about as a technology for much faster SSDs?  Now we are talking!  I doubt there is any reason why this will not be quickly adopted by disk subsystem vendors and available from multiple sources.

As to whether HANA workloads will benefit, that is a different story.  Remember, HANA is a read-once workload.  Once a column is loaded into memory, it is never read again until unloaded and this should only occur if the memory subsystem is undersized or the system is restarted after maintenance.  So, fast storage is useful for restarts, but super-fast storage is only needed when a system must return to full operation after maintenance very quickly and without any performance degradation, i.e. every column loaded into memory, in 10 minutes or so.  Just as a point of comparison, IBM ran a test with 10 NVMe cards and delivered about 1TB per minute when restarting HANA.  To the best of my knowledge, few customers have expressed more than a passing interest in this capability.  I could imagine a scenario in which customers are willing to put a somewhat recent tier of data, e.g. 1 to 2 year old data, on persistent main memory, with perhaps external, and orders of magnitude slower, storage used for older data.  Once again, this is a nice concept but until SAP writes or adopts code to enable this, it is just a theory.

As to writes, most enterprise storage subsystems can deliver response times that are twice as fast as SAP requires.  IBM SVC (SAN Volume Controller) connected to an IBM Power System has been tested in real customer installations and has delivered the fastest times of any storage subsystem in the industry with a peak latency of only 161us (microseconds) for 4K block size log writes as measured by HWCCT or over 6 times better latency than what SAP requires.   SVC is part of a family of products including V7000, V9000 and Spectrum Virtualization Software which all utilize similar concepts and software.

In other words, you don’t have to wait for tomorrow to get fast restarts and minimized transactional log writes, you just need to select the write infrastructure partner, IBM.

[i] https://www.theregister.co.uk/2017/05/17/coming_xeon_sps_will_run_sap_hana_16_times_faster/
[ii] https://developers.redhat.com/blog/2016/12/05/configuring-and-using-persistent-memory-rhel-7-3/

June 12, 2017 Posted by | Uncategorized | , , , , , | 4 Comments

SAP performance report sponsored by HP, Intel and VMware shows startling results

Not often does a sponsored study show the opposite of what was intended, but this study does.  An astute blog reader alerted me about a white paper sponsored by HP, VMware and Intel by an organization called Enterprise Strategy Group (ESG).  The white paper is entitled “Lab Validation Report – HP ProLiant DL980, Intel Xeon, and VMware vSphere 5 SAP Performance Analysis – Effectively Virtualizing Tier-1 Application Workloads – By Tony Palmer, Brian Garrett, and Ajen Johan – July 2011.”  The words that they used to describe the results are, as expected, highly complementary to HP, Intel and VMware.  In this paper, ESG points out that almost 60% of the respondents to their study have not virtualized “tier 1” applications like SAP yet but expect a rapid increase in the use of virtualization.  We can only assume that they only surveyed x86 customers as 100% of Power Systems customers are virtualized since the PowerVM hypervisor is baked into the hardware and firmware of every system and can’t be removed.  Nevertheless, it is encouraging that customers are moving in the right direction and that there is so much potential for the increased use of virtualization.

 

ESG provided some amazing statistics regarding scalability.  ESG probably does not realize just how bad this makes VMware and HP look, otherwise, they probably would not have published it.  They ran an SAP ECC 6.0 workload which they describe as “real world” but for which they provide no backup as to what this workload was comprised of, so it is possible that a given customer’s workload may be even more intensive than the one tested.  They ran a single VM with 4 vcpu, then 8, 16 and 32.  They show both the number of users supported as well as the IOPS and dialog response time.  Then, in their conclusions, they state that scaling was nearly linear.  This data shows that when scaling from 4 to 32 cores, an 8x increase, the number of users supported increased from 600 to 3,000, a 5x increase.  Put a different way, 5/8 = .625 or 62.5% scalability.  Not only is this not even remotely close to linear scaling, but it is an amazing poor level of scalability.  IOPS, likewise, increased from 140 to 630 demonstrating 56.3% scalability and response time went from .2 seconds to 1 second, which while respectable, was 5 times that of the 4 vcpu VM.

 

ESG also ran a non-virtualized test with 32 physical cores.  In this test, they achieved only 4,400 users/943 IOPS.  Remember, VMware is limited to 32 vcpu which works out to the equivalent of 16 cores.  So, with twice the number of effective physical cores, they were only able to support 46.7% more users and 49.7% more IOPS.  To make matters much worse, response time almost doubled to 1.9 seconds.

 

ESG went on to make the following statement: “Considering that the SAP workload tested utilized only half of the CPU and one quarter of the available RAM installed in the DL980 tested, it is not unreasonable to expect that a single DL980 could easily support a second virtualized SAP workload at a similarly high utilization level and/or multiple less intensive workloads driven by other applications.”  If response time is already borderline poor with VMware managing only a single workload, is it reasonable to assume that response time will go up or down if you add a second workload?  If IOPS are not even keeping pace with the poor scalability of vcpu, it is reasonable to assume that IOPS will all of a sudden start improving faster?  If you have not tested the effect of running a second workload, is it reasonable to speculate what might happen under drastically different conditions?  This is like saying that on a hot summer day, an air conditioner was able to maintain a cool temperature in a sunny room with half of the chairs occupied and therefore it is not “unreasonable” to assume that it could do the same with all chair occupied.  That might be the case, but there is absolutely no supporting evidence to support such a speculation.

 

ESG further speculates that because this test utilized default values for BIOS, OS, SAP and SQL Server, performance would likely be higher with tuning.  … And my car will probably go faster if I wash it and add air to the tires, but by how much??  In summary and I am paraphrasing, ESG says that VMware, Intel processors and HP servers are ready for SAP primetime providing reliability and performance while simplifying operations and lowering costs.  Interesting that they talk about reliability yet they, once again, provide no supporting evidence and did not mention a single thing about reliability earlier in the paper other than to say that the HP DL980 G7 delivers “enhanced reliability”.  I certainly believe every marketing claim that a company makes without data to back it up, don’t you?

 

There are three ways that you can read this white paper.

  1. ESG has done a thorough job of evaluating HP x86 systems, Intel and VMware and has proven that this environment can handle SAP workloads with ease
  2. ESG has proven that VMware has either incredibly poor scalability or high overhead or both
  3. ESG has limited credibility as they make predictions for which they have no data to support their conclusions

 

While I might question how ESG makes predictions, I don’t believe that they do a poor job at performance testing.  They seem to operate like an economist, i.e. they are very good at collecting data but make predictions based on past experience, not hard data.  When is the last time that economists correctly predicted market fluctuations?  If they did, they would all be incredibly rich!

 

I think it would be irresponsible to say that VMware based environments are incapable of handling SAP workloads.  On the contrary, VMware is quite capable, but there are significant caveats.  VMware does best with small workloads, e.g. 4 to 8  vcpu, not with larger workloads e.g. 16 to 32 vcpu.  This means if a customer utilizes SAP on VMware, they will need more and smaller images than they would on excellent scaling platforms like IBM Power Systems, which drives up management costs substantially and reduces flexibility.  By way of comparison, published SAP SD 2-tier benchmark results for IBM Power Systems utilizing POWER7 technology show 99% scalability when comparing the performance of a 16-core to a 32-core system at the same MHz, 89.3% scalability when comparing a 64-core to a 128-core system with a 5% higher MHz, which when normalized to the same MHz shows 99% scalability even at this extremely high performance level.

 

The second caveat for VMware and HP/Intel systems is in that area that ESG brushed over as if it was a foregone conclusion, i.e. reliability.  Solitaire Interglobal examined data from over 40,000 customers and found that x86 systems suffer from 3 times or more system outages when comparing Linux based x86 systems to Power Systems and up to 10 times more system outages when comparing Windows based x86 systems to Power Systems.  They also found radically higher outage durations for both Linux and Windows compared to Power and much lower overall availability when looking at both planned and unplanned outages in general: http://ibm.co/strategicOS and specifically in virtualized environments: http://ibm.co/virtualizationplatformmatters.  Furthermore, as noted in my post from late last year, https://saponpower.wordpress.com/2011/08/29/vsphere-5-0-compared-to-powervm/, VMware introduces a number of single points of failure when mission critical applications demand just the opposite, i.e. the elimination of single points of failure.

 

I am actually very happy to see this ESG white paper, as it is has proven how poor VMware scales for large workloads like SAP in ways that few other published studies have ever exposed.  Power Systems continues to set the bar very high when it comes to delivering effective virtualization for large and small SAP environments while offering outstanding, mission critical reliability.  As noted in https://saponpower.wordpress.com/2011/08/15/ibm-power-systems-compared-to-x86-for-sap-landscapes/, IBM does this while maintaining a similar or lower TCO when all production, HA and non-production systems, 3 years of 24x7x365 hardware maintenance, licenses and 24x7x365 support for Enterprise Linux and vSphere 5.0 Enterprise Plus … and that analysis was done back when I did not have ESG’s lab report showing how poorly VMware scales.  I may have to revise my TCO estimates based on this new data.

October 23, 2012 Posted by | Uncategorized | , , , , , , , , , | 3 Comments