Scale-up vs. scale-out architectures for SAP HANA – part 2
S/4HANA is enabled for scale-out up to 4 nodes plus one hot-standby. Enablement does not mean it is easy or advisable. SAP states clearly: “We recommend using scale-up configurations as long as this is economically justifiable, taking operational costs and drawbacks into account.”[i] This same note goes on to say: “Limited knowledge about S/4HANA customer scenarios using scale-out is currently available.”
For very large customers, e.g. those for which an S/4HANA system’s memory requirement is currently predicted to exceed 24TB, scale-out may be the best option. “Best”, of course, implies that there are other options, which will be discussed later in this post.
It is reasonable to ask why SAP offers such conditional advice. We can only speculate, since SAP does not provide a direct explanation, but some insight may be gained by reading the SAP note on scale-out sizing.[ii] Unlike analytical applications such as BW/4HANA, partitioning of S/4HANA tables across nodes is not permitted. Instead, all tables of a particular module are grouped together, and the entire group must be placed on an individual node in the cluster.
Let’s consider a simple example of three commonly used modules, FI, MM and SD (Financial, Materials, Sales). The tables associated with each module belong to their respective groups. Placing each on a different node may help to minimize the size of any one node, but several issues arise.
- Each group will probably be a different size. This is fully supported, but the uneven load distribution may result in one node running at high utilization while another is barely using any capacity. Not only does this waste compute capacity, power and cooling, it could also result in inferior performance on the hot node.
- Since most customers prefer to size all nodes in a cluster identically, considerable memory overcapacity might result, further driving up infrastructure costs.
- Transactions often do not fit comfortably within a single module; e.g. a sales order might result in financial tables being updated with billing, accounts receivable and revenue data, and materials tables being adjusted to decrement available stock. If a transaction is running on node 1 (the master node) and needs to access or update tables on nodes 2 and 3, those communications run across a network. As with the BW example in the previous blog post, each communication is at least 30 times slower across a network than across memory.
It is important to consider that every transaction that comes into an S/4HANA system connects to the index server on the master node, which then distributes queries to the appropriate index server. This means that every transaction not handled directly by the master node involves at least one send and one receive, with the associated 30-times-slower latency.
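For readers who want to see this distribution for themselves, below is a minimal sketch (Python with SAP’s hdbcli client; host, user and password are placeholders) that sums loaded column-store memory per host using the M_CS_TABLES monitoring view. A lopsided result is exactly the uneven load distribution described above.

```python
# Minimal sketch: measure column-store memory per host in a scale-out cluster.
# Connection details are placeholders; assumes the M_CS_TABLES monitoring view.
from hdbcli import dbapi  # SAP's Python client for HANA

conn = dbapi.connect(address="hana-master.example.com",  # hypothetical host
                     port=30015, user="MONITOR", password="********")
cur = conn.cursor()

# Total column-store memory currently loaded per host (a rough proxy for load skew)
cur.execute("""
    SELECT host, ROUND(SUM(memory_size_in_total) / 1024 / 1024 / 1024, 1) AS gb_in_use
    FROM m_cs_tables
    GROUP BY host
    ORDER BY gb_in_use DESC
""")
for host, gb_in_use in cur.fetchall():
    print(f"{host}: {gb_in_use} GB of column-store tables loaded")

cur.close()
conn.close()
```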
Some cross-node latency may be reduced by collocating related groups, resulting in fewer total nodes, and/or by replicating some tables. Unfortunately, replicating a table would break a fundamental SAP rule noted in SAP Note 2408419 (see footnote [i] below): all tables of a group must be located on the same node.
As with the BW example, what works well for one scenario may not work well for another. One of the significant advantages of S/4HANA over Business Suite 7 is the consolidation and dramatic reduction of tables resulting in fewer, much larger tables. Conversely, this makes table distribution in a scale-out cluster much more challenging. It is not hard to imagine that performance management could be quite a task in a scale-out scenario.
So, if scale-out is not an option for many/most customers, what should be done if approaching a significant memory barrier? Options include:
- Cleanup, use of hybrid LOBs, index optimization, etc.
- Archiving data to reduce the size of the system
- Eliminating duplicate data or easily reproduced data, e.g. iDocs, data from Hadoop
- Usage of Data Aging[iii]
- Sizing memory smaller than predicted
- Requesting an exception from SAP to size the system larger than officially supported
Cleaning up your system and getting rid of various unnecessary memory consumers should be the first approach undertaken.[iv] Remember, what might have been important with a conventional DB may no longer be needed with S/4HANA, or a better technique may exist. The expected memory reduction is usually shown as part of an ERP sizing report.
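As a starting point for such a cleanup, a sketch like the following (same hedged hdbcli pattern, placeholder credentials) lists the largest column-store tables so you can see where archiving, hybrid LOBs or index optimization would pay off most.

```python
# Minimal sketch: list the largest column-store tables as cleanup/archiving candidates.
# Assumes the M_CS_TABLES monitoring view; connection details are placeholders.
from hdbcli import dbapi

conn = dbapi.connect(address="hana-master.example.com",
                     port=30015, user="MONITOR", password="********")
cur = conn.cursor()
cur.execute("""
    SELECT TOP 20 schema_name, table_name,
           ROUND(memory_size_in_total / 1024 / 1024 / 1024, 2) AS gb,
           record_count
    FROM m_cs_tables
    ORDER BY memory_size_in_total DESC
""")
for schema, table, gb, rows in cur.fetchall():
    print(f"{schema}.{table}: {gb} GB, {rows} rows")
conn.close()
```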
Archiving is another obvious approach, but since archived data is kept on very slow media (compared to in-memory data) and cannot be changed, the decision as to what to archive and where to place it can be very challenging for some organizations.
iDocs are, by definition, intermediate documents and are used primarily for sending and receiving documents to/from third parties, e.g. sales orders, purchase orders, invoices, shipping notices. Every iDoc sent or received should have a corresponding transaction within the SAP system which means that it is essentially a duplicate record once processed by the SAP system. Many customers keep these documents indefinitely just in case any disputes occur with those third parties. Often, these iDocs just sit around collecting digital dust and may be prime candidates for deletion or archival. Likewise, data from an external source, e.g. Hadoop, should still exist in that source and could potentially be deleted from HANA.
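A hedged sketch of the kind of check that identifies such candidates: counting iDoc control records by year, assuming the standard EDIDC control table with its CREDAT creation-date field; the SAPABAP1 schema name is a placeholder.

```python
# Minimal sketch: count iDoc control records (EDIDC) by creation year to spot
# old iDocs that may be deletion or archiving candidates. Schema and connection
# details are placeholders; CREDAT is assumed to be stored as YYYYMMDD.
from hdbcli import dbapi

conn = dbapi.connect(address="hana-master.example.com",
                     port=30015, user="MONITOR", password="********")
cur = conn.cursor()
cur.execute("""
    SELECT SUBSTRING(credat, 1, 4) AS yr, COUNT(*) AS idocs
    FROM sapabap1.edidc
    GROUP BY SUBSTRING(credat, 1, 4)
    ORDER BY yr
""")
for year, idocs in cur.fetchall():
    print(f"{year}: {idocs} iDocs")
conn.close()
```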
Data Aging only covers a subset of data objects and requires some effort to utilize.[v] By default, the ABAP server adds “WITH RANGE_RESTRICTION('CURRENT')” to all queries to prevent unintended access to aged or cold partitions, which means that to access aged data, a query must specify which aged partition to access. This implies special transactions, or at least different training for users who need to access aged data. Data Aging does allow aged data to be updated, so it may be more desirable than archiving in some cases. Aged data resides on storage devices, which are many orders of magnitude slower than memory; this can be mitigated, to some extent, by faster media, e.g. NVMe drives on PCIe cards. Unfortunately, Data Aging has not been implemented by many customers, meaning a potentially steep learning curve.
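To illustrate the range-restriction behavior, here is a hedged sketch of the same count run against the hot partition only and then including aged partitions. The WITH RANGE_RESTRICTION syntax follows the HANA SQL reference as I understand it, and the table name (an FI document table) is purely illustrative; the table must actually be set up for Data Aging for the clause to matter.

```python
# Minimal sketch: query a data-aged table with and without aged partitions.
# Syntax is assumed per the HANA SQL reference; schema/table are placeholders.
from hdbcli import dbapi

conn = dbapi.connect(address="hana-master.example.com",
                     port=30015, user="MONITOR", password="********")
cur = conn.cursor()

queries = {
    "current (hot) partition only":
        "SELECT COUNT(*) FROM sapabap1.bkpf WITH RANGE_RESTRICTION('CURRENT')",
    "including aged partitions back to 2015":
        "SELECT COUNT(*) FROM sapabap1.bkpf WITH RANGE_RESTRICTION('2015-01-01')",
}
for label, sql in queries.items():
    cur.execute(sql)
    print(f"{label}: {cur.fetchone()[0]} rows")
conn.close()
```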
Deliberately undersizing a system is not recommended by SAP, and I am not recommending it either. That said, if an implementation is approaching a memory boundary and scaling to a larger VM or platform is not possible (physically, politically or financially), then this technique may be considered. It comes with some risk, however, so it should be treated as a last resort only. HANA enables “lazy loading” of columns,[vi] whereby columns are not loaded into memory until needed. If your system has a large number of columns which consume space on disk but are never or rarely accessed, the memory reserved for those columns will likewise go unused or underused. HANA will also attempt to unload columns, based on a least-frequently-used algorithm, when the system runs out of allocatable memory. Unless a problem occurs, a system configured with less memory than the sizing report predicts will start without issue and unload columns as needed. The penalty comes when a column that is not memory resident is accessed: other columns must first be unloaded and the entire requested column loaded, i.e. significant latency is incurred on that first access. As mentioned earlier, this should be considered only in a worst-case scenario and only if scaling up or out is not desired or not an option.
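One hedged way to gauge whether an undersized system is thrashing is to watch the unload statistics. The sketch below counts recent column unloads by reason, assuming the M_CS_UNLOADS monitoring view documented for recent HANA revisions; frequent low-memory unloads are the warning sign that the first-access latency penalty described above is being paid regularly.

```python
# Minimal sketch: count recent column unloads by reason. A steady stream of
# unloads caused by memory pressure suggests the system is sized too small.
# Assumes the M_CS_UNLOADS monitoring view; connection details are placeholders.
from hdbcli import dbapi

conn = dbapi.connect(address="hana-master.example.com",
                     port=30015, user="MONITOR", password="********")
cur = conn.cursor()
cur.execute("""
    SELECT reason, COUNT(*) AS unloads
    FROM m_cs_unloads
    WHERE unload_time > ADD_DAYS(CURRENT_TIMESTAMP, -7)
    GROUP BY reason
""")
for reason, unloads in cur.fetchall():
    print(f"{reason}: {unloads} column unloads in the last 7 days")
conn.close()
```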
Lastly, requesting an exception from SAP to allow a system size greater than officially supported may be a viable choice for customers expected to exceed current maximums. This may not be without difficulty: when you embark on a journey where few or none have gone before, you will inevitably run into obstacles that others have not yet encountered. Dispatch mechanisms, delta merge operations, transaction log latency, savepoint I/O throughput, system startup times, backup/recovery and system replication are among the more significant areas that would be stressed, and some might break.
My advice: use scale-up only, in all S/4HANA cases, unless the predicted memory for the immediate planning horizon exceeds the official SAP maximum supported size. Before considering scale-out, use every available tool to reduce the size of the system and ask SAP for an exception if the resulting size is still above the maximum. Lastly, remember that SAP and its hardware partners are constantly working to enable larger HANA system sizes. If the size required today fits within the largest supported system but is expected to exceed the limit over time, it may be reasonable to start your implementation or migration effort now with the expectation that the maximum will be increased by the time you need it. Admittedly, this is taking a risk, but one that may be tolerable; and if the limit is not raised in time, scale-out is still an option.
[i]2408419 – SAP S/4HANA – Multi-Node Support
[ii]2428711 – S/4HANA Scale-Out Sizing
[iii]2416490 – FAQ: SAP HANA Data Aging in SAP S/4HANA
[iv]1999997 – FAQ: SAP HANA Memory, FAQ 5
[v]1872170 – Business Suite on HANA and S/4HANA Sizing Report
[vi]https://www.sap.com/germany/documents/2016/08/205c8299-867c-0010-82c7-eda71af511fa.html
Power Systems – Delivering best of breed scalability for SAP HANA
SAP quietly revised a SAP Note last week, but it certainly made a loud sound for some. Version 47 of https://launchpad.support.sap.com/#/notes/2188482 now says that OLTP workloads, such as Suite on HANA or S/4HANA, are supported on IBM Power Systems up to 24TB. OLAP workloads, like BW on HANA, may be implemented on IBM Power Systems with up to 16TB for a single scale-up instance. As noted in https://launchpad.support.sap.com/#/notes/2055470, scale-out BW is supported with up to 16 nodes, bringing the maximum supported BW environment to a whopping 256TB.
As impressive as those stats are, it should be noted that SAP also provided new core-to-memory (CTM) guidance, with the 24TB OLTP system sized at 176 cores, which results in 140GB/core, up from the previous 113.7GB/core at 16TB. The 16TB OLAP system, sized at 192 cores, translates to 85.3GB/core, up from the previous 50GB/core for 4-socket and larger systems.
By comparison, the maximum supported sizes for Intel Skylake systems are 6TB for OLAP and 12TB for OLTP, which correlates to 27.4GB/core OLAP and 54.9GB/core OLTP. In other words, SAP has published numbers which suggest Power Systems can handle workloads that are 2.7x (OLAP) and 2x (OLTP) the size of the maximum supported Skylake systems. On the CTM side, this works out to a maximum of 3.1x (OLAP) and 2.6x (OLTP) better performance per core for Power Systems over Skylake.
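For anyone who wants to check the arithmetic behind these ratios, the snippet below reproduces the GB/core figures; the 224-core count for the Skylake systems is my own inference (8 sockets x 28 cores) from the published ratios, not an SAP statement.

```python
# Reproduce the core-to-memory (GB/core) figures quoted above.
# The 224-core Skylake configuration is an inference, not an SAP-published number.
TB = 1024  # GB per TB

systems = {
    "Power OLTP (24TB / 176 cores)":   (24 * TB, 176),
    "Power OLAP (16TB / 192 cores)":   (16 * TB, 192),
    "Skylake OLTP (12TB / 224 cores)": (12 * TB, 224),
    "Skylake OLAP (6TB / 224 cores)":  (6 * TB, 224),
}
for name, (gb, cores) in systems.items():
    print(f"{name}: {gb / cores:.1f} GB/core")
# -> roughly 139.6, 85.3, 54.9 and 27.4 GB/core respectively
```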
Full disclosure: these numbers do not represent the highest-scaling Intel systems. To find those, you must look at the previous generation of systems. Some may consider them obsolete, but for customers that must scale beyond 6TB/12TB (OLAP/OLTP) and are unwilling or unable to consider Power Systems, an immediately sunk investment may be their only choice. (Note to customers in this undesirable predicament: if you really want independent, third-party verification of potential obsolescence, ask your favorite leasing companies, not associated with or owned by the vendor, what residual value they would assume after 1 year for these systems vs. what they would assume for similar Skylake systems after 1 year.)
The previous “generation” of HPE Superdome, “X”, which as discussed in my last blog post shares 0% technology with the Skylake-based HPE Superdome “Flex”, was supported up to 8TB/16TB with 384 cores for both OLAP and OLTP, resulting in CTM of 21.3GB/42.7GB per core. The SGI-derived HPE MC990 X, which is the real predecessor to the new “Flex” system, was supported up to 4TB (OLAP) with 192 cores and 20TB (OLTP) with 480 cores.
Strangely, “Flex” is only supported for HANA with 2 nodes or chassis, whereas the “MC990 X” was supported with up to 5 nodes. It has been over 4 months since “Flex” was announced, and at announcement HPE loudly proclaimed that “Flex” could support 48TB with 8 chassis/32 sockets: https://news.hpe.com/hewlett-packard-enterprise-unveils-the-worlds-most-scalable-and-modular-in-memory-computing-platform/. Since that time, some HPE reps have been telling customers that 32TB support with HANA was imminent. One has to wonder what the holdup is. First it took a couple of months just to get 128GB DIMM support. Now it is taking even longer to get more than 2-node support for HANA. If I were a potential HPE customer, I would be very curious and asking my rep about these delays (and I would have my BS detector set to high sensitivity).
Customers have now been presented with a stark contrast. On one side, Power Systems has been on a roll: growing market share in HANA, regular increases in supported memory sizes, the ability to handle the largest single-image HANA memory sizes of any vendor, outstanding mainframe-derived reliability, and radically better flexibility with built-in virtualization, supporting a maximum of 8 concurrent production HANA instances, or 7 production instances alongside many dozens of non-production HANA instances, application servers, non-HANA DBs and/or a wide variety of other applications in a shared pool, all at competitive price points.
On the other hand, Intel-based HANA systems seem to be stuck in a rut, with decreased maximum memory sizes (admittedly, this may be temporary), anemic increases in CTM, improved RAS that is not yet in the league of Power Systems, and very questionable VMware-based virtualization support filled with caveats, limitations, overhead and, at best, poor sharing of resources.