TDI Phase 5 – SAPS-based sizing brings better TCO to new and existing Power Systems customers
SAP made a fundamental and incredibly important announcement this week at SAP TechEd in Las Vegas: TDI Phase 5 – SAPS-based sizing for HANA workloads. Since its debut, HANA has been sized on a strict memory-to-core ratio determined by SAP based on workloads and platform characteristics, e.g. processor generation, MHz, interconnect technology, etc. This may have made some sense in the early days, when little was known about the loads customers were likely to experience and SAP still had high hopes of turning every customer employee into a knowledge worker with direct access to analytics. Over time, with very rare exceptions, it turned out that CPU loads were far lower than the ratios might have predicted.
I have run into only one customer in the past two years that was able to drive high utilization of their HANA systems, and that was a customer running an x86 BW implementation with an impressively high number of concurrent users at one peak point in their monthly cycle. Most customers have experienced just the opposite: consistently low utilization, regardless of technology.
For many customers, especially those running x86 systems, this has not been an issue. First, it is not a significant departure from what many have experienced for years, even those running VMware. Second, to compensate for relatively low memory and socket-to-socket bandwidth combined with high-latency interconnects, many x86 systems work best with an excess of CPU. Third, many x86 vendors have focused on HANA appliances, which are rarely virtualized and are therefore often single-instance systems.
IBM Power Systems customers, by comparison, have been almost universal in their concern about poor utilization. These customers have historically driven high utilization, often over 65%. Power has up to 5 times the memory bandwidth per socket of x86 systems (without compromising reliability) and very wide, parallel interconnect paths with very low latencies. HANA has never been offered as an appliance on Power Systems; it is offered only using the Tailored Datacenter Integration (TDI) approach. As a result, customers view on-premise Power Systems as a sort of utility, i.e. they expect to use them as they see fit and drive as much workload through them as possible while maintaining the Service Level Agreements (SLAs) their end users require. The idea of running a system at 5%, or even 25%, utilization is almost an affront to these customers, but that is what they have experienced under the memory-to-core restrictions previously in place.
IBM’s virtualization solution, PowerVM, enabled SAP customers to run multiple production workloads (up to 8 on the largest systems), or a mix of production workloads (up to 7) alongside a shared pool of CPU resources within which an almost unlimited mix of VMs could run, including non-production HANA, application servers, non-SAP workloads and even other OS workloads, e.g. AIX and IBM i. In this mixed mode, some of the excess CPU resource not used by the production workloads could be utilized by the shared-pool workloads. This helped drive up utilization somewhat, but not enough for many.
These customers would like to do what they have historically done: negotiate response-time agreements with their end-user departments, size their systems to meet those agreements, and resize if they need more capacity or end up with too much.
The newly released TDI Overview document (http://bit.ly/2fLRFPb) describes the new methodology: “SAP HANA quicksizer and SAP HANA sizing reports have been enhanced to provide separate CPU and RAM sizing results in SAPS”. I was able to verify Quicksizer showing SAPS, but not the sizing reports. An SAP expert I ran into at TechEd suggested that getting the sizing reports to produce SAPS would be a tall order, since they would have to include a database of SAPS capacity for every system on the market, as well as the number of cores and MHz of each one. (In a separate blog post, I will share how IBM can help customers calculate utilized SAPS on existing systems.) Customers are instructed to work with their hardware partner to determine the number of cores required based on the SAPS projected above. The document goes on to state: “The resulting HANA TDI configurations will extend the choice of HANA system sizes; and customers with less CPU intensive workloads may have bigger main memory capacity compared to SAP HANA appliance based solutions using fixed core to memory sizing approach (that’s more geared towards delivery of optimal performance for any type of a workload).”
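To make that last step concrete, here is a minimal sketch of the arithmetic a hardware partner would perform. The SAPS figures are hypothetical placeholders for illustration only; the actual per-core rating must come from the vendor’s published SAPS benchmark results for the specific server.

```python
import math

def cores_for_saps(required_saps, saps_per_core):
    """Round a SAPS requirement up to a whole number of cores."""
    return math.ceil(required_saps / saps_per_core)

# Hypothetical numbers for illustration only; real values come from the
# SAP sizing report and the server's published SAPS benchmark results.
required_saps = 140_000   # assumed CPU requirement from Quicksizer
saps_per_core = 2_000     # assumed per-core SAPS rating of the target server

print(cores_for_saps(required_saps, saps_per_core))  # -> 70 cores
```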
Using a SAPS-based methodology will be a good start and may result in fewer cores for the same workload than would previously have been calculated from a memory/core ratio. Customers that wish to allocate more or less CPU to those workloads will now have that option, meaning that even more significant reductions in CPU may be possible. This will likely result in much more efficient use of CPU resources, more capacity available to other workloads and/or the ability to size systems with fewer resources to drive down their cost. Either way, TCO improves substantially by reducing the number and size of systems, along with the associated datacenter and personnel costs.
Existing Power customers will undoubtedly be delighted by this news. They will be able to start experimenting with different core allocations, and most will find they can decrease their current HANA VM sizes substantially. With the resources no longer required to support production, other workloads currently implemented on external systems may be consolidated onto the newly right-sized system. Application servers, central services, Hadoop, HPC, AI, etc. are all candidates for this sort of consolidation.
Here is a very simple example. A hypothetical customer has two production workloads, BW/4HANA and S/4HANA, which require 4TB and 3TB of memory respectively. For each, HA is required, as are Dev/Test, Sandbox and QA. Prior to TDI Phase 5, using Power Systems, the 4TB BW system would require roughly 82 cores due to the 50GB/core ratio, and the S/4 workload roughly 33 cores due to the 96GB/core ratio. Including HA and non-prod, the systems might look something like:
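As a quick check on those core counts, the ratio arithmetic is straightforward; a short sketch, taking 1TB as 1024GB:

```python
import math

def cores_by_ratio(memory_tb, gb_per_core):
    """Pre-Phase-5 sizing: core count dictated by the memory-to-core ratio."""
    return math.ceil(memory_tb * 1024 / gb_per_core)

print(cores_by_ratio(4, 50))  # BW/4HANA at 50GB/core -> 82 cores
print(cores_by_ratio(3, 96))  # S/4HANA at 96GB/core  -> 32 cores, i.e. the "roughly 33" above
```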
Note the relatively small number of cores available in the shared pool (likely less than optimal) and the total number of cores in the system. Some customers may have elected to move up to an even larger system, or to utilize additional systems, as a result. Even so, this was already a pretty compelling TCO and consolidation story for customers.
With SAPS-based sizing, the BW workload may require only 70 cores and S/4 only 21 cores (both are estimates based on early sizing examples; proper analysis of the SAP sizing reports and the per-core SAPS ratings of the servers is required to determine actual core requirements). The resulting architecture could look like:
Note the smaller core count in each system. By switching to this methodology, lower-cost CPU sockets may be employed and processor activation costs decreased by 24 cores per system. The number of cores in the shared pool remains the same, however, so there is still room for improvement.
During a landscape session at SAP TechEd in Las Vegas, an SAP expert stated that customers will be responsible for performance, and that CPU allocation will not be enforced by SAP through HWCCT as it had been in the past. This means customers will be able to determine the number of cores to allocate to their various instances. It is conceivable that some customers will find that instead of the 70 cores in the above example, 60, 50 or even fewer cores are sufficient for BW, with decreased requirements for S/4HANA as well. A customer choosing this more aggressive, hypothetical approach might see the following:
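To illustrate how the shared pool grows as production allocations shrink, here is a rough sketch. The 120-core system size and the 15-core S/4 figure are assumptions for illustration only, not sizing guidance; each system is assumed to host one production instance plus the HA standby of the other, so both workloads’ allocations count against it.

```python
TOTAL_CORES = 120  # assumed cores per system, for illustration only

# Each system hosts one production instance plus the HA standby of the
# other, so both workloads' core allocations are counted per system.
scenarios = {
    "memory-to-core ratio sizing": 82 + 33,  # pre-Phase 5
    "SAPS-based sizing":           70 + 21,  # early Phase 5 estimates
    "customer-tuned sizing":       50 + 15,  # hypothetical further tuning
}

for name, prod_cores in scenarios.items():
    pool = TOTAL_CORES - prod_cores
    print(f"{name}: production cores = {prod_cores}, shared pool = {pool}")
```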
Note how the number of cores in the shared pool has increased substantially, allowing more workloads to be consolidated onto these systems. This further decreases costs by eliminating those external systems, as well as consolidating more SAN and network cards, decreasing computer-room space and reducing energy/cooling requirements.
A reasonable question is whether these same savings would accrue to an x86 implementation. The answer is: not necessarily. Yes, fewer cores would also be required, but to take advantage of a similar type of consolidation, VMware must be employed. And if VMware is used, a host of caveats must be taken into consideration (caveat 1 is illustrated in the sketch after this list):
1) Overhead, reportedly 12% or more, must be added to the capacity requirements.
2) I/O throughput must be tested to ensure load times, log writes, savepoints, snapshots and backup speeds are acceptable to the business.
3) Limits must be understood, e.g. the maximum memory in a VM is 4TB, which means the BW instance above could not grow by even 1KB.
4) Socket isolation is required, as SAP does not permit sharing a socket in a production HANA/VMware environment. Reducing core requirements may therefore not result in fewer sockets, i.e. it may not eliminate underutilized cores in an Intel/VMware system.
5) Non-prod workloads can’t take advantage of capacity not used by production, for several reasons, not least of which is that SAP does not permit sharing of sockets between production and non-production VMs, not to mention the reluctance of many customers to mix prod and non-prod under a software hypervisor such as VMware even if SAP permitted it.
The bottom line is that most customers, through an abundance of caution or actual experience with VMware, choose to place production on bare metal and non-prod, which does not require the same stack as prod, on VMware. Workloads which do require the same stack as prod, e.g. QA, are also usually placed on bare metal. On closer evaluation, then, TDI Phase 5 will have limited benefits for x86 customers.
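To put caveat 1 in perspective, here is a minimal sketch of how the reported 12% overhead compounds the sizing, reusing the hypothetical SAPS figures from the earlier sketch:

```python
import math

VMWARE_OVERHEAD = 0.12   # reported overhead; the actual figure varies by workload

required_saps = 140_000  # hypothetical Quicksizer result, as in the earlier sketch
saps_per_core = 2_000    # assumed per-core SAPS rating of the x86 server

adjusted_saps = required_saps * (1 + VMWARE_OVERHEAD)
cores = math.ceil(adjusted_saps / saps_per_core)
print(cores)  # -> 79 cores, vs. 70 on bare metal

# Caveat 4 then rounds this up to whole sockets, since production HANA VMs
# may not share a socket; fewer cores may not mean fewer sockets.
```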
This announcement is the equivalent of finally being allowed to use 5th gear in your car after having been limited to 4 for a long time. HANA on IBM Power Systems already had the fastest adoption in recent SAP history, with roughly 950 customers selecting HANA on Power in just two years. TDI Phase 5 uniquely benefits Power Systems customers, which will continue the acceleration of HANA on Power. The individuals who recommended or made the decision to select HANA on Power will look like geniuses to their CFOs, as they will now get the equivalent of new system capacity at no cost.
Certified vs. supported HANA solutions: what’s the difference and why should you care?
There seems to be a lot of confusion about the terms “Certified” and “Supported” in the HANA context. These are not qualitative terms, but rather descriptions of how solutions are put together and delivered. Back in 2011, SAP recognized that HANA was such a new technology, with so many variables that could impact performance and support, that they asked technology vendors to design appliances which SAP could review and test to ensure all performance characteristics met SAP’s KPIs. Furthermore, with a comprehensive understanding of what was included in an appliance, SAP could offer a one-stop-shop approach to support, i.e. if a customer has a problem with a “Certified” appliance, they just call SAP, who will manage the problem and work with the technology vendor to determine where the problem lies, how to fix it and drive it to full resolution.
Sounds perfect, right? Yes … as long as you don’t need to make any modifications as business needs change. Yes … as long as you don’t mind the system running at low utilization most of the time. Yes … as long as the systems, storage and interconnects included in the “certified” solution match the characteristics you consider important, are compatible with your IT infrastructure and allow you to use the management tools of your choice.
So, what is the alternative? SAP introduced TDI, the Tailored Datacenter Integration approach. It allows customers to put together HANA environments in a more flexible manner, using a customer-defined set of components (with some restrictions) that meet SAP’s performance KPIs. What is the downside? Meeting those KPIs and resolving problems become customer responsibilities. That sounds daunting, but it is not. Fortunately, SAP doesn’t just say “go forth and put together anything you want”. Instead, they restrict servers and storage subsystems to those for which internal or external performance tests have been completed to SAP standards. This allows reasonable ratios to be derived, e.g. the memory-to-core ratio for various types of systems and HANA implementation choices. Some restrictions do apply; for example, Intel Haswell-EX environments must utilize systems which have been approved for use in appliances, and Haswell-EP and IBM Power Systems environments must use systems listed on the appropriate “supported” tabs of the official HANA infrastructure support guide.[i] Likewise, certified enterprise storage subsystems are also listed, but this does not rule out the use of internal drives for TDI solutions.
Any HANA solution, whether an appliance or a TDI-defined system, is equally capable of handling a HANA workload that falls within the maximums SAP has identified, and SAP will support HANA on any of them. Full-solution support, as mentioned previously, is a customer responsibility. Fortunately, vendors such as IBM offer one-stop-shop support packages. IBM calls its package Custom Technical Support (in the US) or Total Solution Service (outside the US). Similar to the way SAP supports an appliance, with this offering a customer need call only one number for support. IBM’s support center will then work with the customer on problem determination and problem source identification. When problems are determined to be caused by IBM, SAP or SUSE products, warm transfers are made to those groups, and the IBM support center stays engaged even after the warm transfer to ensure the problem is resolved and the resolution delivered to the customer. In addition, customers may benefit from optional proactive services (on-site or remote) that analyze the system and provide recommendations to keep it up to date, and/or perform necessary OS, firmware or hardware updates and upgrades. With these proactive offerings, customers can ensure their HANA systems are maintained in line with SAP’s planned release calendar and are fully prepared for future upgrades.
There are a couple of caveats, however. Since TDI permits the use of the customer’s preferred network and storage vendors, these support offerings typically encompass only those products that fall within the offering vendor’s warm-transfer agreements. As a result, a problem with a network switch or a third-party storage subsystem for which the proactive support vendor has no warm-transfer agreement would still be the customer’s responsibility.
So, should a customer choose a “certified” solution or a TDI supported solution? The answer depends on the scope of the HANA implementation; the customer’s existing standards, skills and desire to utilize them; the flexibility desired in resource utilization and instance placement; and, of course, cost.
Scope – If HANA is used as a side-car, a small BW environment, or perhaps for Business One, an appliance can be a viable option, especially if the HANA solution will be located somewhere without readily available local skills. If, however, the HANA environment is more complex, e.g. BW scale-out, SoH, S/4, large, etc., and located in the company’s main data centers with properly skilled staff, then a TDI supported approach may be more desirable.
Standards – Many customers have made large investments in network infrastructure, storage subsystems and the tools and skills necessary to manage them. Appliances that include components outside those standards not only bring in new devices unfamiliar to the support staff, but may also be largely invisible to the tools currently in use.
Flexibility – Appliances are well-defined, single-purpose devices. That definition includes a fixed amount of memory, CPU resources, I/O adapters and SSD and/or HDD devices: simple to order, inflexible to change. If a different amount of any of those resources is desired, in theory any change permitted by the offering vendor instantly moves the device from an SAP-supported appliance to a TDI-supported configuration, requiring the customer to accept responsibility for everything just as quickly. By comparison, TDI supported solutions start out as a customer responsibility, meaning they are tailored around the customer’s standards and can be modified as desired at any time. All that is required for support is to run SAP’s HWCCT (Hardware Configuration Check Tool) to ensure that the resulting configuration still meets all SAP KPIs. As a result, if a customer desires to virtualize, mixing multiple production, non-production or even non-SAP workloads (where supported by the chosen virtualization solution; see my recently published blog post on VMware and HANA), a TDI solution supports this by definition, vendor and technology dependent; an appliance does not. Likewise, changes in capacity, e.g. physical addition/removal of components, logical changes of capacity (often called Capacity on Demand) and VM resizing, are fully supported with TDI, but not with appliances. Once a limit is reached with an appliance, either a scale-out approach must be utilized (in the case of analytics workloads that support scale-out) or the appliance must be decommissioned and replaced with a larger one. A TDI solution with additional capacity available, or the ability to add it, gives the customer the ability to upgrade in place, thereby providing greater investment protection.
Cost – An appliance trades simplicity for potentially higher TCO, as it is designed to meet the above-mentioned KPIs without a comprehensive understanding of the workload it will actually handle, often resulting in dramatic over-capacity, e.g. dozens of HDDs just to meet disk throughput requirements. By comparison, customers with existing enterprise storage subsystems may need only a few additional SSDs and/or HDDs to meet those KPIs, with limited incremental cost in infrastructure, environmentals and support. Likewise, the ability to use fewer, potentially larger systems able to co-host production, non-prod, app servers, non-SAP VMs, HA, etc., can result in significant reductions in system footprints, power, cooling, management and associated costs.
IBM Power Systems has chosen to take a TDI-only approach as a direct result of feedback from customers, especially enterprise customers, that are used to managing their own systems, have available skills, have prevalent IT standards, etc. HANA on Power is built on virtualization, so it is, by default, a TDI-based solution. HANA on Power allows one or many HANA instances, and a mixture of prod and potentially non-prod or non-HANA workloads, to share a system; HA and DR can likewise be mixed with other prod and non-prod instances.
I am often asked about t-shirt sizes, but that question is a clear indication that the person asking has an appliance-sizing mindset. TDI architecture involves landscape sizing, encompassing all of the various components required to support the HANA and non-HANA systems, so a t-shirt size would be completely misleading unless a customer needs only a single system with no HA, no DR, no non-prod, no app servers, etc.
[i] http://global.sap.com/community/ebook/2014-09-02-hana-hardware/enEN/appliances.html