Lintel for SAP App Servers – The right choice
Or is it? Running SAP application servers on IBM Power Systems with Linux results in a lower TCA than using x86 systems with Linux and VMware. Usually, I don’t start a blog post with the conclusion, but I was so amazed by the results of this analysis that I could not help myself.
For several years now, I have seen many customers move older legacy app servers to x86 systems using Linux and VMware, as well as implement new SAP app servers on the same platform. When asked why, the answers boil down to cost, skills and standards. Historically, Lintel servers were not just perceived to cost less; the cost differences could be easily demonstrated. Students emerging from college have worked with Linux far more often than with UNIX, and despite the fact that learning UNIX, and how it is implemented in actual production environments, takes very little additional real effort or cost, the perception persists that Linux skills are more plentiful and consequently less expensive. Many companies and government entities have decided to standardize on Linux. For some legacy IBM Power Systems customers, a complicating, or perhaps compelling, factor has been that the analysis compared partitions on large enterprise class systems against low cost 2-socket x86 servers. And so, increasingly, enterprises have defaulted to Lintel as the app server of choice.
Something has changed, however, and it is completely overturning the conventional wisdom discussed above. There is a new technology which takes advantage of all of those Linux skills on the market, obeys the Linux standards mandate and costs less than virtualized Lintel systems. What is this amazing new technology? Surprise: it is a descendant of the RS/6000 technology that IBM introduced in 1990, in the form of new Linux-only POWER8 systems. (OK, since you know that I am an IBM Power guy and I gave away the conclusion at the start of this post, that was probably not much of a surprise.) At least, this is what the marketing guys have been telling us, and they have some impressive consultant studies and internal analyses that back up their claims.
For those of you who have been following this blog for a while, you know that I am skeptical of consultant analyses and even more so of internal analyses. So, instead of depending on those, I set out to prove, or disprove, this assertion. The journey began with setting reasonable assumptions. Actually, I went a little overboard and gave every benefit of the doubt to x86 and did the opposite for Power.
Overhead – The pundits, both internal and external, seem to suggest that 10% or more overhead for VMware is reasonable. Even VMware’s best practices guide for HANA suggests an overhead of 10%. However, I have heard some customers claim that 5% is possible. So, I decided to use the most favorable number and settled on 5% overhead. PowerVM does have overhead, but it is already baked into all benchmarks and sizings since it is built into the embedded hypervisor, i.e. it is there even when you are running a single virtual machine on a system.
Utilization – Many experts have suggested that average utilization of VMware systems falls in the 20% to 30% range. I found at least one analyst who said that the best run shops can drive their VMware systems up to 45%. I selected 45%, once again because I wanted to give all of the benefit of the doubt to Lintel systems. By comparison, many experts suggest that 85% utilization is reasonable for PowerVM based systems, but I selected 75% so as not to give Power any of the benefit of the doubt that I was giving to x86.
SAPS – Since we are talking about SAP app servers, it is logical to use SAP benchmarks. The best result that I could find for a 2-socket Linux Intel Haswell-EP system was posted by Dell @ 90,120 SAPS (1). A similar 2-socket server from IBM was posted @ 115,870 SAPS (2).
IBM has internal sizing tables, as does every vendor, in which it estimates the SAPS capacity of different servers running different OSs. One of those servers, the Power S822L, a 20-core Linux-only system, is estimated to attain roughly 35% fewer SAPS than the benchmark result for its slightly larger cousin running AIX; this estimate takes into consideration differences in MHz, number of cores and small differences due to the compilers used for SAP Linux binaries.
For our hypothetical comparison, let us assume that a customer needs approximately the same SAPS capacity as can be attained with three Lintel systems running VMware, including the 5% overhead mentioned above, a sustained utilization of 45% and 256GB per server. Extrapolating the IBM numbers, with no additional PowerVM overhead and a sustained utilization of 75%, results in a requirement of two S822L systems, each with 384GB.
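The capacity assumptions above can be checked with a few lines of arithmetic. This is a minimal Python sketch, not IBM's or anyone else's sizing tool: it multiplies each benchmark result by (1 - overhead) and by the sustained utilization cap, and it assumes the S822L delivers 65% of the S824 benchmark, i.e. the roughly 35% reduction mentioned above taken literally. The constant and function names are illustrative.

```python
# Benchmark results cited in the post (footnotes 1 and 2).
DELL_R730_SAPS = 90_120
IBM_S824_SAPS = 115_870
# Assumption: the S822L delivers ~35% fewer SAPS than the S824 result.
S822L_DERATE = 0.35


def effective_saps(benchmark_saps: float, overhead: float, utilization: float) -> float:
    """Usable SAPS per server after hypervisor overhead and a sustained-utilization cap."""
    return benchmark_saps * (1.0 - overhead) * utilization


# x86 + VMware: 5% hypervisor overhead, 45% sustained utilization.
x86_per_server = effective_saps(DELL_R730_SAPS, overhead=0.05, utilization=0.45)
# S822L + PowerVM: overhead already in the sizing, 75% sustained utilization.
power_per_server = effective_saps(IBM_S824_SAPS * (1.0 - S822L_DERATE), overhead=0.0, utilization=0.75)

x86_total = 3 * x86_per_server
power_total = 2 * power_per_server

print(f"x86:   {x86_per_server:,.0f} SAPS/server, {x86_total:,.0f} for 3 servers")
print(f"Power: {power_per_server:,.0f} SAPS/server, {power_total:,.0f} for 2 servers")
```

Under these assumptions, three x86 servers provide roughly 115,600 usable SAPS and two S822L servers roughly 113,000, close enough to treat as equivalent for this comparison.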
Lenovo, HP and Dell all offer easy-to-use configurators on the web. I ran through the same configuration for each: 2 @ Intel Xeon Processor E5-2699 v3 18C 2.3GHz 45MB Cache 2133MHz 145W, 16 @ 16GB x4 DIMMS, 1 @ Dual-port 10GB Base-T Ethernet adapter, 2 @ 300GB 10K RPM disk (2 @ 1TB 7200 RPM for Dell) and 24x7x4 hour 3-year warranty upgrades (3). The Lenovo and HP sites show almost identical prices for Red Hat Enterprise Linux with unlimited guests (Dell’s was harder to decipher since they apply discounts to the prices shown), so for consistency, I used the same price for RHEL, including 3-yr premium subscription and support, for all three. VMware also publishes its list prices on the web, and the same numbers were used for each system, i.e. Version 5.5, 2-socket, premium support, 3yr (4).
The configuration for the S822L was created using IBM’s eConfig tool: 2 @ 10-core 3.42 GHz POWER8 Processor Card, 12 @ 32GB x4 DIMMS, 1 @ Dual-port 10GB Base-T Ethernet adapter, 2 @ 300GB 10K RPM disk and a 24x7x4 hour 3-year warranty upgrade, RHEL with unlimited guests and 3yr premium subscription and support, and PowerVM with unlimited guests and 3yr 24×7 support (SWMA). Quick disclaimer: I am not a configuration expert with IBM’s products, much less those from other companies, which means there may be small errors, so don’t hold me to these numbers as being exact. In fact, if anyone with more expertise would like to comment on this post and provide more accurate numbers, I would appreciate that. You will see, however, that all three x86 systems fell in the same basic range, so small errors are likely of limited consequence.
The best list price among the Lintel vendors came in at $24,783 including the warranty upgrade. RHEL 7 came in at $9,259 and VMware @ $9,356, for a grand total of $43,398 per system and, for 3 systems, $130,194. For the IBM Power System, the hardware list price was $33,136 including the warranty upgrade, PowerVM for Linux $10,450 and RHEL 7 $6,895, for a grand total of $51,109 and, for 2 systems, $102,218.
So, for equivalent effective SAPS capacity, Lintel systems cost around $130K vs. $102K for Power … and this is before we consider the reliability and security advantages, not to mention scalability, peak workload handling characteristics, reduced footprint, power and cooling. Just to match the list price of the Power Systems, the Lintel vendors would have to deliver a discount of roughly 22%, including RHEL and VMware.
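The price comparison and the break-even discount can be reproduced the same way. The list prices below are taken from the text. The three Power line items as listed add up to $50,481 per system, a little below the quoted $51,109 grand total, so this sketch computes the break-even discount from both figures. The function name breakeven_discount is just an illustrative label.

```python
# Per-system list prices quoted in the post (USD).
LINTEL_ITEMS = {"hardware + warranty": 24_783, "RHEL 7": 9_259, "VMware": 9_356}
POWER_ITEMS = {"hardware + warranty": 33_136, "PowerVM for Linux": 10_450, "RHEL 7": 6_895}
POWER_QUOTED_PER_SYSTEM = 51_109  # grand total as quoted in the post

lintel_total = 3 * sum(LINTEL_ITEMS.values())      # three x86 servers
power_total_items = 2 * sum(POWER_ITEMS.values())  # two S822L, summed from line items
power_total_quoted = 2 * POWER_QUOTED_PER_SYSTEM   # two S822L, quoted total


def breakeven_discount(x86_cost: float, power_cost: float) -> float:
    """Fractional discount off the x86 list price needed to match the Power list price."""
    return 1.0 - power_cost / x86_cost


print(f"Lintel, 3 systems: ${lintel_total:,}")
for label, power_total in (("quoted", power_total_quoted), ("line items", power_total_items)):
    pct = breakeven_discount(lintel_total, power_total)
    print(f"Power, 2 systems ({label}): ${power_total:,} -> break-even discount {pct:.1%}")
```

Either way, the x86 bundle would need a discount of about 21.5% to 22.5% off list just to reach parity with Power at list price.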
Conclusions:
For customers making HANA decisions, it is important to note that the app server does not go away, and SAP fully supports heterogeneous configurations, i.e. it does not matter if the app server is on a different platform or even a different OS than the HANA DB server. This means that Linux-based Power boxes are the perfect companion to HANA DB servers regardless of vendor.
For customers that are refreshing older Power app servers, the comparison can be a little more complicated. There is a reasonable case to be made for running app servers on enterprise class systems that may also house database servers, given higher effective utilization, higher reliability, the ability to run app servers in an IFL (Integrated Facility for Linux) at very attractive prices, increased efficiencies and improved speeds through the use of virtual Ethernet for app-to-DB communications. That said, any analysis should start with like for like, e.g. two-socket scale-out Linux servers, and then consider any additional value that can be gained through the use of AIX (with Active Memory Expansion) and/or enterprise class servers with or without IFLs. As such, this post makes the clear point that, in a worst case scenario, scale-out Linux-only Power Systems are less expensive than x86. In a best case scenario, the TCO, reliability and security advantages of enterprise class Power Systems make the value proposition of IBM Power even more compelling.
For customers that have already made the move to Lintel, the message is clear. You moved for sound economic, skills and standards based reasons. When it is time to refresh your app servers or add additional ones for growth or other purposes, those same reasons should drive you to make a decision to utilize IBM Power Systems for your app servers. Any customer that wishes to pursue such an option is welcome to contact me, your local IBM rep or an IBM business partner.
Footnotes:
1. Dell PowerEdge R730, 2 Processors / 36 Cores / 72 Threads, 16,500 Users, Red Hat Enterprise Linux 7, SAP ASE 16, SAP enhancement package 5 for SAP ERP 6.0, Intel Xeon Processor E5-2699 v3, 2.3 GHz, 262,144MB, Cert # 2014033, 9/10/2014
2. IBM Power System S824, 4 Processors / 24 Cores / 192 Threads, 21,212 Users, AIX 7.1, DB2 10.5, SAP enhancement package 5 for SAP ERP 6.0, POWER8, 3.52 GHz, 524,288MB, Cert # 2014016, 4/28/2014
3. https://www-01.ibm.com/products/hardware/configurator/americas/bhui/flowAction.wss?_eventId=launchNIConfigSession&CONTROL_Model_BasePN=5462AC1&_flowExecutionKey=_cF5B38036-BD56-7C78-D1F7-C82B3E821957_k34676A10-590F-03C2-16B2-D9B5CE08DCC9
http://configure.us.dell.com/dellstore/config.aspx?c=us&cs=04&fb=1&l=en&model_id=poweredge-r730&oc=pe_r730_1356&s=bsd&vw=classic
http://h71016.www7.hp.com/MiddleFrame.asp?view=std&oi=E9CED&BEID=19701&SBLID=&AirTime=False&BaseId=45441&FamilyID=3852&ProductLineID=431
vSphere 5.0 compared to PowerVM
Until recently, VMware partitions suffered from a significant scalability limitation: with vSphere 4.1 Enterprise Edition, each partition could scale to a maximum of 8 virtual processors (vp). For many customers and uses, this did not pose much of an issue, as some of the best candidates for x86 virtualization are the thousands of small, older servers which can easily fit within a single core of a modern Intel or AMD chip. For SAP customers, however, the story was often quite different. Eight vp do not equate to 8 cores; they equate to 8 processor threads. Starting with Nehalem, Intel offered HyperThreading, which allows each core to run two different OS threads simultaneously. This feature boosted throughput, on average, by about 30%, and just about all benchmarks since that time have been run with HyperThreading enabled. Although it is possible to disable it, few customers elect to do so as it removes that 30% increased throughput from the system. With HyperThreading enabled, 8 VMware vp utilize 4 cores/8 threads, which can be as little as 20% of the cores on a single chip. Put in simple terms, this can be as little as 5,000 SAPS, depending on the version and MHz of the chip. Many SAP customers routinely run their current application servers at 5,000 to 10,000 SAPS, meaning that moving these servers to VMware partitions would result in the dreaded hotspot, i.e. bad performance and a flood of calls to the help desk. By comparison, PowerVM (IBM’s Power Systems virtualization technology) partitions may scale as large as the underlying hardware, and if that limit is reached, may be migrated live to a larger server, assuming one exists in the cluster, where the partition continues to operate without interruption with a much higher maximum partition size.
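To make the 8 vp ceiling concrete, here is a small Python sketch that maps virtual processors to physical cores, assuming each vp lands on one hardware thread and HyperThreading provides two threads per core. The per-core SAPS constant is hypothetical, back-derived from the "as little as 5,000 SAPS" for 4 cores mentioned above; real values vary with chip generation and MHz, and the result is a ceiling at 100% busy, not a sizing.

```python
import math

# Hypothetical per-core throughput, back-derived from "4 cores ~ 5,000 SAPS".
SAPS_PER_CORE = 1_250


def cores_for_vps(vps: int, threads_per_core: int = 2) -> int:
    """Physical cores spanned when each virtual processor maps to one hardware thread."""
    return math.ceil(vps / threads_per_core)


def vm_saps_ceiling(vps: int, saps_per_core: float, threads_per_core: int = 2) -> float:
    """Upper bound on a VM's SAPS: the cores it spans, fully busy."""
    return cores_for_vps(vps, threads_per_core) * saps_per_core


ceiling = vm_saps_ceiling(8, SAPS_PER_CORE)
print(f"8 vp with HyperThreading spans {cores_for_vps(8)} cores, ~{ceiling:,.0f} SAPS at 100% busy")
for demand in (5_000, 7_500, 10_000):
    print(f"app server needing {demand:,} SAPS uses {demand / ceiling:.0%} of that ceiling")
```

An app server in the 5,000 to 10,000 SAPS range would need 100% to 200% of that ceiling, which is the hotspot described above.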
VMware recently introduced vSphere 5.0. Among a long list of improvements is the ability to utilize 32 vp for a single partition. On the surface, this would seem to imply that VMware can scale to meet all but a very few large demands. Once you dig deeper, several factors emerge. As vSphere 5.0 is very new, there are not many benchmarks and even less customer experience. There is no such thing as a linearly scalable server, despite benchmarks that seem to imply this, even from my own company. All systems have a scalability knee of the curve. While some workloads, e.g. AIM7, showed up to 7.5 times the performance with 8 vp compared to 1 vp when tested by IBM on a Xeon 5570 system with vSphere 4.0 update 1, it is worth noting that this was only achieved when no other partitions were running, which is clearly not the reason why anyone would utilize VMware. In fact, one would expect just the opposite: that CPU resources would be overcommitted to get the maximum throughput from a system. On another test, DayTrader2.0 in JDBC mode, scalability peaked at 4.67 times the performance of a single thread with 8 vp, once again while running no other VMs. It would be reasonable to assume that VMware has done some scaling optimization, but it would be premature to assume, and quite unlikely in practice, that a 32 vp VM will scale even remotely close to 4 times the performance of an 8 vp VM. When multiple VMs run at the same time, VMware overhead and thread contention may reduce effective scaling even further. For the time being, a wise customer would be well advised to wait until more evidence is presented before assuming that all scaling issues have been resolved.
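One rough way to see why 4x is unlikely is to fit Amdahl's law to the two 8 vp results quoted above and extrapolate to 32 vp. Amdahl's law is an assumed model for this illustration, not something VMware or IBM published, and it is optimistic: it assumes a fixed serial fraction and ignores contention from other VMs and hypervisor overhead, both of which would push real results lower.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup on n processors when a fraction p of the work parallelizes."""
    return 1.0 / ((1.0 - p) + p / n)


def parallel_fraction(speedup: float, n: int) -> float:
    """Invert Amdahl's law: the p that yields the observed speedup at n processors."""
    # 1/speedup = (1 - p) + p/n  =>  p = (1 - 1/speedup) / (1 - 1/n)
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


# Speedups at 8 vp relative to 1 vp, single VM on the host (from the text).
observed_at_8 = {"AIM7": 7.5, "DayTrader2.0 JDBC": 4.67}

for name, s8 in observed_at_8.items():
    p = parallel_fraction(s8, 8)
    s32 = amdahl_speedup(p, 32)
    print(f"{name}: p = {p:.3f}, 32 vp ~ {s32:.1f}x one vp, {s32 / s8:.2f}x the 8 vp VM")
```

Under that model, the AIM7 fit projects about 24.7x at 32 vp (about 3.3x the 8 vp result) and the DayTrader fit about 7.7x (about 1.65x the 8 vp result), both short of the 4x that a linear reading of 32 vp versus 8 vp would suggest.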
But this is just one issue and, perhaps, not the most important one. SAP servers are, by their very nature, mission critical. For database servers, any downtime can have severe consequences. For application servers, depending on how customers implement their SAP landscapes and the cost of downtime, some outages may not have as large a consequence. It is important to note that when an application server fails, the context for each user’s session is lost. In a best case scenario, the users can recall all necessary details to re-run the transactions in flight after logging back on to another application server. This means that the only loss is the productivity of each user multiplied by the number of users previously logged on and doing productive work on that server. Assuming 500 users and 5 minutes each to log back on and re-run their transactions from initiation through completion, this is only 2,500 minutes of lost productivity, which at a loaded cost of $75,000 per employee per year is a total loss to the company of only about $1,500 per occurrence. With one such occurrence per application server per year, this would result in $7,500 of cost over 5 years, which should be included in any comparison of TCO. Of course, this does not take into consideration any IT staff time required to fix the server, any load on the help desk to help resolve issues, nor any political cost to IT if failures happen too frequently. But what happens if the users are unable to recall all of the details necessary to re-run the transactions, or if tight integration with production requires that manufacturing be suspended until all users are able to get back to where they had been? The costs can escalate very quickly.
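The lost-productivity estimate above is simple enough to parameterize. The sketch below assumes 2,080 working hours per year (52 weeks of 40 hours) to turn the $75,000 loaded annual cost into an hourly rate; that figure is an assumption, not from the text, and the function name is illustrative.

```python
WORK_HOURS_PER_YEAR = 2_080  # assumption: 52 weeks x 40 hours


def outage_cost(users: int, minutes_per_user: float, loaded_cost_per_year: float,
                hours_per_year: float = WORK_HOURS_PER_YEAR) -> float:
    """Lost productivity for one app server failure: total user-hours times the hourly loaded cost."""
    hourly_rate = loaded_cost_per_year / hours_per_year
    lost_hours = users * minutes_per_user / 60
    return lost_hours * hourly_rate


per_event = outage_cost(users=500, minutes_per_user=5, loaded_cost_per_year=75_000)
events_per_year, years = 1, 5
total = per_event * events_per_year * years

print(f"per occurrence: ${per_event:,.0f}")  # prints $1,502
print(f"{years}-year total: ${total:,.0f}")  # prints $7,512
```

As in the text, this leaves out IT staff time, help desk load, and anything that happens when users cannot reconstruct their work; those would be added as separate terms.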
So, what is my point? All x86 hypervisors, including VMware 4.1 and 5.0, are software layers on top of the hardware. In the event of an uncorrectable error in the hardware, the hypervisor usually fails and, in turn, takes down all VMs that it is hosting. Furthermore, problems are not confined to the CPU; they could be caused by memory, power supplies, fans or a large variety of other components. I/O is yet another critical issue. VMware provides shared I/O resources to partitions, but it does this sharing within the same hypervisor. A device driver error, a physical card error or, in some cases, even an external error, in a cable for example, might result in a hypervisor critical error and a resulting outage. In other words, the hypervisor becomes a very large single point of failure. In order to avoid the sort of costs described above, most customers try to architect mission critical systems to reduce single points of failure, not introduce new ones.
PowerVM takes the opposite approach. First, it is implemented in hardware and firmware. As the name implies, hardware is hardened, meaning it is inherently more reliable, and far less code is required since many functions are built into the chip.
Second, PowerVM acts primarily as an elegant dispatcher. In other words, it decides which partition executes next in a given core, but then it gets out of the way and allows that partition to execute natively in that core with no hypervisor in the middle of it. This means that if an uncorrectable error were to occur, an exceedingly rare event for Power Systems due to the wide array of fault tolerant components not available in any x86 server, in most situations the error would be confined to a single core and the partition executing in that core at that moment.
Third, sharing of I/O is done through the use of a separate partition called the Virtual I/O (VIO) server. This is done to remove this code from the hypervisor, thereby making the hypervisor more resilient, and also to allow for extra redundancy. In most situations, IBM recommends that customers utilize more than one VIO server and spread I/O adapters across those servers with redundant virtual connections to each partition. This means that if an error were to occur in a VIO server, once again a very rare event, only that VIO server might fail; the other VIO servers would not fail, and there would be no impact on the hypervisor since it is not involved in the sharing of I/O at all. Furthermore, partitions would not fail since they would be multipathing virtual devices across more than one VIO server.
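The redundancy argument can be illustrated with a toy model. This sketch assumes each VIO server fails independently with the same hypothetical probability over some period, and that a multipathed partition keeps I/O as long as at least one VIO server is up. Real failures are not always independent (a shared power or firmware problem could affect both), so treat the numbers as illustrative only; the 1% figure is made up.

```python
from itertools import product


def partition_has_io(vios_up: tuple[bool, ...]) -> bool:
    """A partition multipathing across VIO servers keeps I/O while any path is up."""
    return any(vios_up)


def p_lose_io(p_fail: float, num_vios: int) -> float:
    """Probability that the partition loses all I/O, by enumerating every up/down state.

    Assumes independent failures, each with probability p_fail.
    """
    total = 0.0
    for states in product((True, False), repeat=num_vios):
        prob = 1.0
        for up in states:
            prob *= (1.0 - p_fail) if up else p_fail
        if not partition_has_io(states):
            total += prob
    return total


P_FAIL = 0.01  # hypothetical per-VIO-server failure probability
for n in (1, 2):
    print(f"{n} VIO server(s): P(partition loses all I/O) = {p_lose_io(P_FAIL, n):.4%}")
```

Under independence the enumeration reduces to p_fail ** num_vios; it is written out here to mirror the "any path up" rule rather than the closed form.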
So even if VMware can scale beyond 8vp, the question is how much of your enterprise are you ready to place on a single x86 server? 500 users? 1,000 users? 5,000 users? Remember, 500 users calling the help desk at one time would result in long delays. 1,000 at the same time would result in many individuals not waiting and calling their LOB execs instead.
In the event that this is not quite enough of a reason to select Power and PowerVM over x86 with VMware, it is worthwhile to consider the security exposure differences. This has been covered already in a prior blog entry comparing Power to x86 servers, but is worth noting again. PowerVM has no known vulnerabilities according to the National Vulnerability Database, http://nvd.nist.gov. By comparison, a search on that web site for VMware results in 119 hits. Admittedly, this includes older versions as well as workstation versions, but it is clear that hackers have historically found weaknesses to exploit. VMware has introduced vShield with vSphere 5.0, a set of technologies intended to make VMware more secure, but it would be prudent to wait and see if this closes all holes or opens new ones.
Also covered in the prior blog entry, the security of the hypervisor is only one piece of the equation. Equally or perhaps more important is the security of the underlying OSs. Here too, AIX is among the least vulnerable OSs, with Linux and Windows having an order of magnitude more vulnerabilities. Also covered in that blog entry was a discussion about problem isolation, determination and vendor ownership of problems to drive them to successful resolution. With IBM, almost the entire stack is owned by IBM and supported for mission critical computing, whereas with x86, the stack is a hodgepodge of vendors with different support agreements, capabilities and views on who should be responsible for a problem, often resulting in finger pointing.
There is no question that VMware has tremendous potential for applications that are not mission critical as well as being an excellent fit for many non-production environments. For SAP, the very definition of mission critical, a more robust, more scalable and better secured environment is needed and Power Systems with PowerVM does an excellent job of delivering on these requirements.
Oh, I did not mention cost. With the new memory based pricing model for vSphere 5.0, applications such as SAP, which demand enormous quantities of memory, may easily exceed the new limits for memory pool size, forcing the purchase of additional VMware licenses. Those extra license costs and their associated maintenance can easily add enough cost that the price differences, if there are any, between Power and x86 narrow to the point of being almost meaningless.
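To see how memory-based licensing can multiply license counts, here is a simplified Python model: each per-CPU license contributes a fixed vRAM entitlement to a pool, and enough licenses must be bought both to cover every socket and to cover the memory configured to powered-on VMs. The entitlement value and the function name are hypothetical; the real per-edition entitlements, pooling across hosts and any per-VM caps are defined in VMware's licensing terms and are not modeled here.

```python
import math


def licenses_needed(sockets: int, configured_vram_gb: float, vram_per_license_gb: float) -> int:
    """Simplified pooled-vRAM model: at least one license per socket, plus enough
    licenses that the pooled entitlement covers the configured VM memory."""
    by_memory = math.ceil(configured_vram_gb / vram_per_license_gb)
    return max(sockets, by_memory)


ENTITLEMENT_GB = 48  # hypothetical per-license entitlement, for illustration only

# A 2-socket host with little VM memory needs only the socket licenses...
print(licenses_needed(sockets=2, configured_vram_gb=64, vram_per_license_gb=ENTITLEMENT_GB))   # prints 2
# ...but a memory-hungry SAP host can need several times as many.
print(licenses_needed(sockets=2, configured_vram_gb=256, vram_per_license_gb=ENTITLEMENT_GB))  # prints 6
```

In this toy example, tripling the license count for the same 2-socket host is what drives the extra license and maintenance cost described above.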