SAPonPower

An ongoing discussion about SAP infrastructure

What should you do when your LoBs say they are not ready for S/4HANA – part 2, Why choice of Infrastructure matters

Conversions to S/4HANA usually take place over several months and involve dozens to hundreds of steps.  With proper care and planning, these projects can run on time and within budget and result in a final Go-Live that is smooth and occurs within an acceptable outage window.  Alternatively, horror stories abound of projects delayed, errors made, outages far beyond what the business considered acceptable, etc.  The choice of infrastructure to support a conversion may be the last thing on anyone's mind, but it can have a dramatic impact on achieving a successful outcome.

A conversion includes running many pre-checks[i] which can run for quite a while[ii], meaning they can drive CPU utilization to high levels for a significant duration and impact other running workloads.  As a result, consultants routinely recommend that you make a copy of any system, especially production, against which you will run these pre-checks.  SAP recommends that you run these pre-checks against every system to be converted, e.g. Development, Test, Sandbox, QA, etc.  If those systems are being used for on-going work, it may be advisable to also make copies of them and run those copies on other systems or within virtual machines which can be throttled to avoid causing performance issues with other co-resident virtual machines.

In order to find issues and correct them, conversion efforts usually involve a phased approach with multiple conversions of support systems, e.g. Dev, Test, Sandbox, QA, using a tool such as SAP's Software Update Manager with the Database Migration Option (SUM w/DMO).  One of the goals of each run is to figure out how long it will take and what actions need to be taken to ensure the Go-Live production conversion completes within the required outage window, including any post-processing, performance tuning, validation and backups.
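
To make the outage-window goal concrete, a simple habit is to keep a running budget of the measured step durations from each mock conversion and compare the total against the agreed window.  The sketch below illustrates the idea; the step names and hour values are hypothetical placeholders, not figures from SAP or any real project.

```python
# Minimal sketch: compare measured mock-conversion step durations (hours)
# against the agreed Go-Live outage window.  Step names and numbers are
# hypothetical placeholders.
mock_run_hours = {
    "SUM w/DMO downtime phase": 9.5,
    "post-processing": 2.0,
    "performance tuning checks": 1.5,
    "validation": 3.0,
    "backup": 4.0,
}

outage_window_hours = 24.0

total = sum(mock_run_hours.values())
print(f"Total estimated outage: {total:.1f} h (window: {outage_window_hours:.1f} h)")
for step, hours in sorted(mock_run_hours.items(), key=lambda kv: -kv[1]):
    print(f"  {step:<30} {hours:>5.1f} h")
if total > outage_window_hours:
    print("Over budget - target the longest steps above in the next mock run.")
else:
    print(f"Fits with {outage_window_hours - total:.1f} h of headroom.")
```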

In an attempt to keep expenses low, many customers will choose to use existing systems or VMs in addition to a new "target" system or systems if HA is to be tested.  This means that the customer's network will likely be used in support of these connections.  Taken together, the use of shared infrastructure components exposes these tests and activities to events elsewhere among those shared components.  For example, if a VM is used but not enough CPU or network bandwidth is provided, the duration of the test may extend well beyond what is planned, meaning more cost for per-hour consulting, and may not provide the insight into what needs to be fixed and how long the actual migration may take.  How about if you have plenty of CPU capacity or even a dedicated system, but the backup group decides to initiate a large database backup at the same time and on the same network that your migration test is using?  Or maybe you decide to run a test at a time that another group, e.g. operations, needs to test something that impacts it, or when new equipment or firmware is being installed and modifications to shared infrastructure are occurring, etc.  Of course, you can have good change management and carefully arrange when your conversion tests will occur, which means that you may have restricted windows of opportunity at times that are not always convenient for your team.

Let’s not forget that the application/database conversion is only one part of a successful conversion.  Functional validation tests are often required which could overwhelm limited infrastructure or take it away from parallel conversion tasks.  Other easily overlooked but critical tasks include ensuring all necessary interfaces work; that third party middleware and applications install and operate correctly; that backups can be taken and recovered; that HA systems are tested with acceptable RPO and RTO; that DR is set up and running properly also with acceptable RPO and RTO.  And since this will be a new application suite with different business processes and a brand new Fiori interface, training most likely will be required as well.

So, how can the choice of infrastructure make a difference to this almost overwhelming set of issues and requirements?  It comes down to flexibility.  Infrastructure which is built on virtualization allows many of these challenges to be easily addressed.  I will use an existing IBM Power Systems customer running Oracle, DB2 or Sybase to demonstrate how this would work.

The first issue dealt with running pre-checks on existing systems.  If those existing systems are Power Systems and enough excess capacity is available, PowerVM, the IBM virtualization hypervisor, allows a VM to be started with an exact copy of production, either passed through normal post-processing such as BDLS to create a non-production copy or cloned and placed behind a network firewall.  This VM could be fenced off logically and throttled such that production running on the same system would always be given preference for CPU resources.  By comparison, a similar database located on an x86 system would likely not be able to use this process as the database is usually running on bare-metal systems for which no VM can be created.

Alternately, for Power Systems, the exact same process could be utilized to carve out a VM on a new HANA target system and this is where the real value starts to emerge.  Once a copy or clone is available on the target HANA on Power system, as much capacity can be allocated to the various pre-checks and related tasks as needed without any concern for the impact on production or the need to throttle these processes thereby optimizing the duration of these tasks.  On the same system, HANA target VMs may be created.  As mock conversions take place, an internal virtual network may be utilized.  Not only is such a network faster by a factor of 2 or more, but it is dedicated to this single purpose and completely unaffected by anything else going on within the datacenter network.  No coordination is required beyond the conversion team which means that there is no externally imposed delay to begin a test or constraints on how long such a test may take or, for that matter, how many times such a test may be run.

The story only gets better.  Remember, SAP suggests you run these pre-checks on all non-prod landscapes.  With PowerVM, you may fire up any number of different copy/clone VMs and/or HANA VMs.  This means that PowerVM enables the system to respond to your changing requirements as you move from one phase to the next, from one instance to the next and from conversion to production, run a static validation environment while other tasks continue, conduct training classes, or run phases for several different projects at the same time.  This helps you avoid purchasing extra interim systems and lets you buy capacity when it is needed rather than significantly ahead of time, as the inflexibility of other platforms often demands.  You can even simulate an HA environment to allow you to test your HA strategy without needing a second system, up to the physical limits of the system, of course.  This is where a tool like SAP's TDMS, Test Data Migration Server, might come in very handy.

And when it comes time for the actual Go-Live conversion, the running production database VM may be moved live from the “old” system, without any downtime, to the “new” system and the migration may now proceed using the virtual, in-memory network at the fastest possible speed and with all external factors removed.  Of course, if the “old” system is based on POWER8, it may then be used/upgraded for other HANA purposes.  Prior Power Systems as well as current generation POWER8 systems can be used for a wide variety of other purposes, both SAP and those that are not.

Bottom line: The choice of infrastructure can help you eliminate external influences that cause delays and complications to your conversion project, optimize your spend on infrastructure, and deliver the best possible throughput and shortest outage window when it comes to the Go-Live cut-over.  If complete control over your conversion timeline were not enough, avoidance of delays keeps costs for non-fixed-cost resources to a minimum.  For any SAP customer not using Power Systems today, this flexibility can provide enormous benefits; however, the process of moving between systems would be somewhat different.  For any existing Power Systems customer, this flexibility makes a move to HANA on Power Systems almost a no-brainer, especially since IBM has so effectively removed TCA as a barrier to adoption.

[i] https://blogs.sap.com/2017/01/20/system-conversion-to-s4hana-1610-part-2-pre-checks/

[ii] https://uacp.hana.ondemand.com/http.svc/rc/PRODUCTION/pdfe68bfa55e988410ee10000000a441470/1511%20001/en-US/CONV_OP1511_FPS01.pdf page 19


April 26, 2017 | Uncategorized

SAP HANA on Power support expands dramatically

SAP’s release of HANA SPS11 marks a critical milestone for SAP/IBM customers. About a year ago, I wrote that there was Hope for HoP, HANA on Power.  Some considered this wishful thinking, little more than a match struck in the Windy City.  In August, that hope became a pilot light with SAP’s announcement of General Availability of Scale-up BW HANA running on the Power Systems platform.  Still, the doubters questioned whether Power could make a dent in a field already populated by dozens of x86 vendors with hundreds of supported appliances and thousands of installed customers.  With almost 1 new customer per business day deciding to implement HANA on Power since that time, the pilot light has quickly evolved into a nice strong flame on a stove.

In November 2015, SAP unleashed a large assortment of support for HoP.  First, they released first-of-a-kind support for running more than 1 production instance using virtualization on a system.[1]  For those that don't recall, SAP limits systems running HANA in production on VMware to one[2], count that as 1, total VM on the entire system.  Yes, non-prod can utilize VMware to its heart's content, but is it wise to mess with best practices and utilize different stacks for prod and non-prod, much less deal with restrictions that limit the number of vps to 64, i.e. 32 real processors not counting VMware overhead, and 1TB of memory?  Power now supports up to 4 resource pools on E870 and E880 systems and 3 on systems below this level.  One of those resource pools can be a "shared pool" supporting many VMs of any kind and any supported OS as long as none of them run production HANA instances.  Any production HANA instance must run in a dedicated or dedicated-donating partition: when production HANA needs CPU resources, it gets them without any negotiation or delay, but when it does not require all of its resources, it allows partitions in the shared pool to utilize the unused capacity.  This is ideal for HANA, which is often characterized by wide variations in load, often low utilization overall and very low utilization on non-prod, HA and DR systems, resulting in much better flexibility and resource utilization (read that as reduced cost).

But SAP did not stop there.  Right before the US Thanksgiving holiday, SAP released support for running HANA on Power with Business Suite, specifically ERP 6.0 EHP7, CRM 7.0 EHP3 and SRM 7.0 EHP3, SAP Landscape Transformation Replication Server 2.0, HANA dynamic tiering, BusinessObjects Business Intelligence platform 4.1 SP 03, HANA smart data integration 1.0 SP02, HANA spatial SPS 11 and controlled availability of BPC[3], scale-out BW[4] using the TDI model with up to 16-nodes.  SAP plans to update the application support note as each additional application passes customer and/or internal tests, with support rolling out rapidly in the next few months.

Not enough?  Well, SAP took the next step and increased the memory per core ratio on high end systems, i.e. the E870 and E880, to 50GB/core for BW workloads thereby increasing the total memory supported in a scale-up configuration to 4.8TB.[5]
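
As a quick back-of-the-envelope check, the 4.8TB figure follows directly from the new ratio if one assumes a 96-core HANA allocation; that core count is my assumption, so adjust it for your own configuration.

```python
# Back-of-the-envelope check of the scale-up limit implied by the ratio.
# The 96-core HANA allocation is an assumption, not an SAP-stated figure.
gb_per_core = 50
cores_for_hana = 96

max_scale_up_tb = gb_per_core * cores_for_hana / 1000
print(f"{gb_per_core} GB/core x {cores_for_hana} cores = {max_scale_up_tb:.1f} TB scale-up")
```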

What does this mean for SAP customers?  It means that the long wait is over.  Finally, a robust, reliable, scalable and flexible platform is available to support a wide variety of HANA environments, especially those considered to be mission critical.  Those customers that were waiting for a bet-your-business solution need wait no more.  In short order, the match jumped to a pilot light, then a flame to a full cooktop.  Just wait until S/4HANA, SCM and LiveCache are supported on HoP, likely not a long wait at this rate, and the flame will have jumped to one of those jet burners used for crawfish boiling from my old home town of New Orleans!  Sorry, did I push the metaphor too far?  🙂

 

[1] 2230704 – SAP HANA on IBM Power Systems with multiple LPARs per physical host

[2] 1995460 – Single SAP HANA VM on VMware vSphere in production

[3] 2218464 – Supported products when running SAP HANA on IBM Power Systems  and http://news.sap.com/customers-choose-sap-hana-to-run-their-business/

[4] BW Scale-out support restriction that was previously present has been removed from 2133369 – SAP HANA on IBM Power Systems: Central Release Note for SPS 09 and SPS 10

[5] 2188482 – SAP HANA on IBM Power Systems: Allowed Hardware

December 8, 2015 | Uncategorized

HoP keeps Hopping Forward – GA Announcement and New IBM Solution Editions for SAP HANA

Almost two years ago, I speculated about the potential value of a HANA on Power solution.  In June, 2014, SAP announced a Test and Evaluation program for Scale-up BW HANA on Power.  That program shifted into high gear in October, 2014 and roughly 10 customers got to start kicking the tires on this solution.  Those customers had the opportunity to push HANA to its very limits.  Remember, where Intel systems have 2 threads per core, POWER8 has up to 8 threads per core.  Where the maximum size of most conventional Intel systems can scale to 240 threads, the IBM POWER E870 can scale to an impressive 640 threads and the E880 system can scale to 1536 threads.  This means that IBM is able to provide an invaluable test bed for system scalability to SAP.  As SAP’s largest customers move toward Suite “4” HANA (S4HANA), they need to have confidence in the scalability of HANA and IBM is leading the way in proving this capability.

A Ramp-up program began in March with approximately 25 customers around the world being given the opportunity to have access to GA level code and start to build out BW POC and production environments.  This brings us forward to the announcement by SAP this week @ SapphireNow in Orlando of the GA of HANA on Power.  SAP announced that customers will have the option of choosing Power for their BW HANA platform, initially to be used in a scale-up mode and plans to support scale-out BW, Suite on HANA and the full complement of side-car applications over the next 12 to 18 months.

Even the most loyal IBM customer knows the comparative value of other BW HANA solutions already available on the market.  To this end, IBM announced new "solution editions".  A solution edition is simply a packaging of components, often with special pricing, to match expectations of the industry for a specific type of solution.  "Sounds like an appliance to me" says the guy with a Monty Python type of accent and intonation (no, I am not making fun of the English and am, in fact, a huge fan of Cleese and company).  True, if one were to look only at the headline and ignore the details.  In reality, IBM is looking toward these as starting points, not end points, and most certainly not as any sort of implied limitation.  Remember, IBM Power Systems are based on the concept of Logical Partitions using the PowerVM hypervisor.  As a result, a Power "box" is simply that, a physical container within which one or multiple logical systems reside, and the size of each "system" is completely arbitrary based on customer requirements.

So, a "solution edition" simply defines a base configuration designed to be price competitive with the industry while allowing customers to flexibly define "systems" within it to meet their specific requirements and add incremental capability above that minimum as is appropriate for their business needs.  While a conventional x86 system might have 1TB of memory to support a system that requires 768GB, leaving the rest unutilized, a Power System provides for that 768GB system and allows the rest of the memory to be allocated to other virtual machines.  Likewise, HANA is often characterized by periods of 100% utilization, in support of instantaneous response time demanded of ad-hoc queries, followed by unfathomably long periods (in computer terms) of little to no activity.  Many customers might consider this to be a waste of valuable computing resource and look forward to being able to harness this for the myriad of other business purposes that their businesses actually depend on.  This is the promise of Power.  Put another way, the appliance model results in islands of automation like we saw in the 1990s, whereas Power continues the model of server consolidation and virtualization that has become the modus operandi of the 2000s.

But, says the pitchman for a made-for-TV product, if you call right now, we will double the offer.  If you believe that, then you are probably not reading my blog.  If a product were that good, they would not have to give you more for the same price.  Power, on the other hand, takes a different approach.  Where conventional BW HANA systems offer a maximum size of 2TB for a single node, Power has no such inherent limitations.  To handle larger sizes, conventional systems must "scale-out" with a variety of techniques, potentially significantly increased costs and complexity.  Power offers the potential to simply "scale-up".  Future IBM Power solutions may be able to scale up to 4TB, 8TB or even 16TB.  In a recent post to this blog, I explained that to match the built-in redundancy for mission critical memory reliability in Power, conventional x86 systems would require memory mirroring, doubling the amount of memory with an associated increase in CPU utilization and reduction in memory bandwidth.  SAP is pushing the concepts of MCOS, MCOD and multi-tenancy, meaning that customers are likely to have even more of their workloads consolidated on fewer systems in the future.  This will result in demand for very large scaling systems with unprecedented levels of availability.  Only IBM is in position to deliver systems that meet this requirement in the near future.

Details on these solution editions can be found at http://www-03.ibm.com/systems/power/hardware/sod.html
In the last few days, IBM and other organizations have published information about the solution editions and the value of HANA on Power.  Here are some sites worth visiting:

Press Release: IBM Unveils Power Systems Solutions to Support SAP HANA
Video: The Next Chapter in IBM and SAP Innovation: Doug Balog announces SAP HANA on POWER8
Case study: Technische Universität München offers fast, simple and smart hosting services with SAP and IBM 
Video: Technische Universität München meet customer expectations with SAP HANA on IBM POWER8 
Analyst paper: IBM: Empowering SAP HANA Customers and Use Cases 
Article: HANA On Power Marches Toward GA

Selected SAP Press
ComputerWorld: IBM’s new Power Systems servers are just made for SAP Hana
eWEEK, IBM Launches Power Systems for SAP HANA
ExecutiveBiz, IBM Launches Power Systems Servers for SAP Hana Database System; Doug Balog Comments
TechEYE.net: IBM and SAP work together again
ZDNet: IBM challenges Intel for space on SAP HANA
Data Center Knowledge: IBM Stakes POWER8 Claim to SAP Hana Hardware Market
Enterprise Times: IBM gives SAP HANA a POWER8 boost
The Platform: IBM Scales Up Power8 Iron, Targets In-Memory

Also a planning guide for HANA on Power has been published at http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP102502 .

May 7, 2015 | Uncategorized

How to ensure Business Suite on HANA infrastructure is mission critical ready

Companies that plan on running Business Suite on HANA (SoH) require systems that are at least as fault tolerant as their current mission critical database systems.  Actually, the case can be made that these systems have to exceed current reliability design specifications due to the intrinsic conditions of HANA, most notably, but not limited to, extremely large memory sizes.  Other factors that will further exacerbate this include MCOD, MCOS, Virtualization and the new SPS09 feature, Multi-Tenancy.

A customer with 5TB of data in their current uncompressed Suite database will most likely see a reduction due to HANA compression (SAP note 1793345, and the HANA cookbook²) bringing their system size, including HANA work space, to roughly 3TB.  That same customer may have previously been using a database buffer of 100GB +/- 50GB.  At a current buffer size of 100GB, their new HANA system will require 30 times the amount of memory as the conventional database did.  All else being equal, 30x of any component will result in 30x failures.  In 2009, Google engineers wrote a white paper in which they noted that 8% of DIMMs experienced errors every year, with most being hard errors, and that when a correctable error occurred in a DIMM, there was a much higher chance that another would occur in that same DIMM, leading potentially to uncorrectable errors.¹  As memory technology has not changed much since then, other than getting denser, which could lead to even more likelihood of errors due to cosmic rays and other sources, the risk has likely not decreased.  As a result, unless companies wish to take chances with their most critical asset, they should elect to use the most reliable memory available.
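
To put the 30x observation in rough numbers, a simple expected-count estimate can be built from the ~8% annual per-DIMM error rate cited in the Google paper.  The DIMM size and the two memory footprints below are illustrative assumptions only, not sizing guidance.

```python
# Rough expected-error model based on the ~8% annual per-DIMM error rate
# cited above.  DIMM size and memory footprints are illustrative assumptions.
annual_error_rate_per_dimm = 0.08
dimm_size_gb = 32                      # assumed DIMM size

for total_memory_gb in (100, 3072):    # old DB buffer vs. a ~3TB HANA system
    dimms = total_memory_gb / dimm_size_gb
    expected_errors = dimms * annual_error_rate_per_dimm
    print(f"{total_memory_gb:>5} GB -> ~{dimms:.0f} DIMMs, "
          f"~{expected_errors:.1f} DIMMs with errors per year")
```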

IBM provides exactly that, the best of breed open systems memory reliability, not as an option at a higher cost, but included with every POWER8 system, from the one and two socket scale-out systems to even more advanced capabilities with the 4 & 8-socket systems, some of which will scale to 16-sockets (announced as a Statement of Direction for 2015).  This memory protection is represented in multiple discreet features that work together to deliver unprecedented reliability.  The following gets into quite a bit of technical detail, so if you don’t have your geek hat on, (mine can’t be removed as it was bonded to my head when I was reading Heinlein in 6th grade; yes, I know that dates me), then you may want to jump to the conclusions at the end.

Chipkill – Essentially a RAID like technology that spans data and ECC recovery information across multiple memory chips such that in the event of a chip failure, operations may continue without interruption.   Using x8 chips, Chipkill provides for Single Device Data Correction (SDDC) and with x4 chips, provides Double Device Data Correction (DDDC) due to the way in which data and ECC is spread across more chips simultaneously.
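For readers who want a feel for the RAID-like idea, here is a deliberately simplified toy in which data is striped across several chips plus a parity chip, so a failed chip's contents can be rebuilt from the survivors.  Real Chipkill spreads ECC symbols across DRAMs rather than using simple XOR parity, so treat this purely as a conceptual sketch.

```python
# Toy illustration of the RAID-like idea behind Chipkill: data is striped
# across several chips plus a parity chip, so one failed chip can be
# reconstructed from the rest.  This is a conceptual analogy, not real ECC.
from functools import reduce

data_chips = [0b10110010, 0b01101100, 0b11100001, 0b00011111]  # example bytes
parity_chip = reduce(lambda a, b: a ^ b, data_chips)            # XOR parity

failed_index = 2                      # pretend chip 2 fails
surviving = [b for i, b in enumerate(data_chips) if i != failed_index]
rebuilt = reduce(lambda a, b: a ^ b, surviving) ^ parity_chip

assert rebuilt == data_chips[failed_index]
print(f"Rebuilt chip {failed_index} contents: {rebuilt:#010b}")
```
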

Spare DRAM modules – Each rank of memory (4 ranks per card on scale-out systems, 8 ranks per card on enterprise systems) contains an extra memory chip.  This chip is used to automatically rebuild the data that was held, previously, on the failed chip in the above scenario.  This happens transparently and automatically.  The effect is two-fold:  One, once the recovery is complete, no additional processing is required to perform Chipkill recovery allowing performance to return to pre-failure levels; Two, maintenance may be deferred as desired by the customer as Chipkill can, yet again, allow for uninterrupted operations in the event of a second memory chip failure and, in fact, IBM does not even make a call out for repair until a second chip fails.

Dynamic memory migration and Hypervisor memory mirroring – These are unique technologies only available on IBM's Enterprise E870 and E880 systems.  In the event that a DIMM experiences errors that cannot be permanently corrected using sparing capability, the DIMM is called out for replacement.  If the ECC is capable of continuing to correct the errors, the call out is known as a predictive callout indicating the possibility of a future failure.  In such cases, if an E870 or E880 has unlicensed or unassigned DIMMs with sufficient capacity to handle it, logical memory blocks using memory from a predictively failing DIMM will be dynamically migrated to the spare/unused capacity.  When this is successful, the system can continue to operate until the failing DIMM is replaced, without concern as to whether the failing DIMM might cause any future uncorrectable error.  Hypervisor memory mirroring is a selective mirroring technology for the memory used by the hypervisor, which means that even a triple chip failure in a memory DIMM would not affect the operations of the hypervisor as it would simply start using the mirror.

L4 cache – Instead of conventional parity or ECC protected memory buffers used by other vendors, IBM utilizes special eDRAM (a more reliable technology to start with) which not only offers dramatically better performance but includes advanced techniques to delete cache lines for persistent recoverable and non-recoverable fault scenarios as well as to deallocate portions of the cache spanning multiple cache lines.

Extra memory lane – the connection from memory DIMMs or cards is made up of dozens of "lanes" which we can see visually as "pins".  POWER8 systems feature an extra lane on each POWER8 chip.  In the event of an error, the system will attempt to retry the transfer, use ECC correction and, if the error is determined by the service processor to be a hard error (as opposed to a soft/transient error), the system can deallocate the failing lane and allocate the spare lane to take its place.  As a result, no downtime is incurred and planned maintenance may be scheduled at a time that is convenient for the customer since all lanes, including the "replaced" one, are still fully protected by ECC.

L2 and L3 Caches likewise have an array of protection technology including both cache line delete and cache column repair in addition to ECC and special hardening called “soft latches” which makes these caches less susceptible to soft error events.

As readers of my blog know, I rarely point out only one side of the equation without the other and in this case, the contrast to existing HANA capable systems could not be more dramatic making the symbol between the two sides a very big > symbol; details to follow.

Intel offers a variety of protection technologies for memory but leaves the decision as to which to employ up to customers.  This ranges from “performance mode” which has the least protection to “RAS mode” which has more protection at the cost of reduced performance.

Let’s start with the exclusives for IBM:  eDRAM L4 cache with its inherent superior protection and performance over conventional memory buffer chips, dynamic memory migration and hypervisor memory mirroring available on IBM Enterprise class servers, none of which are available in any form on x86 servers.  If these were the only advantages for Power Systems, this would already be compelling for mission critical systems, but this is only the start:

Lock step – Intel includes technology similar to Chipkill in all of their chips, which they call Lock step.  Lock step utilizes two DIMMs behind a single memory buffer chip to store a 64-byte cache line + ECC data instead of the standard single DIMM to provide 1x or 2x 8-bit error detection and 8-bit error correction within a single x8 or x4 DRAM respectively (with x4 modules, this is known as Double Device Data Correction or DDDC and is similar to standard POWER Chipkill with x4 modules.)  Lock Step is only available in RAS mode, which incurs a penalty relative to performance mode.  Fujitsu released a performance white paper³ describing the results of a memory bandwidth benchmark called STREAM, in which Lock step memory ran at only 57% of the speed of performance mode memory.

Lock step is certainly an improvement over standard or performance mode in that most single device events can be corrected on the fly (and two such events serially for x4 DIMMs), but correction incurs a performance penalty above and beyond that incurred from being in Lock step mode in the first place.  After the first such failure, for x8 DIMMs, the system cannot withstand a second failure in that Lock step pair of DIMMs, and a callout for repair (read this as: make a planned shutdown as soon as possible) must be made to prevent a second and fatal error.  For x4 DIMMs, assuming the performance penalty is acceptable, the planned shutdown could be postponed to a more convenient time.  Remember, with the POWER spare DRAMs, no such immediate action is required.

Memory sparing – Since taking an emergency shutdown is unacceptable for a SoH system, Lock Step memory is insufficient since it handles only the emergency situation but does not eliminate the need for a repair action (as the POWER memory spare does), and it incurs a performance penalty due to having to "lash" together two cards to act as one (as compared to POWER, which achieves superior reliability with a single memory card).  Some x86 systems offer memory sparing in which one rank per memory channel is configured as a spare.  For instance, with the Lenovo System x x3850, each memory channel supports 3 DIMMs or ranks.  If sparing is used, the effective memory throughput of the system is reduced by 1/3 since one of every 3 DIMMs is no longer available for normal operations, and the memory that must be purchased is increased by 50%.  In other words, 1TB of usable memory requires 1.5TB of installed memory.  The downside of sparing is that it is a predictive failure technology, not a reactive one.  According to the IBM X6 Servers: Technical Overview Redbook: "Sparing provides a degree of redundancy in the memory subsystem, but not to the extent of mirroring. In contrast to mirroring, sparing leaves more memory for the operating system. In sparing mode, the trigger for failover is a preset threshold of correctable errors. When this threshold is reached, the content is copied to its spare. The failed rank is then taken offline, and the spare counterpart is activated for use."  In other words, this works best when you can see it coming, not after a part of the memory has failed.  When I asked a gentleman manning the Lenovo booth at TechEd && d-code about sparing, he first looked at me as if I had a horn sticking out of my head and then replied that almost no one uses this technology.  Now, I think I understand why.  This is a good option, but at a high cost, and it still falls short of POWER8 memory protection, which is both predictive and reactive and dynamically responds to unforeseen events.  By comparison, memory sparing requires a threshold to be reached and then enough time to be available to complete a full rank copy, even if only a single chip is showing signs of imminent failure.

Memory mirroring – This technology utilizes a complete second set of memory channels and DIMMs to maintain a second copy of memory at all times.  This allows for a chip or an entire DIMM to fail with no loss of data as the second copy immediately takes over.  This option, however, does require that you double the amount of memory in the system, utilize plenty of system overhead to keep the pairs synchronized and take away ½ of the memory bandwidth (the other half of which goes to the copy).  This option may perform better than the memory sparing option because reads occur from both copies in an interleaved manner, but writes have to occur to both synchronously.
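
Pulling the figures above into one place, the small comparison below contrasts installed memory and relative throughput for the modes discussed.  The multipliers and percentages are the ones quoted in this post (including the Fujitsu STREAM results), not new measurements; the sparing throughput entry simply restates the 1/3 reduction described above.

```python
# Comparison of the x86 memory protection modes discussed above.  The
# installed-memory multipliers and throughput percentages are taken from
# the descriptions and the Fujitsu STREAM results quoted in this post.
modes = {
    # mode:        (installed TB per TB usable, relative memory throughput)
    "performance": (1.0, 1.00),
    "lock step":   (1.0, 0.57),   # RAS mode, per the Fujitsu paper
    "sparing":     (1.5, 0.67),   # 1 of 3 DIMMs reserved, throughput down ~1/3
    "mirroring":   (2.0, 0.69),   # full second copy, per the Fujitsu paper
}

usable_tb = 1.0
for mode, (multiplier, bandwidth) in modes.items():
    print(f"{mode:<12} {usable_tb * multiplier:.1f} TB installed for "
          f"{usable_tb:.1f} TB usable, ~{bandwidth:.0%} relative throughput")
```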

Conclusions:

Memory mirroring for x86 systems is the closest option to the continuous memory availability that POWER8 delivers.  Of course, having to purchase 2TB of memory in order to have proper protection of 1TB of effective memory adds a significant cost to the system and takes away substantial memory bandwidth.  HANA utilizes memory as few other systems do.

The problem is that x86 vendors won't tell customers this.  Why?  Now, I can only speculate, but that is why I have a blog.  The x86 market is extremely competitive.  Most customers ask multiple vendors to bid on HANA opportunities.  It would put a vendor at a disadvantage to include this sort of option if the customer has not required it of all vendors.  In turn, x86 vendors don't want to even insinuate that they might need such additional protection as that would imply a lack of reliability to meet mission critical standards.

So, let’s take this to the next logical step.  If a company is planning on implementing SoH using the above protection, they will need to double their real memory.  Many customers will need 4TB, 8TB or even some in the 12TB to 16TB range with a few even larger.  For the 4TB example, an 8TB system would be required which, as of the writing of this blog post, is not currently certified by SAP.  For the 8TB example, 16TB would be required which exceeds most x86 vendor’s capabilities.  At 12TB, only two vendors have even announced the intention of building a system to support 24TB and at 16TB, no vendor has currently announced plans to support 32TB of memory.

Oh, by the way, Fujitsu, in the above referenced white paper, measured the memory throughput of a system with memory mirroring and found it to be 69% that of a performance optimized system.  Remember, HANA demands extreme memory throughput and benchmarks typically use the fastest memory, not necessarily the most reliable meaning that if sizings are based on benchmarks, they may require adjustment when more reliable memory options are utilized.  Would larger core counts then be required to drive the necessary memory bandwidth?

Clearly, until SAP writes new rules to accommodate this necessary technology or vendors run realistic benchmarks showing just how much cpu and memory capacity is needed to support a properly mirrored memory subsystem on an x86 box, customers will be on their own to figure out what to do.

That guesswork will be removed once HANA on Power GAs, as it already includes the mission critical level of memory protection required for SoH and does so without any performance penalty.

Many thanks to Dan Henderson, IBM RAS expert extraordinaire, from whom I liberally borrowed some of the more technically accurate sentences in this post from his latest POWER8 RAS whitepaper¹¹ and who reviewed this post to make sure that I properly represented both IBM and non-IBM RAS options.

¹ http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
² https://cookbook.experiencesaphana.com/bw/operating-bw-on-hana/hana-database-administration/monitoring-landscape/memory-usage/
³ http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CD0QFjAA&url=http%3A%2F%2Fdocs.ts.fujitsu.com%2Fdl.aspx%3Fid%3D8ff6579c-966c-4bce-8be0-fc7a541b4a02&ei=t9VsVIP6GYW7yQTGwIGICQ&usg=AFQjCNHS1fOnd_QAnVV6JjRju9iPlAZkQg&bvm=bv.80120444,d.aWw
¹¹ http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=WH&infotype=SA&appname=STGE_PO_PO_USEN&htmlfid=POW03133USEN&attachment=POW03133USEN.PDF#loaded.

November 19, 2014 | Uncategorized

Why SAP HANA on IBM Power Systems

This entry has been superseded by a new one: https://saponpower.wordpress.com/2014/06/06/there-is-hop-for-hana-hana-on-power-te-program-begins/

 

Before you get the wrong impression, SAP has not announced the availability of HANA on Power and in no way should you interpret this posting as any sort of pre-announcement.  This is purely a discussion about why you should care whether SAP decides to support HANA on Power.

 

As you may be aware, during SAP's announcement of the availability of HANA for Business Suite DB for ramp-up customers in early January, Vishal Sikka, Chief Technology Officer and member of the Executive Board at SAP, stated "We have been working heavily with [IBM].  All the way from lifecycle management and servers to services even Cognos, Business Intelligence on top of HANA – and also evaluating the work that we have been doing on POWER.  As to see how far we can be go with POWER – the work that we have been doing jointly at HPI.  This is a true example of open co-innovation, that we have been working on."  Ken Tsai, VP of SAP HANA product marketing, later added in an interview with IT Jungle, "Power is something that we're looking at very closely now." http://www.itjungle.com/fhs/fhs011513-story01.html.  And from Amit Sinha, head of database and technology product marketing: "[HANA] on Power is a research project currently sponsored at Hasso Plattner Institute. We await results from that to take next meaningful steps jointly with IBM."  Clearly, something significant is going on.  So, why should you care?

 

Very simply, the reasons why customers chose Power Systems (and perhaps HP Integrity and Oracle/Fujitsu SPARC/Solaris) for SAP DBs in the past, i.e. scalability, reliability, security, are just as relevant now with HANA as in the past with conventional databases, perhaps even more so.  Why more so?  Because once the promise of real time analytics on an operational database is realized, not necessarily in Version 1.0 of the product but undoubtedly in the future, the value delivered by that capability would be lost just as surely if the system were unavailable or did not respond with the speed necessary for real time analytics.

 

A little known fact is that HANA for Business Suite DB currently is limited to a single node.  This means that scale-out options, common in the BW HANA space and others, are not available for this implementation of the product.  Until scale-out becomes available, customers that wish to host large databases may require a larger number of cores than x86 vendors currently offer.

 

A second known but often overlooked fact is that parallel transactional database systems for SAP are often complex, expensive and have so many limitations that only two types of customers consider this option: those which need continuous or near continuous availability and those that want to move away from a robust UNIX solution and realize that to attain the same level of uptime as a single node UNIX system with conventional HA, an Oracle RAC or DB2 PureScale cluster is required.  Why is it so complex?  Without getting into too much detail, we need to look at the way SAP applications work and interact with the database.  As most are aware, when a user logs on to SAP, they are connecting to a unique application server and until they log off, will remain connected to that server.  Each application server is, in turn, connected to one node of a parallel DB cluster.  Each request to read or write data is sent to that node and if the data is local, i.e. in the memory of that node, the processing occurs very rapidly.  If, on the other hand, the data is on another node, that data must be moved from the remote node to the local node.  Oracle RAC and DB2 PureScale use two different approaches, with Oracle RAC using their Cache Fusion to move the data across an IP network and DB2 PureScale using Remote DMA to move the data across the network without using an IP stack, thereby improving speed and reducing overhead.  Though there may be benefits of one over the other, this posting is not intended to debate this point, but instead point out that even with the fastest, lowest overhead transfer on an Infiniband network, access to remote memory is still thousands of times slower than accessing local memory.

 

Some applications are “cluster aware”, i.e. application servers connect to multiple DB nodes at the same time and direct traffic based on data locality which can only be possible if the DB and App servers work cooperatively to communicate what data is located where.  SAP Business Suite is not currently cluster aware meaning that without a major change in the Netweaver Stack, replacing a conventional DB with a HANA DB will not result in cluster awareness and the HANA DB for Business Suite may need to remain as a single node implementation for some time.

 

Reliability and Security have been the subject of previous blog posts and will be reviewed in some detail in an upcoming post.  Clearly, where some level of outages may be tolerable for application servers due to an n+1 architecture, few customers consider outages of a DB server to be acceptable unless they have implemented a parallel cluster, and even then outages may be mitigated but are still not considered tolerable.  Also, as mentioned above, in order to achieve this, one must deal with the complexity, cost and limitations of a parallel DB.  Since HANA for Business Suite is a single node implementation, at least for the time being, an outage or security intrusion would result in a complete outage of that SAP instance, perhaps more depending on interaction and interfaces between SAP components.  Power Systems has a proven track record among Medium and Large Enterprise SAP customers of delivering the lowest level of both planned and unplanned outages and security vulnerabilities of any open system.

 

Virtualization and partition mobility may also be important factors to consider.  As all Power partitions are by their very definition "virtualized", it should be possible to dynamically resize a HANA DB partition, host multiple HANA DB partitions on the same system and even move those partitions around using Live Partition Mobility.  By comparison, an x86 environment lacking VMware or similar virtualization technology could do none of the above.  Though, in theory, SAP might support x86 virtualization at some point for production HANA Business Suite DBs, they don't currently, and there are a host of reasons why they should not, which are the same reasons why any production SAP database should not be hosted on VMware, as I discussed in my blog posting:  https://saponpower.wordpress.com/2011/08/29/vsphere-5-0-compared-to-powervm/   Lacking x86 virtualization, a customer might conceivably need a DB/HA pair of physical machines for each DB instance compared to potentially a single DB/HA pair for a Power based virtualized environment.

 

And now a point of pure speculation: with a conventional database, basis administrators and DBAs weigh off the cost/benefit of different levels in a storage hierarchy including main memory, flash and HDDs.  Usually, main memory is sized to contain upwards of 95% of commonly accessed data, with flash being used for logs and some hot data files and HDDs for everything else.  For some customers, 30% to 80% of an SAP database is utilized so infrequently that keeping aged items in memory makes little sense and would add cost without any associated benefit.  Unlike conventional DBs, with HANA, there is no choice.  100% of an SAP database must reside in memory, with flash used for logs and HDDs used for a copy of the data in memory.  Not only does this mean radically larger amounts of memory must be used, but as a DB grows, more memory must be added over time.  Also, more memory means more DIMMs with an associated increase in DIMM failure rates, power consumption and heat dissipation.  Here Power Systems once again shines.  First, IBM offers Power Systems with much larger memory capabilities but also offers Memory on Demand on Power 770 and above systems.  With this, customers can pay for just the memory they need today and incrementally and non-disruptively add more as they need it.  That is not speculation, but the following is.  Power Systems using AIX offers Active Memory Expansion (AME), a unique feature which allows infrequently accessed memory pages to be placed into a compressed pool which occupies much less space than uncompressed pages.  AIX then transparently moves pages between uncompressed and compressed pools based on page activity using a hardware accelerator in POWER7+.  In theory, a HANA DB could take advantage of this in an unprecedented way.  Where tests with DB2 have shown a 30% to 40% expansion rate (i.e. 10GB of real memory looks like 13GB to 14GB to the application), and since potentially far more of a HANA DB would have low-use patterns, it may be possible to size the memory of a HANA DB at a small fraction of the actual data size and consequently at a much lower cost, plus associated lower rates of DIMM failures, less power and less cooling.
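
Continuing the speculation, and only as arithmetic, the sketch below shows how much real memory a given in-memory data size might need if HANA could exploit AME at the 30% to 40% expansion rates observed in the DB2 tests.  Those factors are assumptions carried over from DB2, not HANA measurements.

```python
# Speculative arithmetic only: if HANA could exploit AME with the 30-40%
# expansion factors observed in DB2 tests, how much real memory might a
# given in-memory data size need?  The factors are assumptions, not
# HANA measurements.
data_size_gb = 3072                      # in-memory data to be held

for expansion_factor in (1.3, 1.4):      # 30% and 40% expansion
    real_memory_gb = data_size_gb / expansion_factor
    print(f"Expansion {expansion_factor:.1f}x: ~{real_memory_gb:,.0f} GB real memory "
          f"presents as {data_size_gb:,} GB to the application")
```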

 

If you feel that these potential benefits make sense and that you would like to see a HoP option, it is important that you share this desire with SAP as they are the only ones that can make the decision to support Power.  Sharing your desire does not imply that you are ready to pull the trigger or that you won't consider all available options, simply that you would like to get informed about SAP's plans.  In this way, SAP can gauge customer interest and you can have the opportunity to find out which of the above suggested benefits might actually be part of a HoP implementation or even get SAP to consider supporting one or more of them that you consider to be important.  Customers interested in receiving more detailed information on the HANA on Power effort should approach their local SAP Account Executive in writing, requesting disclosure information on this platform technology effort.

March 14, 2013 | Uncategorized

Oracle M9000 SAP ATO Benchmark analysis

SAP has a large collection of different benchmark suites.  Most people are familiar with the SAP Sales and Distribution (SD) 2-tier benchmark as the vast majority of all results have been published using this benchmark suite.  A lesser known benchmark suite is called ATO or Assemble-to-Order.  When the ATO benchmark was designed, it was intended to replace SD as a more "realistic" workload.  As the benchmark is a little more complicated to run and SAP Quicksizer sizings are based on the SD workload, the ATO benchmark never got much traction, and from 1998 through 2003, only 19 results were published.  Prior to September 2, 2011, this benchmark had seemed to become extinct.  On that date, Oracle and Fujitsu published a 2-tier result for the SPARC M9000 along with the predictable claim of a world record result.  Oracle should be commended for having beaten the results published in 2003.  Of course, we might want to consider that a 2-processor/12-core, 2U Intel based system of today has already surpassed the TPC-C results of a 64-core HP Itanium Superdome that "set the record" back in 2003 at a tiny fraction of the cost and floor space.

 

So we give Oracle a one-handed clap for this “accomplishment”.  But if I left it at that, you might question why I would even bother to post this blog entry.  Let’s delve a little deeper to find the story within the story.  First let me remind the reader, these are my opinions and in no way do they reflect the opinions of IBM nor has IBM endorsed or reviewed my opinions.

 

In 2003, Fujitsu-Siemens published a couple of ATO results using a predecessor of today's SPARC64 VII chip, the SPARC64 V at 1.35GHz, and SAP 4.6C.  The just published M9000 result used the SPARC64 VII at 3.0GHz and SAP EHP4 for SAP ERP 6.0 with Unicode.  If one were to divide the results achieved by both systems by the number of cores and compare them, one might find that the new results deliver only a very small increase in throughput per core, roughly 6%, over the old results.  Of course, this does not account for the changes in SAP software, Unicode or benchmark requirements.  SAP rules do not allow for extrapolations, so I will instead provide you with the data from which to make your own calculations.  100 SAPS using SAP 4.6C is equal to about 55 SAPS using Business Suite 7 with Unicode.  If you were to multiply the old result by 55/100 and then divide by the number of cores, you could determine the effective throughput per core of the old system if it were running the current benchmark suite.  I can't show you the result, but will show you the formula that you can use to determine this result yourself at the end of this posting.

 

For comparison, I wanted to figure out how Oracle did on the SD 2-tier benchmark compared to systems back in 2003.  Turns out that almost identical systems were used both in 2003 and in 2009 with the exception of the Sun M9000 which used 2.8GHz processors each of which had half of the L2 cache of the 3.0GHz system used in the ATO benchmark.  If you were to use a similar formula to the one described above and then perhaps multiply by the difference in MHz, i.e. 3.0/2.8 you could derive a similar per core performance comparison of the new and old systems.  Prior to performing any extrapolations, the benchmark users per core actually decreased between 2003 and 2009 by roughly 10%.

 

I also wanted to take a look at similar systems from IBM then and now.  Fortunately, IBM published SD 2-tier results for the 8-core 1.45GHz pSeries 650 in 2003 and for a 256-core 4.0GHz Power 795 late last year, with the SAP levels being identical to the ones used by Sun and Fujitsu-Siemens respectively.  Using the same calculations as were done for the SD and ATO comparisons above, IBM achieved 223% more benchmark users per core than they achieved in 2003 prior to any extrapolations.

 

Yes, there was no typo there.  While the results by IBM improved by 223% on a per core basis, the Fujitsu processor based systems either improved by only 9% or decreased by 10% depending on which benchmark you chose.  Interestingly enough, IBM had only a 9% per core advantage over Fujitsu-Siemens in 2003 which increased to a 294% advantage in 2009/2010 based on the SD 2-tier benchmark.

 

It is remarkable that since November 18, 2009, Oracle (Sun) has not published a single SPARC based SAP SD benchmark result while over 70 results were published by a variety of vendors, including two by Sun for their Intel systems.  When Oracle finally decided to get back into the game to try to prove their relevance despite a veritable flood of analyst and press suggestions to the contrary, rather than competing on the established and vibrant SD benchmark, they chose to stand on top of a small heap of dead carcasses to say they are better than the rotting husks upon which they stand.

 

For full disclosure, here are the actual results:

SD 2-tier Benchmark Results

Certification Date        System                                                                                # Benchmark Users               SAPS                Cert #

1/16/2003                 IBM eServer pSeries 650, 8-cores                                                  1,220                            6,130               2003002

3/11/2003                 Fujitsu Siemens Computers, PrimePower 900,  8-cores                     1,120                            5,620               2003009

3/11/2003                 Fujitsu Siemens Computers, PrimePower 900, 16-cores                    2,200                            11,080              2003010

11/18/2009               Sun Microsystems, M9000, 256-cores                                             32,000                          175,600            2009046

11/15/2010               IBM Power 795, 256-cores                                                           126,063                        688,630            2010046

 

ATO 2-tier results:

Certification Date        System                                                                     Fully Processed Assembly Orders/Hr            Cert #

3/11/2003                 Fujitsu Siemens Computers, PrimePower 900,  8-cores                6,220                                        2003011

03/11/2003               Fujitsu Siemens Computers, PrimePower 900, 16-cores               12,170                                       2003012

09/02/2011               Oracle M9000, 256-cores                                                        206,360                                     2011033

 

Formulas that you might use assuming you agree with the assumptions:

 

Performance of old system / number of cores * 55/100 = effective performance per core on new benchmark suite (EP)

 

(Performance of new system / cores ) / EP = relative ratio of performance per core of new system compared to old system

 

Improvement per core = relative ratio – 1

 

This can be applied to both the SD and ATO results using the appropriate throughput measurements.
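
For convenience, the formulas above can also be expressed as a small helper so that you can plug in numbers from the tables yourself.  The inputs in the example call are placeholders rather than values taken from the tables, and the 0.55 factor is simply the 55/100 conversion described earlier.

```python
# The formulas above, expressed as a helper so readers can plug in numbers
# from the tables themselves.  The example inputs are placeholders only.
def per_core_improvement(old_result, old_cores, new_result, new_cores,
                         conversion=0.55):
    ep = old_result / old_cores * conversion      # effective per-core on new suite
    ratio = (new_result / new_cores) / ep         # relative ratio, new vs. old
    return ratio - 1                              # improvement per core

# Example with placeholder numbers only:
print(f"Improvement per core: {per_core_improvement(6000, 8, 200000, 256):.0%}")
```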

September 9, 2011 | Uncategorized

vSphere 5.0 compared to PowerVM

Until recently, VMware partitions suffered from a significant scalability limitation. Each partition could scale to a maximum of 8 virtual processors (vp) with vSphere 4.1 Enterprise Edition. For many customers and uses, this did not pose much of an issue as some of the best candidates for x86 virtualization are the thousands of small, older servers which can easily fit within a single core of a modern Intel or AMD chip. For SAP customers, however, the story was often quite different. Eight vp does not equate to 8 cores, it equates to 8 processor threads. Starting with Nehalem, Intel offered HyperThreading which allowed each core to run two different OS threads simultaneously. This feature boosted throughput, on average, by about 30% and just about all benchmarks since that time have been run with HyperThreading enabled. Although it is possible to disable it, few customers elect to do so as it removes that 30% increased throughput from the system. With HyperThreading enabled, 8 VMware vp utilize 4 cores/8 threads which can be as little as 20% of the cores on a single chip. Put in simple terms, this can be as little as 5,000 SAPS depending on the version and MHz of the chip. Many SAP customers routinely run their current application servers at 5,000 to 10,000 SAPS, meaning moving these servers to VMware partitions would result in the dreaded hotspot, i.e. bad performance and a flood of calls to the help desk. By comparison, PowerVM (IBM’s Power Systems virtualization technology) partitions may scale as large as the underlying hardware and if that limit is reached, may be migrated live to a larger server, assuming one exists in the cluster, and the partition allowed to continue to operate without interruption and a much higher partition size capability.
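
The arithmetic behind the "as little as 5,000 SAPS" remark is straightforward; the per-core SAPS figure below is a hypothetical assumption chosen only to illustrate the point, while the 2 threads per core reflects HyperThreading as described above.

```python
# Rough arithmetic behind the "as little as 5,000 SAPS" point.  The
# per-core SAPS figure is a hypothetical assumption for illustration.
saps_per_core_ht_on = 1250        # assumed SAPS per core with HyperThreading on
vp_per_vm = 8
threads_per_core = 2              # HyperThreading: 8 vp -> 4 cores

cores_used = vp_per_vm / threads_per_core
vm_capacity_saps = cores_used * saps_per_core_ht_on
print(f"8 vp VM ~= {cores_used:.0f} cores ~= {vm_capacity_saps:,.0f} SAPS")
```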

 
VMware recently introduced vSphere 5.0. Among a long list of improvements is the ability to utilize 32 vp for a single partition. On the surface, this would seem to imply that VMware can scale to all but a very few large demands. Once you dig deeper, several factors emerge. As vSphere 5.0 is very new, there are not many benchmarks and even less customer experience. There is no such thing as a linearly scalable server, despite benchmarks that seem to imply this, even from my own company. All systems have a scalability knee of the curve. Where some workloads, e.g. AIM7, when tested by IBM showed up to 7.5 times the performance with 8 vp compared to 1 vp on a Xeon 5570 system with vSphere 4.0 update 1, it is worthwhile to note that this was only achieved when no other partitions were running, clearly not the reason why anyone would utilize VMware. In fact, one would expect just the opposite, that an overcommitment of CPU resources would be utilized to get the maximum throughput of a system. On another test, DayTrader2.0 in JDBC mode, a scalability maximum of 4.67 the performance of a single thread was reached with 8 vp, once again while running no other VMs. It would be reasonable to assume that VMware has done some scaling optimization but it would be premature and quite unlikely to assume that 32 vp will scale even remotely close to 4 times the performance of an 8 vp VM. When multiple VMs run at the same time, VMware overhead and thread contention may reduce effective scaling even further. For the time being, a wise customer would be well advised to wait until more evidence is presented before assuming that all scaling issues have been resolved.

But this is just one issue and, perhaps, not the most important one.  SAP servers are, by their very nature, mission critical.  For database servers, any downtime can have severe consequences.  For application servers, depending on how customers implement their SAP landscapes and the cost of downtime, some outages may not have as large a consequence.  It is important to note that when an application server fails, the context for each user's session is lost.  In a best case scenario, the users can recall all necessary details to re-run the transactions in flight after re-logging on to another application server.  This means that the only loss is the productivity of that user multiplied by the number of users previously logged on and doing productive work on that server.  Assuming 500 users and 5 minutes to get logged back on and the transaction initiated through to completion, this is only 2,500 minutes of lost productivity which, at a loaded cost of $75,000 per employee, is only a total loss to the company of $1,500 per occurrence.  With one such occurrence per application server per year, this would result in $6,000 of cost over 5 years and should be included in any comparison of TCO.  Of course, this does not take into consideration any IT staff time required to fix the server, any load on the help desk to help resolve issues, nor any political cost to IT if failures happen too frequently.  But what happens if the users are unable to recall all of the details necessary to re-run the transactions or what happens if tight integration with production requires that manufacturing be suspended until all users are able to get back to where they had been?  The costs can escalate very quickly.
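
The arithmetic behind the $1,500-per-occurrence estimate looks roughly like the sketch below; it assumes the $75,000 loaded cost is per year and roughly 2,080 working hours per year, both of which are my interpretations rather than figures stated in the post.

```python
# Arithmetic behind the ~$1,500-per-occurrence estimate above.  Assumes
# the $75,000 loaded cost is annual and ~2,080 working hours per year;
# both are assumptions made for this sketch.
users = 500
minutes_lost_per_user = 5
loaded_cost_per_year = 75_000
working_minutes_per_year = 2_080 * 60

cost_per_minute = loaded_cost_per_year / working_minutes_per_year
lost_minutes = users * minutes_lost_per_user
cost_per_occurrence = lost_minutes * cost_per_minute
print(f"{lost_minutes:,} lost minutes x ${cost_per_minute:.2f}/min "
      f"= ${cost_per_occurrence:,.0f} per occurrence")
```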

So, what is my point? All x86 hypervisors, including VMware 4.1 and 5.0, are software layers on top of the hardware. In the event of an uncorrectable error in the hardware, the hypervisor usually fails and, in turn, takes down all VMs that it is hosting. Furthermore, problems are not just confined to the CPU, but could be caused by memory, power supplies, fans or a large variety of other components. I/O is yet another critical issue. VMware provides shared I/O resources to partitions, but it does this sharing within the same hypervisor. A device driver error, a physical card error or, in some cases, even an external error in a cable might result in a critical hypervisor error and a resulting outage. In other words, the hypervisor becomes a very large single point of failure. In order to avoid the sort of costs described above, most customers try to architect mission critical systems to reduce single points of failure, not introduce new ones.

PowerVM takes the opposite approach. First, it is implemented in hardware and firmware. As the name implies, hardware is “hard”, meaning it is inherently more reliable, and far less code is required since many functions are built into the chip.

Second, PowerVM acts primarily as an elegant dispatcher. In other words, it decides which partition executes next in a given core, but then it gets out of the way and allows that partition to execute natively in that core with no hypervisor in the middle of it. This means that if an uncorrectable error were to occur, an exceedingly rare event for Power Systems due to the wide array of fault tolerant components not available in any x86 server, in most situations the error would be confined to a single core and the partition executing in that core at that moment.

Third, sharing of I/O is done through the use of a separate partition called the Virtual I/O (VIO) server. This is done to remove this code from the hypervisor, thereby making the hypervisor more resilient, and also to allow for extra redundancy. In most situations, IBM recommends that customers utilize more than one VIO server and spread I/O adapters across those servers, with redundant virtual connections to each partition. This means that if an error were to occur in a VIO server, once again a very rare event, only that VIO server might fail; the other VIO servers would not, and there would be no impact on the hypervisor since it is not involved in the sharing of I/O at all. Furthermore, partitions would not fail since they would be multipathing virtual devices across more than one VIO server.

So even if VMware can scale beyond 8 vp, the question is how much of your enterprise are you ready to place on a single x86 server? 500 users? 1,000 users? 5,000 users? Remember, 500 users calling the help desk at one time would result in long delays; 1,000 at the same time would result in many individuals not waiting and calling their LOB execs instead.

In the event that this is not quite enough of a reason to select Power and PowerVM over x86 with VMware, it is worthwhile to consider the differences in security exposure. This has been covered already in a prior blog entry comparing Power to x86 servers, but is worth noting again. PowerVM has no known vulnerabilities according to the National Vulnerability Database, http://nvd.nist.gov. By comparison, a search on that web site for VMware results in 119 hits. Admittedly, this includes older versions as well as workstation versions, but it is clear that hackers have historically found weaknesses to exploit. VMware has introduced vShield with vSphere 5.0, a set of technologies intended to make VMware more secure, but it would be prudent to wait and see whether this closes all holes or opens new ones.

Also covered in the prior blog entry, the security of the hypervisor is only one piece of the equation. Equally, or perhaps more, important is the security of the underlying OSs. Likewise, AIX is among the least vulnerable OSs, with Linux and Windows having an order of magnitude more vulnerabilities. Also covered in that blog was a discussion about problem isolation, problem determination and vendor ownership of problems to drive them to successful resolution. With IBM, almost the entire stack is owned by IBM and supported for mission critical computing, whereas with x86, the stack is a hodgepodge of vendors with different support agreements, capabilities and views on who should be responsible for a problem, often resulting in finger pointing.

There is no question that VMware has tremendous potential for applications that are not mission critical as well as being an excellent fit for many non-production environments. For SAP, the very definition of mission critical, a more robust, more scalable and better secured environment is needed and Power Systems with PowerVM does an excellent job of delivering on these requirements.

Oh, and I did not mention cost. With the new memory-based pricing model for vSphere 5.0, applications such as SAP, which demand enormous quantities of memory, may easily exceed the new limits on memory pool size, forcing the purchase of additional VMware licenses. Those extra license costs, and their associated maintenance, can easily add enough cost that any price difference between x86 and Power narrows to the point of being almost meaningless.
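To see how memory-based pricing can drive up license counts, here is a hedged sketch. The total vRAM, socket count and per-license vRAM entitlement are all illustrative assumptions and should be checked against VMware’s actual terms and your own landscape.

```python
# Rough estimate of how a memory-based ("vRAM pool") licensing model can add
# licenses for a memory-hungry SAP landscape. All inputs are illustrative
# assumptions; the actual per-license vRAM entitlement depends on edition and
# on VMware's pricing terms at the time.

import math

TOTAL_VRAM_CONFIGURED_GB = 2048        # assumed vRAM across all SAP VMs in the pool
VRAM_ENTITLEMENT_PER_LICENSE_GB = 96   # illustrative entitlement per CPU license
CPU_SOCKETS = 16                       # licenses already required, one per socket

licenses_by_vram = math.ceil(TOTAL_VRAM_CONFIGURED_GB / VRAM_ENTITLEMENT_PER_LICENSE_GB)
extra_licenses = max(0, licenses_by_vram - CPU_SOCKETS)

print(f"Licenses needed for the vRAM pool : {licenses_by_vram}")
print(f"Licenses needed for CPU sockets   : {CPU_SOCKETS}")
print(f"Additional vRAM-driven licenses   : {extra_licenses}")
```

Under these assumptions the memory footprint, not the socket count, determines the license bill, which is exactly the concern for large SAP instances.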

August 29, 2011 | Uncategorized | 5 Comments

IBM Power Systems compared to x86 for SAP landscapes

It seems like every other day, someone asks me to help justify why a customer should select IBM Power Systems over x86 alternatives for new or existing SAP landscapes. Here is a short summary of the key attributes that most customers require and the reasons why Power Systems excels or, conversely, where x86 systems fall short.

TCO – Total Cost of Ownership is usually at the top of everyone’s list. Often this is confused with TCA, or Total Cost of Acquisition. TCA can be very important for some individuals within customer organizations, especially when those individuals are responsible only for capital acquisition costs and not operational costs such as maintenance, power, cooling, floor space, personnel, software and other assorted costs. TCA can also be important when only capital budgets are restricted. For most customers, however, TCO is far more important. Some evaluators compare systems one for one. While this might seem to make sense, would it be reasonable to compare a pickup truck and an 18-wheeler semi? Obviously not, so to do a fair job of comparing TCO, a company must look at all aspects, purposes and effects of different choices. For instance, with IBM Power Systems, customers routinely utilize PowerVM, the IBM Power virtualization technology, to combine many different workloads, including ERP, CRM, BW, EP, SCM, SRM and other production database and application servers, high availability servers, backup/recovery servers and non-production servers, onto a single, small set of servers. While some of this is possible with x86 virtualization technologies, it is rarely done, partly due to “best practices” separation of workloads and also due to support restrictions by some software products, such as Oracle database, when used in a virtualized x86 environment. This typically results in a requirement for many more servers. Likewise, many Power Systems customers routinely drive their utilization to 80% or higher, whereas even the best x86 virtualization customers rarely reach 50% utilization. Taken together, it is very common to see 2 or 3 times the number of systems for x86 customers as for equivalently sized Power Systems customers, and those are only two of the many reasons frequently experienced by SAP customers. So, while an individual Power System might be slightly higher in cost than the equivalent x86 server, full SAP landscapes on Power Systems often require far fewer systems. Between a potentially lower cost of acquisition and the associated lower costs of management, power, cooling, floor space and, often, third party software, customers can see a significantly lower TCO with IBM Power Systems.
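To illustrate just the utilization portion of that argument, here is a simple sizing sketch. The aggregate demand and per-server capacity are made-up round numbers, and the real 2 to 3 times gap also reflects workload separation and software support restrictions, not utilization alone.

```python
# Illustrative effect of achievable utilization on server count for the same
# total SAP workload. Demand and per-server capacity are assumptions chosen
# only to show the ratio; real sizings involve far more detail.

import math

TOTAL_DEMAND_SAPS = 200_000     # assumed aggregate landscape requirement
SERVER_CAPACITY_SAPS = 40_000   # assumed capacity of one server at 100% busy

for platform, target_util in (("Power + PowerVM", 0.80), ("x86 + VMware", 0.50)):
    usable_per_server = SERVER_CAPACITY_SAPS * target_util
    servers_needed = math.ceil(TOTAL_DEMAND_SAPS / usable_per_server)
    print(f"{platform:16s}: {servers_needed} servers at {target_util:.0%} target utilization")
```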

For customers that are approaching the limits of their data centers, whether in terms of floor space, power or cooling, x86 horizontal proliferation may drive the need for a data center expansion that could cost many millions of dollars. Power Systems may help customers achieve radically higher levels of consolidation through far more advanced virtualization and much higher scalability, thereby potentially avoiding the need for that data center expansion. The savings, in this event, would make the other savings seem trivial by comparison.

Reliability – A system which is low cost but suffers relatively high numbers of outages may not be the best option for mission critical systems such as SAP. IBM Power Systems feature an impressive array of reliability technologies that are not available on any x86 system. This starts with failure detection circuitry, called First Failure Data Capture (FFDC), which is built into the entire system including the processor chips. FFDC has been offered and improved upon since the mid-90s for Power Systems and their predecessors. This unique technology captures soft and hard errors from within the hardware, allowing the service processor, standard with every system, to predict failures which could impact application availability and to take preventive action such as dynamically deallocating components, from adapter cards to memory, cache lines and even processor cores. Intel, starting with Nehalem-EX, offers Machine Check Architecture Recovery (MCA), its first version of a similar concept. As a first version, it is doubtful that it can approach the much more mature FFDC technology from IBM. Even more important is the “architecture”, which, once errors are detected, passes that information not to a service processor but to the operating system or virtualization manager, with the “option” for that software to fix the problem in the hardware. This is like your car telling you that your braking system has a problem: even if you have the mechanical ability to run advanced diagnostics, remove and replace parts, bleed the system, etc., this would involve a significant outage and most certainly could not be done on the fly. Likewise, it is extremely doubtful that Microsoft, for instance, is going to invest in software to fix a problem in an Intel processor, especially since this area is likely to keep changing and only addresses one potential aspect of reliability. Furthermore, does Microsoft actually want to take on responsibility for hardware reliability? This is just one example, of many, of capabilities that affect uptime and without which SAP systems are exposed.

Equally important is what happens if a problem does occur. Unless you are very lucky, you have experienced the Blue Screen of Death at least once, or a hundred times, in your past. This is one of those wonderful things that can occur when you don’t have a comprehensive reliability architecture such as that of IBM Power Systems. With x86 systems, essentially, the OS reports that a problem has occurred which could be related to the CPU, system hardware, OS, device driver, firmware, memory, application software, adapter cards, etc., and that your best course of action is to remove the last thing you installed and reboot your system. When you call your system vendor, they might suggest that you contact your OS vendor, which might suggest you contact your virtualization vendor, which might suggest the problem lies in your BIOS, and on and on. Who takes responsibility and ownership and drives the problem to resolution? With IBM Power Systems, IBM develops and supports its own CPU, firmware, system hardware, virtualization, device drivers, OS (assuming AIX or i for Business), memory controllers and buffer chips, and has a comprehensive set of rules and detection circuitry for third party hardware and software. This means that in the very rare event that an intermittent or hard-to-identify error occurs which is not detected and corrected automatically, IBM takes ownership and resolves the problem unless it is determined that a third party piece of hardware or software caused it. In that case, IBM works diligently with its partners to resolve the problem, aided by IBM personnel who work on site at many partner locations such as Oracle and SAP.

Security – Often an afterthought, but potentially an extremely expensive one, security should be carefully considered. PowerVM has never been successfully hacked, as noted at http://nvd.nist.gov. AIX accounts for approximately 0% of Critical and High vulnerabilities and 2% of all OS vulnerabilities, compared with 73% and 27% for Microsoft and 16% and 31% for Linux, respectively (IBM X-Force Trend Report, Mid-year 2010, http://www-935.ibm.com/services/us/iss/xforce/trendreports/). A successful hack could result in merely an inconvenience for the IT staff, the loss of systems and/or, in a worst case scenario, the theft of proprietary and/or personal data. SAP systems usually hold the crown jewels of an enterprise and should be among the best protected of any customer systems.

Bottom line – While individual x86 systems may have a lower price tag than the equivalent Power System, full SAP landscapes will often require far fewer systems with Power Systems, resulting in a lower TCO. Add to that much better reliability, fault detection, comprehensive problem resolution and ownership, and rock solid security, and the case for IBM Power Systems for SAP landscapes is pretty overwhelming.

August 15, 2011 | Uncategorized | 6 Comments