Oracle Exadata for SAP revisited
Oracle's Exadata, Exalogic and Exalytics systems have failed to take the market by storm, but that has not stopped Oracle from pushing them as much as possible at every opportunity. Recently, an SAP customer started to investigate the potential of an Exadata system for a BW environment. I was called in to explain the issues surrounding such an implementation. A couple of disclaimers before I start: I am not an Oracle expert, nor have I laid hands on an Exadata system, so what I present here is the result of my effort to get educated on this topic. Thanks go to some brilliant people at IBM who are incredible Oracle and SAP experts and whose initials are R.B., M.C., R.K. and D.R.
My first question is: why would any customer implement BW on a non-strategic platform such as Exadata when BW HANA is available? It turns out there are some reasons, albeit a bit of a stretch. Some customers may feel that BW HANA is immature and lacks the ecosystem and robust tools necessary to use it in production today. This is somewhat valid and, from my experience, many customers tend to wait a year or so after V1.0 of any product before considering it for production. That said, even prior to the GA of BW HANA, SAP reported that HANA sales were very strong, presumably for non-BW purposes. Some customers may be abandoning the V1.0 principle, which makes sense for many HANA environments where there may be no other way, or only very limited ways, of accomplishing the task at hand, e.g. COPA. The jury is still out on BW HANA, as there are valid and viable solutions today, including BW with conventional DBs and BWA. Another reason revolves around sweetheart deals, where Oracle gives 80% or larger discounts to get the first footprint in a customer's door. Of course, sweetheart deals usually apply only to the first installation, rarely to upgrades or additional systems, which may result in an unpleasant surprise at that time. Oracle has also signed a number of ULAs (Unlimited License Agreements) with some customers that include an Exadata as part of the agreement. Some IT departments have learned about this only when systems actually arrived on their loading docks, not always something they were prepared to deal with.
Besides the above, what are the primary obstacles to implementing Exadata? Most of these considerations are not limited to SAP. Let's consider them one at a time.
Basic OS installation and maintenance. It turns out that despite looking like a single system to the end user, it operates as two distinct clusters from the perspective of the administrator and DBA. One is the RAC database cluster, which involves a minimum of two servers in a quarter rack of the "EP" nodes or a full rack of "EX" nodes, and up to 8 servers in a full rack of the "EP" nodes. Each node must have not only its own copy of Oracle Enterprise Linux, but also a copy of the Oracle database software, Oracle Grid Infrastructure (CRS + ASM) and any Oracle tools that are desired, of which the list can be quite significant. The second is the storage cluster, which involves a minimum of 3 storage servers for a quarter rack, 7 for a half rack and 14 for a full rack. Each of these nodes has its own copy of Oracle Enterprise Linux and Exadata Storage software. So, for a half rack of "EP" nodes, a customer would have 4 RAC nodes and 7 storage nodes, plus 3 InfiniBand switches which may require their own unique updates. I am told that the process for applying an update is complex, manual and typically sequential. Updates typically come out about once a month, sometimes more often. Most updates can be applied while the Exadata server is up, but storage nodes must be brought down, one at a time, to apply maintenance. When a storage node is taken down for maintenance, apparently its data may not be preserved, i.e. it may be wiped clean, which means that after a patchset is applied the data must be copied back from one of its ASM-created copies.
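To give a sense of the scope involved, here is a minimal sketch that simply tallies the separately maintained software stacks in the half-rack example above (4 RAC nodes, 7 storage nodes, 3 InfiniBand switches). The groupings and labels are my own illustrative choices, not Oracle's official patch inventory.

```python
# Illustrative only: a rough tally of separately maintained software stacks in the
# half-rack example above. Labels are descriptive, not Oracle's official inventory.

half_rack = {
    "RAC node": {
        "count": 4,
        "software": ["Oracle Enterprise Linux", "Grid Infrastructure (CRS + ASM)",
                     "Oracle Database", "assorted Oracle tools"],
    },
    "storage node": {
        "count": 7,
        "software": ["Oracle Enterprise Linux", "Exadata Storage software"],
    },
    "InfiniBand switch": {
        "count": 3,
        "software": ["switch firmware"],
    },
}

total = 0
for component, info in half_rack.items():
    targets = info["count"] * len(info["software"])
    total += targets
    print(f"{component:17s}: {info['count']} x {len(info['software'])} stacks "
          f"= {targets} patch targets")

print(f"roughly {total} individual patch targets, most applied sequentially, one node at a time")
```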
The SAP Central Instance may be installed on an Exadata server, but if this is done, several issues must be considered. First, the CI must be installed on every RAC node individually. The same goes for any updates. When storage nodes are updated, the SAP/Exadata best practices manual states that the CI must be tested after the storage nodes are updated, i.e. you have to bring down the CI and consequently incur an outage of the SAP environment.
Effective vs. configured storage. Exadata offers no hardware RAID for storage, only ASM software-based RAID10, i.e. it stripes the data across all available disks and mirrors those stripes to a minimum of one other storage server, unless you are using SAP, in which case the best practices manual states that you must mirror across 3 storage servers in total. This offers effectively the same protection as RAID5 with a spare, i.e. if you lose a storage server, you can fail over access to the storage behind that server, which in turn is protected by a third server. But this comes at a cost: the effective amount of storage is 1/3 of the total installed. So, for every 100TB of installed disks, you get only 33TB of usable space, compared to RAID5 in a 6+1+1 configuration which results in 75TB of usable space. Not only is the ASM triple copy a waste of space, but every spinning disk consumes energy, creates heat which must be removed, and increases the number of potential failures which must be dealt with.
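As a quick sanity check of that arithmetic, here is a back-of-the-envelope comparison. The figures are approximate and ignore overheads such as the free space ASM keeps aside for rebalancing.

```python
# Back-of-the-envelope usable-capacity comparison for 100 TB of raw disk, using the
# two schemes discussed above. Approximate; ignores ASM rebalance/spare overheads.

raw_tb = 100

# ASM triple mirroring across three storage servers, per the SAP best-practices
# requirement described above: one usable copy out of every three written.
asm_triple_mirror_usable = raw_tb / 3          # ~33 TB

# Conventional RAID5 in a 6+1 arrangement plus a hot spare (6+1+1): six data disks
# out of every eight installed hold usable data.
raid5_6_1_1_usable = raw_tb * 6 / 8            # 75 TB

print(f"ASM triple mirror : {asm_triple_mirror_usable:.0f} TB usable from {raw_tb} TB raw")
print(f"RAID5 (6+1+1)     : {raid5_6_1_1_usable:.0f} TB usable from {raw_tb} TB raw")
```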
Single points of failure. Each storage server has not one, but over a dozen single points of failure. The InfiniBand controller, the disk controller and every single disk in the storage server (12 per storage server) represent single points of failure. Remember, data is striped across every disk, which means that if a disk is lost, the stripe cannot be repaired and another storage server must fulfill that request. No problem, as you usually have 1 or 2 other storage servers to which that data has been replicated. Well, big problem, in that the system is tuned with data striped not just across the disks within a storage server, but across all available storage servers. In other words, while a single database request might access data behind a single storage server, complex or large requests will have data spread across all available storage servers. This is terrific in normal operations, as it optimizes parallel read and write operations, but when a storage server fails and another picks up its duties, the one that picks up those duties now has twice the amount of storage to manage, resulting in more contention for its disks, cache, InfiniBand and disk controllers, i.e. the tuning for that node is pretty much wiped out until the failed storage node can be fixed.
Smart scans, not always so smart. Like many other specialized data warehouse appliance solutions, including IBM's Netezza, Exadata does some very clever things to speed up queries. For instance, Exadata uses a range "index" to describe the minimum and maximum values of each column in a table for a selected set of rows. In theory, this means that if a "where" clause requests data that is not contained in a certain set of rows, those rows will not be retrieved. Likewise, "smart scan" will retrieve only the columns that are requested, not all columns in the table, for the selected query. This sounds great, and several documents have explained when and why this works and does not work, so I will not try to do so in this document. Instead, I will point out the operational difficulties. The storage "index" is not a real index and works only with a "brute force" full table scan. It is not a substitute for an intelligent partitioning and indexing strategy. In fact, the term that Oracle uses is misleading, as it is not a database index at all. Likewise, smart scans are brute-force full table scans and don't work with indexes. This makes them useful for a small subset of queries that would normally do a full table scan. Neither of these is well suited for OLTP, as OLTP typically deals with individual rows and utilizes indexes to determine the row to be queried or updated. This means that these Exadata technologies are useful primarily for data warehouse environments, and some customers may not want to pay for acceleration that applies to only a subset of their workload.
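To make the mechanism concrete, here is a toy sketch of the min/max idea behind the storage "index": keep only the minimum and maximum of a column per region of rows, and skip a region during a full scan when the predicate cannot possibly match it. The data, region size and column are invented for illustration; this is not Oracle's implementation.

```python
# Toy illustration of a min/max range "index": regions whose [min, max] cannot
# satisfy the WHERE predicate are skipped during a full scan. All values invented.

REGION_SIZE = 4  # rows per region (real storage regions are far larger)

table = [  # pretend column "amount"
    5, 7, 9, 6,      # region 0: min 5,  max 9
    22, 25, 21, 29,  # region 1: min 21, max 29
    41, 44, 48, 40,  # region 2: min 40, max 48
]

# Build the range "index": (min, max) per region.
regions = [table[i:i + REGION_SIZE] for i in range(0, len(table), REGION_SIZE)]
range_index = [(min(r), max(r)) for r in regions]

def full_scan_with_storage_index(predicate_lo, predicate_hi):
    """Full scan that skips regions whose [min, max] cannot satisfy the predicate."""
    hits, regions_read = [], 0
    for (lo, hi), rows in zip(range_index, regions):
        if hi < predicate_lo or lo > predicate_hi:
            continue                      # whole region skipped, no I/O
        regions_read += 1                 # region must still be scanned row by row
        hits += [v for v in rows if predicate_lo <= v <= predicate_hi]
    return hits, regions_read

# WHERE amount BETWEEN 20 AND 30 -> only region 1 is read.
print(full_scan_with_storage_index(20, 30))   # ([22, 25, 21, 29], 1)

# Note: the benefit exists only on a full scan; a query already served by a real
# B-tree index on "amount" never reaches this code path at all.
```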
So, let's consider SAP BW. Customers of SAP BW may have ad-hoc queries enabled, where data is accessed in an unstructured and often poorly tuned way. For these types of queries, smart scans may be very useful. But those same customers may have dozens of reports and "canned" queries which are very specific about what they are designed to do and have dozens of well-constructed indexes to enable fast access. Those types of queries would see little or no benefit from smart scans. Furthermore, SAP offers BWA and HANA, which do an amazing job of delivering outstanding performance on ad-hoc queries.
Exadata also uses Hybrid Columnar Compression (HCC), which is quite effective at reducing the size of tables; Oracle claims about a 10-to-1 reduction. This works very well at reducing the amount of space required on disk and in the solid-state disk caches, but at a price that some customers may be unaware of. One of the "costs" is that to enable HCC, processing must be done during construction of the table, meaning that importing data may take substantially longer. Another "cost" is the voids that are left when data is inserted or deleted. HCC works best for infrequent bulk-load updates, e.g. remove the entire table and reload it with new data, not daily or more frequent inserts and deletes. In addition to the voids it leaves, for each insert, update or delete the "compression unit" (CU) must first be uncompressed and then recompressed, with the entire CU written out to disk, as the solid-state caches are for reads only. This can be a time-consuming process, once again making this technology unsuitable for OLTP, not to mention DW/BW databases with regular update processes. HCC is unique to Exadata, which means that data backed up from an Exadata system may only be recovered to an Exadata system. That is fine if Exadata is the only type of system used, but not so good if a customer has a mixed environment with Exadata for production and, perhaps, conventional Oracle DB systems for other purposes, e.g. disaster recovery.
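Here is a toy model of that update path: a single-row change forces the whole compression unit to be decompressed, modified, recompressed and rewritten. The CU size and the compression scheme (zlib over pickled rows) are invented for illustration and bear no relation to HCC's actual internals.

```python
# Toy model only: a columnar "compression unit" (CU) in which one changed row forces
# the whole unit to be decompressed, recompressed and rewritten. Not HCC internals.

import pickle
import zlib

CU_ROWS = 1000  # rows packed into one compression unit (illustrative)

def compress_cu(rows):
    return zlib.compress(pickle.dumps(rows))

def update_one_row(compressed_cu, row_number, new_row):
    # The entire CU is decompressed even though only one row changes ...
    rows = pickle.loads(zlib.decompress(compressed_cu))
    rows[row_number] = new_row
    # ... and the entire CU is recompressed and must be written back in full.
    return compress_cu(rows)

cu = compress_cu([("order", i, "loaded in bulk") for i in range(CU_ROWS)])
print(f"CU size on disk: ~{len(cu)} bytes for {CU_ROWS} rows")

# One small update touches the whole unit:
cu = update_one_row(cu, 42, ("order", 42, "changed after load"))
print("one-row update rewrote the entire compression unit")
```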
Speaking of backup, it is interesting to note that Oracle only supports their own InfiniBand-attached backup system. The manuals state that other "lightweight" backup agents are supported, but apparently third parties like Tivoli Storage Manager, EMC's Legato NetWorker or Symantec NetBackup are not considered "lightweight" and are consequently not supported. Perhaps you use a typical split-mirror or "flash" backup image that allows you to attach a static copy of the database to another system for backup purposes with minimal interruption to the production environment. This sort of copy is often kept around for 24 hours in case of data corruption, allowing for a very fast recovery. Sorry, but not only can't you use whatever storage standard you may have in your enterprise, since Exadata has its own internal storage, but you can't use that sort of backup methodology either. The same goes for DR, where you might use storage replication today. Not an option; only Oracle Data Guard is supported for DR.
Assuming you are still unconvinced, there are a few other "minor" issues. SAP is not "RAC aware", as has been covered in a previous blog posting. This means that Exadata performance is limited in two ways. First, a single RAC node represents the maximum possible capacity for a given transaction or query, since no parallel queries are issued by SAP. Second, for data requested by an OLTP transaction, such as one issued by ECC or CRM, unless the application server that is uniquely associated with a particular RAC node requests data that is hosted in that particular RAC node, the data will have to be transferred across the InfiniBand network within the Exadata system at speeds that are 100,000 times slower than local memory accesses. Exadata supports no virtualization, meaning that you have to go back to the 1990s concept of separate systems for separate purposes. While some customers may get "sweetheart" deals on the purchase of their first Exadata system, unless they are unprecedentedly brilliant negotiators, and better than Oracle at that, it is unlikely that these "sweetheart" conditions will last, meaning that upgrades may be much more expensive than the first expenditure. Next is granularity. An Exadata system may be purchased in a ¼ rack, ½ rack or full rack configuration. While storage nodes may be increased separately from RAC nodes, these upgrades are also not very granular. I spoke with a customer recently that wanted to upgrade their system from 15 cores to 16 on an IBM server. As they had a Capacity on Demand server, this was no problem. Try adding just 6.25% CPU capacity to an Exadata system when the minimum granularity is 100%! And the next level of granularity is 100% on top of the first, assuming you went from ¼ to ½ to full rack.
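To put the granularity point in numbers, here is a rough comparison. The RAC-node counts come from the configurations described earlier; the percentages are relative to the starting capacity.

```python
# Rough illustration of upgrade granularity. The Power example activates a single
# Capacity-on-Demand core; Exadata compute capacity grows only in rack-sized steps.

# Power server with Capacity on Demand: 15 active cores, activate one more.
power_cod_step = (16 - 15) / 15 * 100            # ~6.7% of current capacity (the
                                                 # added core is 6.25% of the
                                                 # resulting 16-core system)

# Exadata database-server counts per configuration, as described above.
rac_nodes = {"quarter rack": 2, "half rack": 4, "full rack": 8}
quarter_to_half = (rac_nodes["half rack"] - rac_nodes["quarter rack"]) / rac_nodes["quarter rack"] * 100
half_to_full = (rac_nodes["full rack"] - rac_nodes["half rack"]) / rac_nodes["half rack"] * 100

print(f"Power CoD upgrade     : +{power_cod_step:.1f}%")
print(f"Exadata quarter->half : +{quarter_to_half:.0f}%")
print(f"Exadata half->full    : +{half_to_full:.0f}%")
```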
Also consider best practices for high availability. Of course, we want redundancy among nodes, but we usually want to separate components as much as possible. Many customers that I have worked with place each node of an HA cluster in separate parts of their datacenter complex, often in separate buildings on their campus, or even with geographic separation. A single Exadata system, while offering plenty of internal redundancy, does not protect against the old "water line" break, a fire in that part of the datacenter, or someone hitting the big red button. Of course, you can add that protection by adding another ¼ or larger Exadata rack, but that comes with more storage that you may or may not need and a mountain of expensive software. Remember, when you utilize conventional HA for Oracle, Oracle's terms and conditions allow your Oracle licenses to transfer, temporarily, to the backup server so that additional licenses are not required. No such provision exists for Exadata.
How about test, dev, sandbox and QA? Well, either you create multiple separate clusters within each Exadata system, each with a minimum of 2 RAC nodes and 3 storage nodes, or you have to combine different purposes and share environments that your internal best practices suggest should be separated. The result is that either multiple non-prod systems or larger systems with considerable excess capacity may be required. Costs, of course, go up proportionately, or worse, may not be part of the original deal and may receive a different level of discount. Compare this to a virtualized Power Systems box, which can host partitions for dev, test, QA and DR replication servers simultaneously and without the need for any incremental hardware, beyond memory perhaps. In the event of a disaster declaration, capacity is automatically shifted toward production, but dev, test and QA don't have to be shut down unless the memory for those partitions is needed for production. Instead, those partitions simply get the "left over" cycles that production does not require.
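For readers unfamiliar with that behaviour, here is a minimal sketch of the idea: uncapped shared-processor partitions receive their guaranteed entitlement first and then split whatever capacity is left in proportion to a weight. This is a simplified model of how a shared processor pool behaves, not the actual PowerVM scheduler, and the partition names, weights and demands are invented for the example.

```python
# Simplified model only: entitlement first, then leftover pool capacity split by
# uncapped weight. Partition names, weights and demands are invented.

POOL_CORES = 16

partitions = {
    # name          guaranteed entitlement, uncapped weight, current demand (cores)
    "production": {"entitled": 8, "weight": 200, "demand": 14},
    "qa":         {"entitled": 2, "weight": 50,  "demand": 4},
    "dev_test":   {"entitled": 2, "weight": 10,  "demand": 4},
}

def allocate(parts, pool):
    """Give each partition its entitlement first, then split the rest by weight."""
    alloc = {n: min(p["entitled"], p["demand"]) for n, p in parts.items()}
    leftover = pool - sum(alloc.values())
    hungry = {n: p for n, p in parts.items() if p["demand"] > alloc[n]}
    total_weight = sum(p["weight"] for p in hungry.values()) or 1
    for n, p in hungry.items():
        extra = leftover * p["weight"] / total_weight
        alloc[n] = min(p["demand"], alloc[n] + extra)
    return alloc

print({n: round(v, 2) for n, v in allocate(partitions, POOL_CORES).items()})
# Production's high weight pulls most of the spare capacity toward it; dev/test and
# QA keep running on whatever production does not consume.
```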
Bottom line: Exadata is largely useful only for infrequently updated DW environments, not the typical SAP BW environment; provides acceleration for only a subset of typical queries; is not useful for OLTP such as ECC and CRM; is inflexible, lacking virtualization and offering poor granularity; can be very costly once a proper HA environment is constructed; requires non-standard and potentially duplicative backup and DR environments; is a potential maintenance nightmare; and is not strategic to SAP.
I welcome comments and will update this posting if anyone points out any factual errors that can be verified.
I just found a blog that has a very detailed analysis of the financials surrounding Exadata. It is interesting to note that the author came to similar conclusions as I did, albeit from a completely different perspective. http://wikibon.org/wiki/v/The_Limited_Value_of_Oracle_Exadata
So why are there so many customers running SAP?
Would you mind restating the question? In all of my postings, I have not said anything about whether customers should run SAP or not, simply what infrastructure or database solution might be most beneficial to them. I make my living selling systems for SAP, so I hope that SAP continues to sell more and more solutions to customers.
Hi,
You need to make a distinction between RAC and Exadata. By making baseless claims that SAP will not benefit from RAC, you are making a mockery of the many SAP on Power sites that are running SAP on IBM AIX + Oracle RAC successfully and happily.
Please target your comments more accurately. Do not throw the baby out with the bathwater.
You are correct in the distinction that Exadata is different from RAC, in that Exadata uses RAC only for the compute nodes, not for the storage nodes, and includes a number of other features unique to Exadata. Regarding "baseless claims", as you did not note any, it is unclear what you are referring to. Exadata most definitely presents the database to the external world, including to application servers, as a RAC cluster. SAP application servers can be associated with only one RAC node at a time. As a result, all requests are sent to that node. As the DB is actually a RAC cluster, any data not present in the memory of that node will be accessed from another node, if present in its memory, or from the disk subsystem. Transfers across the Exadata InfiniBand network have a lower latency than the 10Gb Ethernet network commonly used for RAC, but any network transfer is going to be orders of magnitude slower than access to local memory. As a result, "successful" SAP RAC implementations utilize one of two methodologies: 1) an active/passive arrangement where the second node is used essentially for very fast HA failover, or 2) very careful assignment of users to application servers and/or separation of dialog and batch nodes, where most data accessed is common to that group of users or batch programs, so as to minimize cross-talk across the RAC network. The first requires little tuning, but the benefit is limited to failover. The second is very complex and may still end up not delivering the benefits expected since, inevitably, data or locks often end up in different nodes than would be optimal. In addition, if a failure of a node occurs, all of the excellent tuning that might have been done may be rendered ineffective. While SAP does not publish this sort of data, several sources have told me that the number of RAC installations for SAP numbers fewer than 100 across all sizes of customers and geographies.
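To put rough numbers behind the "orders of magnitude" statement, here is an illustrative calculation. Both latency figures are assumed values chosen for the sake of the comparison, not measurements of any particular system.

```python
# Illustrative latency arithmetic only; both figures below are assumptions.

LOCAL_DRAM_ACCESS_NS = 100        # assume ~100 ns for a local memory access
REMOTE_BLOCK_NS      = 500_000    # assume ~0.5 ms for a block shipped between RAC
                                  # nodes over the interconnect, software stack included

ratio = REMOTE_BLOCK_NS / LOCAL_DRAM_ACCESS_NS
print(f"remote block transfer ~{ratio:,.0f}x slower than local memory access")
# => thousands of times slower, i.e. several orders of magnitude, even on a
#    low-latency fabric such as InfiniBand.
```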
There are many factual errors in this document, for example:
1) The storage server is wiped out when being updated. This is not correct. In fact, only the changes are resynchronised when the patched storage node comes back up.
2) Exadata installation is done by Oracle, hence there is no job for the customer. You can find YouTube videos where setting up a full Exadata system takes less than 8 hours from the time the box hits the loading dock. The OS is preinstalled and the database software is installed by the Oracle engineers.
3) Oracle provides premium support, via which they will do the updating and patching of your systems for you.
4) ASM is not RAID but SAME (stripe and mirror everything); it provides mirroring of stripes in a database-aware fashion. Thus the performance of ASM is far better than RAID5.
5) NetBackup is supported with Exadata, and you can find a whitepaper from Symantec on the internet.
6) The InfiniBand network is actually as fast as the PCI bus within a single chassis, hence there is negligible wastage due to block pings. You can check the Oracle AWR reports for cache fusion statistics.
7) DBRM and IORM allow the same Exadata rack to be shared without the need for virtualization, and provide capabilities similar to IBM LPARs in terms of resource sharing.
8) A storage server is not a single point of failure, since the loss of a storage server is not a failure. Within a storage server, the loss of any disk is not a failure. The Exadata system already keeps backups of ASM disks, and hence the loss of a disk and its regeneration should not cause additional stress on the system.
Thank you for your comments. I appreciate it when readers raise issues that test the limits of my knowledge and therefore encourage me to analyze what I have written and do additional research when necessary. I will respond to each concern in the same order as raised.
1) Both of us are correct. In some situations, a patch may affect only CPU or flash memory algorithms, in which case a Fast Mirror Resync may work. This assumes, of course, that the patch has taken less time than the timeout specified in disk_repair_time. In other situations, it is possible that the update affects the way in which disks are accessed or stripes are laid out, in which case a full resync may be required.
2) The point that I was attempting to make was more focused on the effort of managing the system, not installing it. By the way, it may be "easy" to install a system, but configuring it for SAP is an entirely different matter. It is a RAC cluster, and RAC clusters take special handling with SAP.
3) Yes, you can hire a company to do all of the work for you, but many customers do not, as the "premium" service is not cheap.
4) I did not refer to the performance of ASM in my blog posting. Striping almost always provides faster access than RAID5, but my point was that the ASM method takes far more disk space. By the way, RAID 0+1 is also "stripe and mirror everything", so I think the difference you are pointing to is that ASM does not use hardware RAID, a point on which we are in violent agreement.
5) Agreed, NetBackup is now supported (but was not at the time that I wrote this blog posting), but it uses an external Ethernet network interface, not the InfiniBand network. Using the network is far more resource intensive and takes radically longer than creating a flash copy or similar mirrored image, which can be disconnected from one system and connected to another so that the backup can be taken at leisure with no impact on the production system, and which allows an extremely fast restore in the event it is required.
6) InfiniBand is indeed very fast, but it is orders of magnitude slower than memory access speeds. A single-image system will therefore offer much faster access to all memory, not just the memory contained within a given node.
7) There are only three ways in which an Exadata rack may be shared: 1) by allocating nodes to different RAC clusters, with a minimum of 2 RAC nodes and 3 storage nodes per cluster; 2) by starting up multiple RAC instances on each compute node in a cluster, thereby sharing the OS; and 3) by having more than one database instance within a RAC instance. Each has repercussions. In the first case, there is no sharing at all, with dedicated resources for each cluster, resulting in very inefficient use of resources. In the second and third, all DB instances share the same set of nodes and, consequently, OS instances. This is indeed sharing, but it means that any time updates are applied, all must be QA'd and upgraded simultaneously, not the "one at a time" approach favored by many if not most customers. None of these is even remotely similar to LPARs, in which each LPAR has its own OS, but can share CPU resources with all other LPARs on a given server based on priorities and entitlements, can share I/O and network via one or more Virtual I/O Servers, can even share memory via Active Memory Sharing, and can be transparently moved to another server via Live Partition Mobility. Not only do LPARs offer radically more flexibility and resource sharing, but the lack of a shared OS means that you can upgrade one DB instance at a time.
8) Agreed, and as also pointed out in my posting, ASM requires two copies (normal redundancy) or three (high redundancy, which is required by the SAP Best Practices guide), so the loss of one mirror does not result in an outage. But a disk failure absolutely causes a storage node to fail, as all disks in a storage node are considered to be part of the same "failure group". As you pointed out earlier, ASM utilizes stripes. The loss of a disk means that the stripe is missing a piece. In theory, you could rebalance that stripe among the surviving disks and return the node to operation, but it might be operationally a poor choice, since you really would want to take that node offline and repair the failing disk anyway, not to mention that the layouts of the different storage nodes would no longer be exact copies of each other.