Exadata Patching – Best Practices and Lessons Learned
Updated: Aug 19, 2019
“With Great Power Comes Great Responsibility”
One of the biggest ongoing responsibilities that follows the commissioning of an Exadata appliance is keeping the firmware and software of its various components up to date. As you have probably guessed, we are talking about patching an Exadata appliance.
An Exadata appliance has three layers that require software maintenance: the Exadata storage servers, which occupy the bottom and top sections of the rack, the compute/database nodes, and the InfiniBand switches, as depicted in the image below.
Take the case of a recent client whose business-critical application databases all ran on Exadata machines, across a range of environments from Sandpit through End to End to Production. The configurations varied from a Quarter Rack of Exadata X4-8 in Sandpit to a Full Rack of Exadata X5/6-2 in Production. We successfully patched these Exadata machines to the latest patchset levels approved by the business.
Meticulous planning when patching these components is important to ensure a low-risk change: one that maintains continuity of service, is as time-efficient as possible, and produces a predictable outcome, namely a successful patching run.
Based on our years of experience, some of the best practices we employ are as follows:
1. Patching Approach: Clearly delineate the components that can be patched online from those that require downtime, e.g. online patching for storage cells and InfiniBand switches, and an outage for the DB node OS and Grid Infrastructure.
2. Patch Staging Area: Set up a standard, uniform patch staging area on the first compute node of each Exadata machine being patched, containing the patches for the various components.
3. Proactive SR: Always open a proactive SR with Oracle well in advance, sharing your patching schedule, your patching procedure, precheck output such as Exachk reports and patchmgr precheck reports, and any issues encountered. A proactive SR also ensures that Oracle support personnel have been pre-allocated and are on standby while you patch your Exadata machine.
4. Integrated Lights-out Management (ILOM) Access: Always ensure ILOM access is enabled for each component being patched, i.e. storage cells, InfiniBand switches and compute nodes.
5. SSH Passwordless Access: Ensure that passwordless SSH works from the first compute node to each component being patched.
6. Exachk Reports: Ensure the latest version of the Exachk utility has been used and that any health issues it reports have been carefully reviewed, discussed with Oracle support and fixed (where applicable) before you proceed with patching.
7. Preparing Exadata Components: Ensure that a patchmgr pre-check is run, and completes without any issues, before the actual patching of each component (cell servers, InfiniBand switches and compute nodes).
8. Maintenance Mode: Ensure that the component being patched is in maintenance mode, i.e. under a blackout, to avoid unwanted notifications and repeated alerts during patching.
9. Component Patching: Only after points 1 to 8 above have been carried out should the actual patching of the Exadata component begin. Patch according to the agreed patching approach, using the patchmgr utility in rolling or non-rolling fashion as approved by the business.
10. Review Patching Outcome: Carefully review all patchmgr run output and logs. Report any unexpected errors and deviations to Oracle support using the Proactive SR.
11. Post Patching Checks: Crosscheck the imageinfo version of the various components patched for the Exadata appliance.
Run a health check with the Exachk utility and carefully compare the resulting report with the one taken right before patching. Report any concerns to Oracle Support.
12. End Maintenance Mode: Ensure that maintenance mode/blackout has been ended immediately post patching.
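The ordering in the list above (prechecks first, patching only under a blackout, blackout always ended afterwards) can be sketched in a few lines of Python. This is a minimal illustration of the gating logic, not a patching tool: the function name run_patch_cycle and every callable passed to it are hypothetical placeholders for whatever your own runbook uses to run Exachk, the patchmgr precheck, the blackout and the patch itself.

```python
from typing import Callable, List, Tuple

# A named precheck: (label, callable returning True on success).
Step = Tuple[str, Callable[[], bool]]


def run_patch_cycle(
    prechecks: List[Step],
    patch: Callable[[], bool],
    start_blackout: Callable[[], None],
    end_blackout: Callable[[], None],
) -> Tuple[bool, List[str]]:
    """Run prechecks in order, then patch under a blackout.

    Returns (success, log). Stops at the first failed precheck without
    starting a blackout, and always ends a blackout that was started,
    even if the patch step fails or raises.
    """
    log: List[str] = []
    for name, check in prechecks:
        if not check():
            log.append(f"FAILED precheck: {name}")
            return False, log
        log.append(f"ok: {name}")

    start_blackout()
    log.append("blackout started")
    try:
        ok = patch()
        log.append("patch succeeded" if ok else "patch FAILED")
        return ok, log
    finally:
        # Mirrors point 12: never leave the component in maintenance mode.
        end_blackout()
        log.append("blackout ended")


if __name__ == "__main__":
    # Hypothetical stand-ins; real steps would call your own tooling.
    success, messages = run_patch_cycle(
        prechecks=[("passwordless ssh", lambda: True),
                   ("patchmgr precheck", lambda: True)],
        patch=lambda: True,
        start_blackout=lambda: None,
        end_blackout=lambda: None,
    )
    for message in messages:
        print(message)
```

Putting the blackout end in a finally clause is the design choice worth noting: a failed or interrupted patch run is exactly the situation in which it is easiest to forget that monitoring is still suppressed.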
An Exadata patching cycle usually teaches you a great deal about how the different components of an Exadata machine behave, or may behave, under varying environmental factors, and about the practices required when commissioning them.
We also have some lessons to share from our own experience, listed below:
FAILURE DURING COMPUTE NODE OS PATCHING:
While patching the OS on the compute nodes of an Exadata X4-8 machine in an End to End environment, we encountered a fatal timeout during the actual patchmgr run, while it was updating libraries at the OS level.
In the hours that followed, the immediate need was to roll back the patch and return the DB node OS to its pre-patching image version so that the disrupted application and database services could be restored. We then faced another setback: the patch had been only partially applied, and the patchmgr utility’s ‘rollback’ option could not roll it back.
Over the next few days, after many hours of consultation with Oracle support experts, the cause was identified: a custom file system layout for “/var” on the Exadata machine had caused the backups initially taken by the patchmgr utility to be overwritten during the actual patching cycle.
The customised file system layout on the compute nodes looked something like the following:
A file system standardization activity was then carried out on the compute nodes of the Exadata machine, and the custom layout was merged into the root file system “/”. The correct file system layout then appears as follows:
Subsequent patching and rollback attempts on the compute node succeeded.
Always be cautious of any customised file system (FS) layouts you may have on your compute nodes, as they can cause the patchmgr utility to behave abnormally and lead to failures.
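A quick way to spot this kind of customisation before a patching window is to look for system directories that are mounted as file systems of their own. The Python sketch below scans text in the Linux /proc/mounts format and reports watched directories that appear as separate mount points. It assumes, based on the incident described above, that a separately mounted “/var” is the layout to look for; the function name, the default watch list and the device names in the sample are illustrative and not taken from the affected machine.

```python
from pathlib import Path
from typing import Iterable, List


def find_separate_mounts(
    mounts_text: str,
    watched: Iterable[str] = ("/var",),
) -> List[str]:
    """Return watched directories that appear as their own mount point.

    mounts_text is expected in /proc/mounts format, where each line is
    "device mount_point fstype options dump pass".
    """
    watched_set = set(watched)
    found: List[str] = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        if mount_point in watched_set and mount_point not in found:
            found.append(mount_point)
    return found


if __name__ == "__main__":
    # Made-up sample; on a real Linux node you could instead pass
    # Path("/proc/mounts").read_text().
    sample = (
        "/dev/mapper/vg-root / ext4 rw 0 0\n"
        "/dev/mapper/vg-customvar /var ext4 rw 0 0\n"
        "/dev/mapper/vg-u01 /u01 ext4 rw 0 0\n"
    )
    print(find_separate_mounts(sample))
    if Path("/proc/mounts").exists():
        print(find_separate_mounts(Path("/proc/mounts").read_text()))
```

Any directory the function reports is only a prompt for a conversation with Oracle support about whether that layout is supported for patching, not a verdict in itself.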
PRECHECKS FAILED FOR INFINIBAND PATCHING:
While preparing for InfiniBand patching on an Exadata X6-2 machine in the End to End environment, the prechecks run with the patchmgr utility failed at the “verifying the network topology” stage.
This check ensures that every Cell Node and/or Compute node in the Exadata stack is redundantly connected to the available IB leaf switches.
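The redundancy condition itself is simple to state in code. The Python sketch below is not how the Oracle check is implemented; it only illustrates the rule described above, under the assumption that you already have a list of (node, leaf switch) cable links from your own inventory. The function name and all host and switch names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def nodes_without_redundancy(
    nodes: Iterable[str],
    links: Iterable[Tuple[str, str]],
    min_switches: int = 2,
) -> List[str]:
    """Return nodes cabled to fewer than min_switches distinct leaf switches.

    nodes lists every cell and compute node expected in the rack, so a
    node with no links at all is reported as well. links holds one
    (node, leaf_switch) pair per cable.
    """
    switches_by_node: Dict[str, Set[str]] = {node: set() for node in nodes}
    for node, switch in links:
        switches_by_node.setdefault(node, set()).add(switch)
    return sorted(
        node
        for node, switches in switches_by_node.items()
        if len(switches) < min_switches
    )


if __name__ == "__main__":
    # Hypothetical inventory: cel01 has both cables on the same leaf
    # switch and cel02 has no cables recorded at all.
    rack_nodes = ["db01", "cel01", "cel02"]
    cables = [
        ("db01", "ib-leaf-a"),
        ("db01", "ib-leaf-b"),
        ("cel01", "ib-leaf-a"),
        ("cel01", "ib-leaf-a"),
    ]
    print(nodes_without_redundancy(rack_nodes, cables))
```

Counting distinct switches rather than cables matters here: two cables into the same leaf switch give no protection when that switch is rebooted during patching.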
A sample of the exception that surfaces during the precheck using the patchmgr utility is as follows:
On explicitly running the “Verify Topology Check”, the following exceptions are seen:
The verify topology check output also identifies the compute/cell nodes to which the above exceptions apply (not shown in the above image).
The solution was an Oracle Field Engineer visit to the datacenter to fix the IB cabling. A precheck and verify-topology check for InfiniBand succeeded after the cabling on the IB switches was fixed.
It is well worth running the verify-topology check against the IB topology in advance, before you run the patchmgr pre-checks and well before you plan to patch your InfiniBand switches.
In a nutshell, meticulous planning and attention to small details and to any exceptions encountered really pay off in the long run. This is especially true when you need to deliver expected and predictable results while patching an Exadata appliance in a “No Scope For Error” zone such as a production environment.
We hope we have added some insight to your Exadata patching outlook!
Consult with Fusion Professional to discover how our best practices can ensure your Exadata patching is done the right way the first time, with no unexpected results or downtime. Let our experience work for you.