While managing Exadata, we have seen numerous incidents, ranging from node evictions to a hung cluster and from cooked controllers to bad hard disks. On every occasion we have worked with My Oracle Support (MOS) to identify the root cause, which more often than not turns out to be either a known bug or a new one.

While working on Service Requests (SRs), a great deal of data has to be gathered and sent to Oracle Support. This becomes a project in itself, because the data is voluminous and lives in many different locations: database logs, ASM logs, CRS logs, diskmon logs, cell logs, OS Watcher logs, data gathered from sundiag and other Oracle-supplied scripts, and so on. The problem gets worse when Support engineers ask for the logs one by one; given the number of locations and the volume, it can take weeks to gather and upload everything in this serial fashion.

The time spent on data gathering can be saved by proactively gathering and uploading the files that are likely to be required, without waiting for Oracle Support to ask for them one by one. In fact, this is what Oracle Support itself suggests when you raise an SR, since it asks you to upload an RDA report, logs and so on. In the Exadata case, though, this should be treated as a mandatory best practice whenever you raise an SR with MOS.

So whenever Exadata crashes, hangs, or experiences some other serious issue, and you intend to work with Oracle Support on the resolution and root cause analysis, upload the following data with your SR to save both your time and MOS's on data gathering:

— Trace directories of database and asm from all database nodes

/diag/rdbms/<database name>/<rdbms sid name>/trace
/diag/asm/+asm/+ASM1/trace

— crsd, cssd, diskmon directories from all database nodes:

$GRID_HOME/log/<hostname>

— Operating System log from both database and cell nodes:

/var/log/messages

— OS Watcher files of the time when problem occurred from both database and cell nodes

cd /opt/oracle.oswatcher/osw/archive
mkdir -p /tmp/osw11   # zip does not create the target directory itself
find . -name '*<month>.<day of month>*' -print -exec zip /tmp/osw11/osw_$(hostname).zip {} \;
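To preview which OS Watcher files a given day's pattern would pick up before zipping them, a small helper along the following lines may be useful. This is only a sketch: the function name `osw_files_for_day` and its arguments are made up for illustration, and it assumes the month and day are passed as two-digit strings so that they line up with the `<month>.<day of month>` pattern used above.

```shell
#!/bin/sh
# List OS Watcher archive files whose names contain <month>.<day>,
# using the same name pattern as the find command above.
# Arguments (hypothetical helper): archive directory, month, day.
osw_files_for_day() {
    archive_dir=$1
    month=$2
    day=$3
    find "$archive_dir" -type f -name "*${month}.${day}*" -print | sort
}

# Example (not run here):
#   osw_files_for_day /opt/oracle.oswatcher/osw/archive 07 07
```

If the listing looks right, it can be piped into zip, for example `osw_files_for_day /opt/oracle.oswatcher/osw/archive 07 07 | zip /tmp/osw11/osw_$(hostname).zip -@`, assuming your zip build reads file names from standard input with `-@`.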

— listener.log from all the database nodes

/diag/tnslsnr/<hostname>/listener/trace

— Also collect diagnostic information through sundiag in case of disk failure:

/opt/oracle.SupportTools/sundiag.sh

— Also output the following commands from the cells:

dcli -g cell_group -l root 'service celld status'
dcli -g cell_group -l root 'cellcli -e list physicaldisk'
dcli -g cell_group -l root 'cellcli -e list lun'
dcli -g cell_group -l root 'cellcli -e list celldisk'
dcli -g cell_group -l root 'cellcli -e list griddisk'
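Since the output of these commands has to end up in files that can be uploaded, it may help to capture each one into its own file in a single pass. The sketch below does that; the function name, the `DCLI` override variable and the output file names are assumptions made for illustration, while `cell_group` and the commands themselves are the ones listed above.

```shell
#!/bin/sh
# Run each cell status command through dcli and save the output to one
# file per command. DCLI can be overridden (for example for a dry run);
# the output file names are arbitrary choices for this sketch.
DCLI=${DCLI:-dcli}

capture_cell_outputs() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    while IFS='|' read -r name cmd; do
        # stdin is redirected so the command cannot swallow the list below
        "$DCLI" -g cell_group -l root "$cmd" < /dev/null > "$outdir/${name}.txt" 2>&1
    done <<'EOF'
celld_status|service celld status
physicaldisk|cellcli -e list physicaldisk
lun|cellcli -e list lun
celldisk|cellcli -e list celldisk
griddisk|cellcli -e list griddisk
EOF
}

# Example (not run here):
#   capture_cell_outputs /tmp/cell_commands_output
```

Keeping one file per command makes it easier to tell MOS exactly which output is where when you update the SR.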

— If the issue is about CRS then:

$GRID_HOME/bin/diagcollection.sh

— Two scripts have been released which will collect trace files and log files from the storage cells and compute nodes.

  1. DbmCheck.sh

    Connects to all the storage cells and collects general information, like configuration parameters, celldisks, griddisks, flash cache, etc.

    To execute DbmCheck.sh:
     #cd /opt/oracle.SupportTools/onecommand
     #./DbmCheck.sh -c -v -d
     All the files are generated under directory /opt/oracle.SupportTools/onecommand/diagfiles/<date> where date is in the format YYYY_MM_DD.
  2. diagget.sh

    This script connects to all the compute nodes and collects log and trace files related to Grid Infrastructure, ASM, RDBMS (including all the databases), Operating System files, OSWatcher, etc.

    To execute diagget.sh:
     #cd /opt/oracle.SupportTools/onecommand
     #./diagget.sh -n <x> -v -d
     All the files are generated under directory /opt/oracle.SupportTools/onecommand/diagfiles/<date> where date is in the format YYYY_MM_DD.
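Because both scripts write into a directory named after the date, the path for today's run can be computed rather than typed by hand. A minimal sketch, with a made-up function name, assuming the scripts were run on the same calendar day as the lookup:

```shell
#!/bin/sh
# Print the directory where DbmCheck.sh and diagget.sh place their output
# for the current date, using the YYYY_MM_DD format described above.
# diagfiles_dir is a hypothetical helper name.
diagfiles_dir() {
    printf '%s/%s\n' /opt/oracle.SupportTools/onecommand/diagfiles "$(date +%Y_%m_%d)"
}

# Example (not run here):
#   ls -l "$(diagfiles_dir)"
```

If a collection run crosses midnight, the directory name may carry the previous day's date, so it is worth checking the diagfiles directory directly in that case.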

The data above is enormous, so you need to be methodical, for your own sake and for Oracle Support's, while gathering and uploading it. Otherwise MOS ends up asking for the same data again and again, and you end up uploading enormous amounts of data again and again.

Begin by opening an SR with My Oracle Support, and mention in it that you are in the process of uploading the diagnostic files through ftp.

  • Create a directory under /tmp on your database node and name it after today’s date and the SR number, e.g. July7_<SR Number>.
  • For database node diagnostic data, create a directory named database under July7_<SR Number>.
  • Copy the diagnostic files mentioned above into /tmp/July7_<SR Number>/database on your database node.
  • When you run commands or scripts on the db nodes, place their output in a separate directory called commands_output in /tmp/July7_<SR Number>/database.
  • For cell node diagnostic data, create a directory named cell under July7_<SR Number>.
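The layout described in the list above can be created and filled with two small helpers. This is a sketch only: `make_sr_tree` and `stage_paths` are hypothetical names, and the cell-side commands_output directory is created up front because the cell steps further down use it. Each copied path is renamed after its full source path, so that two directories that are both called trace do not overwrite each other.

```shell
#!/bin/sh
# Create the staging tree under a base directory and print its path.
# tag is the date-plus-SR label, e.g. July7_<SR Number>.
make_sr_tree() {
    base=$1
    tag=$2
    mkdir -p "$base/$tag/database/commands_output" \
             "$base/$tag/cell/commands_output" && echo "$base/$tag"
}

# Copy each listed file or directory into dest, naming the copy after the
# full source path (slashes become underscores). Missing paths are
# reported on stderr and skipped.
stage_paths() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for p in "$@"; do
        if [ -e "$p" ]; then
            name=$(printf '%s' "$p" | sed 's|^/||; s|/|_|g')
            cp -rp "$p" "$dest/$name"
        else
            echo "skipped (not found): $p" >&2
        fi
    done
}

# Example (not run here; the SR number is a placeholder):
#   stage=$(make_sr_tree /tmp July7_12345)
#   stage_paths "$stage/database" /var/log/messages "$GRID_HOME/log/$(hostname)"
```

Copying whole trace directories can take a lot of space in /tmp, so check the free space before staging large directories this way.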

Now gather the information on each cell, compress it using bzip2, make sure each gathered file or directory carries the cell name, and then scp it to the /tmp/July7_<SR Number>/cell directory on the first database node.
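The compress-and-name step on each cell could look roughly like the following. The function name `pack_for_cell` and the `<name>_<cell>.tar.bz2` naming scheme are assumptions for illustration; the text above only requires that the cell name appears in each file name.

```shell
#!/bin/sh
# Compress a gathered file or directory with bzip2 (via tar) and put the
# cell name in the archive name so uploads from different cells do not
# collide. The third argument defaults to this machine's hostname.
pack_for_cell() {
    src=$1
    outdir=$2
    cell=${3:-$(hostname)}
    archive="$outdir/$(basename "$src")_${cell}.tar.bz2"
    tar -cjf "$archive" -C "$(dirname "$src")" "$(basename "$src")" && echo "$archive"
}

# Example (not run here; host and SR number are placeholders):
#   a=$(pack_for_cell /tmp/commands_output /tmp)
#   scp "$a" <first database node>:"/tmp/July7_<SR Number>/cell/"
```

Using `-C` keeps only the last path component inside the archive, so every cell's archive unpacks into a short, predictable directory name.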

  • When you run commands or scripts on the cell nodes, place their output in a separate directory called commands_output in /tmp/July7_<SR Number>/cell.
  • Once all the database and cell diagnostic information has been gathered, everything required sits under /tmp/July7_<SR Number>, with the cell and database data separated under one date.
  • Now all of this data needs to be uploaded through ftp to ftp.oracle.com.
  • First tar and compress the collected information in the background:
    nohup tar cjf logs.tar.bz2 /tmp/July7_<SR Number>/* &
  • On ftp.oracle.com, create a directory named after your SR number under the /support/incoming directory.
  • Now ftp the tarred and compressed file into that directory.
  • As the file gets uploaded, update the SR with what you have uploaded and where. Just paste the above hierarchy of files and MOS should be able to find everything.
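For the SR update, a plain listing of the staged files is usually enough to show MOS what was uploaded and where. The sketch below produces that listing and, as a variation on the tar command above, bundles the staging directory with relative paths. The function names and the `<tag>.tar.bz2` archive name are assumptions for illustration; the command in the list uses logs.tar.bz2 instead.

```shell
#!/bin/sh
# Print a sorted list of every file under the staging directory, with
# paths relative to its parent, suitable for pasting into the SR.
sr_manifest() {
    stage=$1
    ( cd "$(dirname "$stage")" && find "$(basename "$stage")" -type f | sort )
}

# Bundle the staging directory into <tag>.tar.bz2 next to it and print
# the archive path. Paths inside the archive start at <tag>/.
sr_bundle() {
    stage=$1
    tag=$(basename "$stage")
    out="$(dirname "$stage")/${tag}.tar.bz2"
    tar -cjf "$out" -C "$(dirname "$stage")" "$tag" && echo "$out"
}

# Example (not run here; the SR number is a placeholder):
#   sr_manifest /tmp/July7_12345 > /tmp/July7_12345_manifest.txt
#   nohup sh -c '. ./sr_helpers.sh; sr_bundle /tmp/July7_12345' &
```

The `sr_helpers.sh` file name in the example is hypothetical; it stands for wherever you saved these functions.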

This amount of data should keep MOS busy for some days, and if they ask for more, you just have to change the date and keep the structure the same.