Tuesday, July 3, 2018

Unable to open device "/dev/sdab1": Device or resource busy during oracleasm disk creation.


A "Device or resource busy" error when creating ASM disks at the OS level using raw disks provisioned from NetApp storage.


Issue:- Oracle ASM disk labeling failed for one of the multipath raw devices provisioned from NetApp storage, as shown below.

All the provided LUNs were discovered on the DB servers by rescanning the SCSI hosts with the commands below.

echo "- - -" > /sys/class/scsi_host/host1/scan
echo "- - -" > /sys/class/scsi_host/host2/scan
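Each SCSI host has to be rescanned when new LUNs are presented, so on servers with many HBAs the two commands above are usually generalized into a loop. A minimal sketch; the `rescan_scsi_hosts` helper name and its optional base-path argument are my own additions (the argument only exists to make the function easy to exercise outside a real system):

```shell
# Rescan every SCSI host so newly presented LUNs become visible.
# Optional first argument overrides the sysfs base path
# (default /sys/class/scsi_host); on a live system run as root
# with no argument.
rescan_scsi_hosts() {
  local base="${1:-/sys/class/scsi_host}"
  local host
  for host in "$base"/host*/scan; do
    [ -e "$host" ] || continue      # no SCSI hosts found under $base
    echo "- - -" > "$host"          # "- - -" = wildcard channel/target/LUN
  done
}
```

After a successful rescan the new devices appear under /dev/sd*.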

Once the disks appeared at the OS level on both nodes (a 2-node RAC DB server), the actual issue was encountered.

[root@ebsdbser01 mapper]# fdisk -l /dev/sds
Disk /dev/sds: 214.7 GB, 214748364800 bytes
255 heads, 63 sectors/track, 26108 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0x40402c59

   Device Boot      Start         End      Blocks   Id  System
/dev/sds1               1       26108   209712478+  83  Linux
Partition 1 does not start on physical sector boundary.

[root@ebsdbser01 mapper]# fdisk -l /dev/sdab

Disk /dev/sdab: 214.7 GB, 214748364800 bytes
255 heads, 63 sectors/track, 26108 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0x40402c59

    Device Boot      Start         End      Blocks   Id  System
/dev/sdab1               1       26108   209712478+  83  Linux
Partition 1 does not start on physical sector boundary.

[root@ebsdbser01 mapper]# oracleasm createdisk FLASH01 /dev/sdab1

Unable to open device "/dev/sdab1": Device or resource busy
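The call fails because /dev/sdab is only one path of a multipath LUN and is already claimed by device-mapper. The kernel exposes which sd devices back a given dm device under /sys/block/&lt;dm&gt;/slaves. A small sketch; the `mpath_slaves` helper and its optional sysfs-root argument are my own naming, used so the lookup can be demonstrated outside a real system:

```shell
# List the underlying sd* paths that device-mapper holds for a dm device.
# Optional second argument overrides the sysfs root (default /sys),
# which also makes the function testable without real hardware.
mpath_slaves() {
  local dm="$1" sysfs="${2:-/sys}"
  ls "$sysfs/block/$dm/slaves" 2>/dev/null
}
```

For the LUN in this post, `mpath_slaves dm-14` would list sds, sdab, sdj and sdak, exactly the paths that report "busy" when addressed directly.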



Workaround:-

As a workaround, the multipath mapper device itself was partitioned with fdisk instead of one of its underlying raw device paths, and partprobe was run after partitioning. This propagated the logical partition to all the dependent paths, and the ASM disk labeling then completed successfully on both RAC DB nodes.

[root@ebsdbser01 mapper]# multipath -ll mpathe
mpathe (3600a098038303778462b49533170412f) dm-14 NETAPP,LUN C-Mode
size=200G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 1:0:1:200 sds  65:32  active ready running
| `- 2:0:0:200 sdab 65:176 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 1:0:0:200 sdj  8:144  active ready running
  `- 2:0:1:200 sdak 66:64  active ready running
 
[root@ebsdbser01 mapper]# partprobe /dev/mapper/mpathe

[root@ebsdbser01 mapper]# ls -ltr /dev/mapper/mpathe*
lrwxrwxrwx. 1 root root 8 Jul  3 15:46 /dev/mapper/mpathep1 -> ../dm-24
lrwxrwxrwx. 1 root root 8 Jul  3 15:46 /dev/mapper/mpathe -> ../dm-14

[root@ebsdbser01 mapper]# oracleasm createdisk FLASH01 /dev/mapper/mpathep1
Writing disk header: done
Instantiating disk: done

[root@ebsdbser01 mapper]# oracleasm querydisk -d FLASH01
Disk "FLASH01" is a valid ASM disk on device [251,24]

[root@ebsdbser01 mapper]# oracleasm querydisk -p FLASH01
Disk "FLASH01" is a valid ASM disk
/dev/sdj1: LABEL="FLASH01" TYPE="oracleasm"
/dev/sds1: LABEL="FLASH01" TYPE="oracleasm"
/dev/sdab1: LABEL="FLASH01" TYPE="oracleasm"
/dev/sdak1: LABEL="FLASH01" TYPE="oracleasm"
/dev/mapper/mpathep1: LABEL="FLASH01" TYPE="oracleasm"
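To repeat this workaround for the remaining LUNs, the interactive fdisk session can be scripted with non-interactive parted (my substitution, not what the post used). The sketch below defaults to a dry run that only prints the commands; the `partition_mpath` name and its dry-run convention are my own:

```shell
# Partition a multipath mapper device and re-read its partition table.
# Defaults to a dry run (echoing the commands); pass "" as the second
# argument to execute for real, as root. Uses non-interactive parted
# in place of the interactive fdisk session shown above.
partition_mpath() {
  local dev="$1" run="${2-echo}"
  $run parted -s "$dev" mklabel msdos mkpart primary 1MiB 100%
  $run partprobe "$dev"
}

partition_mpath /dev/mapper/mpathe   # dry run: prints the two commands
```

Starting the partition at 1MiB also keeps it aligned to the 4096-byte physical sectors, avoiding the "Partition 1 does not start on physical sector boundary" warning seen in the fdisk output above.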

Tuesday, June 12, 2018

Database incomplete recovery led to ORA-00600: internal error code, arguments: [13013] ... on a SYS-owned CLUSTER table.

An Oracle database hosting an SAP SLT (SAP Landscape Transformation) application was corrupted by an unexpected crash of its ASM disk groups.

The worst-corrupted disks belonged to the FRA, where the online redo logs (ORL) and archive logs were placed.



When Oracle Support confirmed that these disks were beyond recovery, a fellow database administrator performed an incomplete recovery. After the incomplete DB recovery, everyone assumed that everything was working fine.



But the SAP SLT application users started complaining, in their own words, as below.

"We are seeing lot of db oracle related error messages in the application level, which hampers almost all activity in the system. Kindly check and do the needful."
I then started investigating, and my initial checks showed that the error below had been recurring ever since the incomplete DB recovery mentioned above.

The Oracle internal error encountered was:

ORA-00600: internal error code, arguments: [13013], [5001], [344], [8485480], [0], [8485480], [17], [], [], [], [], []


Oracle documents ORA-600 [13013] as typically being caused by a corrupted index, with six optional arguments to be decoded as below. I continued my investigation by matching the arguments.


Format: ORA-600 [13013] [a] [b] [c] [d] [e] [f]


    Arg [a] Passcount -- 5001
    Arg [b] Data object number -- 344
    Arg [c] Decimal relative DBA (RDBA) of the block containing the row to be updated -- 8485480
    Arg [d] Row slot number -- 0
    Arg [e] Decimal RDBA of the block being updated (typically same as [c]) -- 8485480
    Arg [f] Code -- 17
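An RDBA packs the relative file number into its top 10 bits and the block number into the low 22 bits, so the 8485480 in Arg [c] can be decoded with plain shell arithmetic. A sketch; `rdba_decode` is my own helper name:

```shell
# Decode a decimal relative DBA into relative file# and block#:
# the top 10 bits hold the file number, the low 22 bits the block number.
rdba_decode() {
  local rdba=$1
  echo "file#=$(( rdba >> 22 )) block#=$(( rdba & ((1 << 22) - 1) ))"
}

rdba_decode 8485480   # file#=2 block#=96872
```

The same decoding is available inside the database via DBMS_UTILITY.DATA_BLOCK_ADDRESS_FILE and DBMS_UTILITY.DATA_BLOCK_ADDRESS_BLOCK.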

When I tried to identify the problematic object behind the corrupted indexes, I found it was not a table but a CLUSTER object in the SYS schema.


A CLUSTER object is a group of tables that share common columns and store related data in the same blocks.

  • Identifying the DB object affected by the above corruption.

SET LINESIZE 200
COL OBJECT_NAME FORMAT A30
COL OWNER FORMAT A15
SELECT OBJECT_NAME, OWNER, OBJECT_ID, OBJECT_TYPE FROM DBA_OBJECTS WHERE OBJECT_ID = 344;
OBJECT_NAME                    OWNER            OBJECT_ID OBJECT_TYPE
------------------------------ --------------- ---------- -----------------------
SMON_SCN_TO_TIME_AUX           SYS                    344 CLUSTER

The above cluster object is built on columns from a SYS table called SMON_SCN_TIME. I took a roundabout route to find the table behind this cluster, analyzing the SYS tables whose names matched the cluster name, until I found the exact table and hit the real issue while validating its structure.
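A more direct route is to ask the data dictionary which tables are stored in the cluster; DBA_TABLES exposes this through its CLUSTER_NAME column (shown here as an alternative, not the path the original investigation took):

```sql
-- Tables stored in the affected cluster; for this case the query
-- should return SYS.SMON_SCN_TIME.
SELECT OWNER, TABLE_NAME
  FROM DBA_TABLES
 WHERE CLUSTER_NAME = 'SMON_SCN_TO_TIME_AUX';
```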

  • Table affected by the above corruption


SQL> ANALYZE TABLE SYS."SMON_SCN_TIME" VALIDATE STRUCTURE CASCADE;

ERROR at line 1:
ORA-01499: table/index cross reference failure - see trace file

  • The issue was resolved after rebuilding the corrupted indexes, as below.

SQL> SELECT OWNER,INDEX_NAME FROM DBA_INDEXES WHERE TABLE_NAME='SMON_SCN_TIME';

OWNER      INDEX_NAME
---------- -------------------------
SYS        SMON_SCN_TIME_TIM_IDX
SYS        SMON_SCN_TIME_SCN_IDX

SQL> ANALYZE INDEX SMON_SCN_TIME_TIM_IDX VALIDATE STRUCTURE;


Index analyzed.


SQL> ANALYZE INDEX SMON_SCN_TIME_SCN_IDX VALIDATE STRUCTURE;


Index analyzed.


SQL> ALTER INDEX SMON_SCN_TIME_TIM_IDX REBUILD ONLINE;


Index altered.


SQL> ALTER INDEX SMON_SCN_TIME_SCN_IDX REBUILD ONLINE;


Index altered.


Finally, after the corrupted indexes were rebuilt, I was able to re-analyze the problematic table without any issues.



SQL> ANALYZE TABLE SYS."SMON_SCN_TIME" VALIDATE STRUCTURE CASCADE;

Table analyzed.


After this, the Oracle internal error disappeared from the alert log, and there were no further complaints from the application users.


Happy reading...
