Sunday, August 30, 2020

Oracle RAC node unavailable with error: Server unexpectedly closed network connection6]clsc_connect: (0x251c670) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_))

Around midnight I received a call from the monitoring team: one of the critical production database nodes was unavailable.

Since this data center (DC) has power issues most of the time, I expected the node to recover once power was restored and everything came back up. Instead, the problem continued, with frequent node evictions.

The Oracle cluster was up on only one node; the other node was still facing the issue.

In my initial validation, I made sure all the shared storage was up and the time was in sync. Using the output below, I then quickly found that the issue was with cluster interconnect communication. Note that ocssd.bin is running on node1 but missing on node2.

[oracle@node1 ~]$ ps -ef| egrep 'crsd.bin|ocssd.bin|evmd.bin' | grep -v grep
oracle   11815 11809  0 12:20 ?        00:00:00 /u01/app/crs/bin/evmd.bin
root     11929 10953  3 12:20 ?        00:01:07 /u01/app/crs/bin/crsd.bin reboot
oracle   12641 12148  0 12:20 ?        00:00:06 /u01/app/crs/bin/ocssd.bin

[oracle@node2 ~]$ ps -ef| egrep 'crsd.bin|ocssd.bin|evmd.bin' | grep -v grep
oracle   11508 11506  0 12:31 ?        00:00:00 /u01/app/crs/bin/evmd.bin
root     11661 10700  3 12:31 ?        00:00:47 /u01/app/crs/bin/crsd.bin reboot 
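
The comparison between the two listings can be automated. The following is a minimal sketch, not part of the original troubleshooting: the function name missing_daemons is hypothetical, and it assumes the daemons appear in the process listing with a full path ending in the binary name, as they do above.

```shell
#!/bin/sh
# Report which of the three clusterware daemons are absent from a
# process listing read on standard input.
# missing_daemons is a hypothetical helper name, not an Oracle tool.
missing_daemons() {
    listing=$(cat)
    missing=""
    for daemon in crsd.bin ocssd.bin evmd.bin; do
        # grep -F matches a fixed string, so the dot is taken literally.
        if ! printf '%s\n' "$listing" | grep -qF "/$daemon"; then
            missing="$missing $daemon"
        fi
    done
    # Strip the leading space; prints an empty line if nothing is missing.
    printf '%s\n' "${missing# }"
}

# Usage on a live node:
#   ps -ef | missing_daemons
```

Fed the node2 listing above, this would report ocssd.bin as the missing daemon, which is the same conclusion reached by reading the two outputs side by side.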

Confirming my suspicion, the messages below from crsd.log on node2 pointed to the same issue: CRSD cannot connect to CSS, which is indirectly related to the interconnect problem.

vi /u01/app/crs/log/node2/crsd/crsd.log

Server unexpectedly closed network connection6]clsc_connect: (0x251c670) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_))

2020-08-30 09:52:42.975: [ CSSCLNT][2274735840]clsssInitNative: connect failed, rc 9

2020-08-30 09:52:42.975: [  CRSRTI][2274735840]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..
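
Rather than paging through the log in vi, the same signatures can be counted with grep. This is only a sketch: the function name css_failures is hypothetical, and the three patterns are taken from the messages quoted above rather than from any complete list of CSS errors.

```shell
#!/bin/sh
# Count the lines in a crsd.log that carry one of the CSS connection
# failure messages quoted above.
# css_failures is a hypothetical helper name; $1 is the log file path.
css_failures() {
    # -c prints the number of matching lines, -E enables alternation.
    grep -cE 'clsssInitNative: connect failed|CSS is not ready|no listener at' "$1"
}

# Usage with the log path from this post:
#   css_failures /u01/app/crs/log/node2/crsd/crsd.log
```

A count that keeps growing while ocssd.bin is absent suggests CRSD is still waiting on CSS, so the underlying cause has not yet been fixed.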

[oracle@node1 ~]$ ping node2-priv
PING node2-priv..cstt.gov (172.16.0.2) 56(84) bytes of data.
From node1-priv..cstt.gov (172.16.0.1) icmp_seq=2 Destination Host Unreachable
From node1-priv..cstt.gov (172.16.0.1) icmp_seq=3 Destination Host Unreachable
From node1-priv..cstt.gov (172.16.0.1) icmp_seq=4 Destination Host Unreachable
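
The same reachability test can be scripted so that it runs against every private hostname in one go. The sketch below is an illustration only: check_interconnect is a hypothetical name, the count of three echo requests is an arbitrary choice, and it relies on ping exiting non-zero when no reply is received.

```shell
#!/bin/sh
# Probe each private interconnect hostname given as an argument and
# report whether it answers ping.
# check_interconnect is a hypothetical helper name.
check_interconnect() {
    status=0
    for host in "$@"; do
        # -c 3 sends three echo requests; output is discarded because
        # only the exit status is needed here.
        if ping -c 3 "$host" >/dev/null 2>&1; then
            echo "$host reachable"
        else
            echo "$host UNREACHABLE"
            status=1
        fi
    done
    # Non-zero return if at least one host did not answer.
    return $status
}

# Usage from either node:
#   check_interconnect node1-priv node2-priv
```

Running it from both nodes helps show whether the failure is one-directional, although it cannot tell a cabling fault from a configuration fault.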

Solution:
The break in the cluster interconnect is the culprit here. Fixing the private network issue resolves the problem.

In our case, the private Ethernet cards were up and active on both nodes, but the nodes were unable to communicate via the private IPs.
This seemed strange to us, but a physical inspection in the DC found a problem with the network cable and a physical port.
As soon as the physical issue was fixed, we rebooted the server, which came up successfully, and the problem was resolved.

1 comment:

  1. Excellent Shafi. Thanks for sharing this. I hope this would be helpful for many people during troubleshooting node failures . Keep sharing many more such issues .👍
