Saturday, March 8, 2014

NFS locking issue while data pump export - Linux-x86_64 Error: 37: No locks available

Yesterday, as usual, the cron job triggered a Data Pump export job against a database on a Linux server.
The export failed immediately after it started. When I looked into the dump log file, I found the following errors.

Export: Release 11.2.0.3.0 - Production on Sat Mar 8 05:53:37 2014

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
;;;
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning and Automatic Storage Management options
ORA-39000: bad dump file specification
ORA-31641: unable to create dump file "/oraexp/NTLSNDB/Data_Pump_Export_NTLSNDB_FULL_030814_0553_01.dmp"
ORA-27086: unable to lock file - already in use
Linux-x86_64 Error: 37: No locks available
Additional information: 10

I verified at the database level that the dump directory object, its path, and the read and write privileges on the directory were all in place. Everything was fine on the database side.

I suspected an issue with the NFS mount options at the OS level. We use a shared NFS mount point for all of the servers, and it must be mounted with the proper options on each server before the database can use it. I could see that the mount point was mounted correctly with the options recommended by Oracle.

Then I checked the logs at the OS level and found that the issue was related to the nfslock service, which was not running on this database server. This service lets the NFS client lock a file on the NFS mount point, which the database needs in order to create the dump file and write to it.

>cat messages | grep lockd
Mar  8 04:03:31 demoserver kernel: lockd: cannot monitor 10.207.80.179
Mar  8 04:03:31 demoserver kernel: lockd: failed to monitor 10.207.80.179
Mar  8 04:20:27 demoserver kernel: lockd: cannot monitor 10.207.80.179
Mar  8 04:20:27 demoserver kernel: lockd: failed to monitor 10.207.80.179
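
The same log check can be wrapped in a small function so that a cron job can run it before the export. This is only a sketch: the function name count_lockd_failures and the default log path /var/log/messages are assumptions, not part of the original setup.

```shell
#!/bin/sh
# Count "lockd: cannot monitor" and "lockd: failed to monitor" lines in a
# syslog file. The default path below is an assumption; pass the real log
# file as the first argument.
count_lockd_failures() {
    logfile=${1:-/var/log/messages}
    # -E enables the (a|b) alternation, -c prints the number of matching lines
    grep -Ec 'lockd: (cannot|failed to) monitor' "$logfile"
}
```

A count above zero shortly before the export window is a hint that file locking on the NFS mount may not work.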

I later learned that the server had been rebooted a couple of days earlier and that the nfslock service did not start automatically after the reboot, so we started it manually. Note that for the nfslock service to start automatically after a reboot, it must be enabled with chkconfig nfslock on. We have since done that, so from now on the nfslock service will start automatically whenever the server is rebooted.

cat messages | grep rpc
Mar  8 07:01:43 demoserver rpc.statd[12667]: Version 1.0.9 Starting
Mar  8 07:01:49 demoserver rpc.statd[12667]: Caught signal 15, un-registering and exiting.
Mar  8 07:01:49 demoserver rpc.statd[12745]: Version 1.0.9 Starting

You can manage the nfslock service with the commands below.

service nfslock status
service nfslock start
service nfslock stop
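
To confirm that the service is also enabled at boot, the output of chkconfig --list nfslock can be checked for the runlevel the server boots into. The helper below is a sketch: the name enabled_in_runlevel and the default runlevel 3 are assumptions.

```shell
#!/bin/sh
# Reads one line of `chkconfig --list <service>` output on stdin and succeeds
# only if the service is "on" in the given runlevel (default 3, an assumption).
enabled_in_runlevel() {
    level=${1:-3}
    awk -v lvl="$level" '
        # fields 2..NF look like 0:off 1:off 2:off 3:on ...
        { for (i = 2; i <= NF; i++) if ($i == (lvl ":on")) found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

Typical use would be chkconfig --list nfslock | enabled_in_runlevel 3 || echo "nfslock is not enabled at boot".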

After making sure that the service was started and that the client could lock files on the NFS file system, we re-triggered the export job. It completed successfully.
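
A pre-flight check in the cron script can catch this kind of failure before expdp starts. The sketch below assumes the flock utility from util-linux is installed and that flock requests on the NFS mount go through the same lock manager (true for Linux NFS clients since kernel 2.6.12 unless local locking is configured). The function name can_lock_in_dir is hypothetical.

```shell
#!/bin/sh
# Succeeds if an exclusive lock can be taken on a scratch file in directory $1.
# Assumes the flock(1) utility from util-linux is available.
can_lock_in_dir() {
    dir=$1
    probe="$dir/.lock_probe.$$"
    rc=0
    # -n: fail at once instead of waiting if the lock cannot be obtained
    flock -n "$probe" true 2>/dev/null || rc=1
    rm -f "$probe"
    return "$rc"
}
```

For example, can_lock_in_dir /oraexp/NTLSNDB || exit 1 would stop the job early with a clear cause instead of an ORA-27086 in the export log.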



Wednesday, March 5, 2014

Killing process in Unix

To check running processes in Unix,

Command- ps -ef

We can pipe the output to “grep” to find a particular process,

Example-

To find out running processes for apache,

root@sunpstsrv01# ps -ef | grep http

webservd   587   584   0   Sep 01 ?           0:00 /opt/csw/apache2/sbin/httpd -k start
    root   584     1   0   Sep 01 ?           0:47 /opt/csw/apache2/sbin/httpd -k start
  nobody  1498  1494   0   Sep 01 ?           0:00 /usr/local/apache2/bin/httpd -k start
webservd   586   584   0   Sep 01 ?           0:00 /opt/csw/apache2/sbin/httpd -k start
webservd   588   584   0   Sep 01 ?           0:00 /opt/csw/apache2/sbin/httpd -k start
  nobody  8860  1494   0   Sep 02 ?           0:00 /usr/local/apache2/bin/httpd -k start
  nobody  1499  1494   0   Sep 01 ?           0:00 /usr/local/apache2/bin/httpd -k start
  nobody  8861  1494   0   Sep 02 ?           0:00 /usr/local/apache2/bin/httpd -k start
  nobody  1500  1494   0   Sep 01 ?           0:00 /usr/local/apache2/bin/httpd -k start
  nobody  1501  1494   0   Sep 01 ?           0:00 /usr/local/apache2/bin/httpd -k start
  nobody  1502  1494   0   Sep 01 ?           0:00 /usr/local/apache2/bin/httpd -k start
  nobody  2832  1494   0   Sep 01 ?           0:00 /usr/local/apache2/bin/httpd -k start
webservd  6031   584   0   Sep 01 ?           0:00 /opt/csw/apache2/sbin/httpd -k start
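
When only the process IDs are needed, the listing can be reduced to its PID column. The following is a minimal sketch; the function name pids_matching is hypothetical, and it matches the search string anywhere on the line, just as grep does.

```shell
#!/bin/sh
# Reads `ps -ef` output on stdin and prints the PID (2nd column) of every
# line that contains the literal string given as $1.
pids_matching() {
    awk -v pat="$1" 'index($0, pat) > 0 { print $2 }'
}

# Usage (taking the snapshot first keeps the filter itself out of the list):
#   snapshot=$(ps -ef)
#   printf '%s\n' "$snapshot" | pids_matching httpd
```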





To find out parent & child processes in Unix,



Command- ptree (prints the process tree; this is a Solaris command, and Linux offers the similar pstree)



Example-

root@sunpstsrv01# ptree 8860

1494  /usr/local/apache2/bin/httpd -k start
  8860  /usr/local/apache2/bin/httpd -k start



In the example above we ran the ptree command on process ID “8860”. The output shows that PID “1494” is the parent of child process “8860”.


Using the parent PID, we can list the IDs of all its running child processes.


Example-



root@sunpstsrv01# ptree 1494

1494  /usr/local/apache2/bin/httpd -k start
  1498  /usr/local/apache2/bin/httpd -k start
  1499  /usr/local/apache2/bin/httpd -k start
  1500  /usr/local/apache2/bin/httpd -k start
  1501  /usr/local/apache2/bin/httpd -k start
  1502  /usr/local/apache2/bin/httpd -k start
  2832  /usr/local/apache2/bin/httpd -k start
  8860  /usr/local/apache2/bin/httpd -k start
  8861  /usr/local/apache2/bin/httpd -k start
  8862  /usr/local/apache2/bin/httpd -k start



Here we can see all child PIDs associated with parent process ID “1494”.
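
On systems without ptree, the direct children of a process can be picked out of ps -ef output by comparing the PPID column. This is a sketch only: the function name children_of is hypothetical, and it lists one level of children rather than a full tree.

```shell
#!/bin/sh
# Reads `ps -ef` output on stdin and prints the PID (2nd column) of every
# process whose parent PID (3rd column) equals $1.
children_of() {
    awk -v ppid="$1" '$3 == ppid { print $2 }'
}

# Usage:
#   ps -ef | children_of 1494
```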


To kill parent & child processes,


Command- kill -9 PID
 

Example-



To kill apache process,



root@sunpstsrv01# kill -9 1494
 

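Here we are killing the parent process running for Apache.


Killing the parent does not by itself signal its children. kill -9 sends SIGKILL, which cannot be caught, so the parent gets no chance to shut its children down; orphaned children are adopted by init (PID 1) and may keep running. Some applications' children exit on their own once the parent is gone, but this is not guaranteed.

We can check for leftover child processes with the “ps -ef” command and kill them by PID if needed.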
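
A gentler sequence than going straight to kill -9 is to send SIGTERM first, wait a few seconds, and use SIGKILL only if the process is still there. The sketch below illustrates that idea; the function name stop_process and the 5-second default grace period are assumptions.

```shell
#!/bin/sh
# stop_process PID [GRACE_SECONDS]
# Sends SIGTERM, polls once per second for up to GRACE_SECONDS (default 5),
# and sends SIGKILL only if the process still exists after that.
stop_process() {
    pid=$1
    grace=${2:-5}
    kill -TERM "$pid" 2>/dev/null || return 0   # nothing to stop
    i=0
    while [ "$i" -lt "$grace" ]; do
        # kill -0 sends no signal; it only tests whether the PID still exists
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid" 2>/dev/null || true
}
```

For the Apache example this would be stop_process 1494, which gives the parent a chance to stop its own children before it exits.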

  

Zombie process in Unix

A zombie process is a process that has completed execution but still has an entry in the process table, allowing the process that started it to read its exit status.



When a process ends, all of the memory and resources associated with it are de-allocated so they can be used by other processes. However, the process entry in the process table remains. The parent is sent a SIGCHLD signal indicating that a child has died; the handler for this signal will typically execute the wait system call, which reads the exit status and removes the zombie.


Zombies can be identified in the output from the UNIX ps command by the presence of a “Z” in the state column (S in ps -el output, STAT in BSD-style ps output).
 

Example-


ps -el | grep 'Z'

With a plain ps -el command, the second column of the output shows the state of each process. Here are some states:



S : sleeping

R : running

D : uninterruptible wait (usually for I/O)

T : stopped (suspended) or traced

Z : zombie (defunct)

The output below is an example. We can see that dovecot-auth is the zombie.
 

[root@s324 /]# ps -el | grep 'Z'

F S   UID   PID  PPID  C PRI  NI ADDR    SZ WCHAN  TTY          TIME CMD
1 Z     0  1213   589  0  75   0    -     0 funct> ?        00:00:00 dovecot-auth

Here the “Z” in the 2nd column indicates a zombie process.
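
Because grep 'Z' matches a “Z” anywhere on the line, a stricter filter can test the state column alone and print the zombie's PID together with its parent's PID. This is a sketch that assumes the ps -el column layout shown above; the function name list_zombies is hypothetical.

```shell
#!/bin/sh
# Reads `ps -el` output on stdin and prints "PID PPID" for every process
# whose state column (2nd field) is exactly Z.
list_zombies() {
    awk '$2 == "Z" { print $4, $5 }'
}

# Usage:
#   ps -el | list_zombies
```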
 

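A zombie has already exited, so “kill -9 ‘Zombie PID’” has no effect on it. The entry disappears once the parent process reaps it with wait. If the parent never does, we might need to stop or restart the application that owns the parent process (PPID 589 in the example above); the zombie is then adopted by init and reaped.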

Oracle RAC node unavailable with error: Server unexpectedly closed network connection6]clsc_connect: (0x251c670) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node2_))

 Around midnight I received a call from the monitoring team that one of the critical production database nodes was not available. As I am aware...