watchdog
Package: WA2L/edrc 1.5.57
Section: General Commands (1)
Updated: 09 April 2008
NAME
watchdog - a watchdog to check a condition and react on it
SYNOPSIS
edrc/bin/watchdog [ -h ]
watchdog -n name -c check_script -b bite_script [ -i c_int [, b_int ] ] [ -r l_int ]
watchdog -s name
watchdog -l | -L { [ start ] | stop }
AVAILABILITY
WA2L/edrc
DESCRIPTION
The watchdog command is intended to make it easy to monitor a
certain condition or service and to react if the condition is not OK.
The states and terminology of the watchdog are adapted from the
behavior of a real dog, which checks, waits (between checks),
bites (if something is not OK) and breathes (to recover between bites).
The states of all running watchdogs are logged to a logfile.
The check script must be written so that a successful
check returns an exit code of 0. An unsuccessful check must
return an exit code <> 0.
The interval between check script starts (= check interval) can
be specified as the first value of the -i option.
The bite script (= reaction to an unsuccessful check) must
be written so that a successful bite returns an exit
code of 0. An unsuccessful bite should return an exit
code <> 0 so that the bite script is started again after the
bite interval.
The interval between bite script restarts (= bite interval) can
be specified as the second value of the -i option.
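How a single -i argument of the form c_int[,b_int] maps to the two
intervals can be sketched in shell. This is an illustration only, not
the actual watchdog code: the function name parse_intervals and the
variables C_INT and B_INT are hypothetical, and no validation of the
values is attempted. The defaults of 500 and 20 seconds are the ones
documented under OPTIONS.

```sh
# parse_intervals [ARG]
# Hypothetical helper: splits an -i argument "c_int[,b_int]" into
# C_INT and B_INT and falls back to the documented defaults.
parse_intervals() {
    C_INT=500   # documented default check interval in seconds
    B_INT=20    # documented default bite interval in seconds

    # No argument given: keep both defaults.
    [ -n "${1:-}" ] || return 0

    # Everything before the first comma is the check interval.
    C_INT=${1%%,*}

    # A non-empty part after the comma is the bite interval.
    case "$1" in
        *,?*) B_INT=${1#*,} ;;
    esac
}
```

With this sketch, "parse_intervals 300,30" yields a check interval of
300 and a bite interval of 30 seconds, while "parse_intervals 60"
keeps the default bite interval of 20 seconds.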
The bite script and the check script must exist on startup of
a watchdog.
If the bite script or the check script disappears during operation,
the number of retries can be configured with the
RETRIES_ON_MISSING_SCRIPT setting in the configuration file
watchdog.cfg.
The watchdog has the following states which are also recorded to
the logfile:
- check
-
This is also the first state after the start of a watchdog.
Execute the check script. If the check script returns an exit code
of 0, go to state "wait", else go to state "bite".
To keep repeated "check" entries from filling up the logfile,
the "check" state is logged only at a longer interval. The default
is one hour; it can be changed with the -r option.
- bite
-
Execute the bite script, which hopefully can restart the service or
recreate the proper condition. If the bite script returns an exit code
of 0, go to state "check", else go to state "breath".
- wait
-
Wait for the number of seconds specified as the check interval
(c_int of the -i option), then go to state "check".
- breath
-
Wait for the number of seconds specified as the bite interval
(b_int of the -i option), then go to state "bite".
- stop
-
The watchdog has been stopped with the -s option.
- abort
-
The watchdog has been killed with the kill watchdog_process_id
command (not recommended).
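The transitions between the four working states described above can
be summarized as a small shell function. This is a sketch for
illustration only, not the watchdog implementation; the function name
next_state is hypothetical. The states "stop" and "abort" are left out
because they are triggered from outside (the -s option or kill).

```sh
# next_state STATE [EXIT_CODE]
# Hypothetical helper: prints the state that follows STATE.
# EXIT_CODE is the exit code of the script that ran in STATE and is
# only evaluated for the states "check" and "bite".
next_state() {
    case "$1" in
        check)
            # check script OK -> wait, otherwise -> bite
            if [ "$2" -eq 0 ]; then echo wait; else echo bite; fi ;;
        bite)
            # bite script OK -> check again, otherwise -> breath
            if [ "$2" -eq 0 ]; then echo check; else echo breath; fi ;;
        wait)
            # check interval elapsed -> check
            echo check ;;
        breath)
            # bite interval elapsed -> bite again
            echo bite ;;
        *)
            echo "unknown state: $1" >&2
            return 1 ;;
    esac
}
```

For example, "next_state check 1" prints "bite", and
"next_state bite 0" prints "check".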
OPTIONS
- -h
-
usage message.
- -n name
-
name of the watchdog.
- -c check_script
-
script or single command that performs a check. If this script
returns an exit code <> 0, the bite script specified with the
-b option is started.
- -b bite_script
-
script or single command that performs the reaction (bite) if the
check script specified with the -c option returned an exit
code <> 0.
- -i c_int [, b_int ]
-
- c_int
-
check interval in seconds. Every c_int seconds the script
specified with the -c option is executed. If -i is not
specified, the default check interval of 500 seconds applies.
- b_int
-
bite interval in seconds. This is the interval between executions
of the bite_script while the bite script returns an exit
code <> 0, that is, between unsuccessful reactions (bites) of the
script specified with the -b option. If b_int is not specified,
the default bite interval of 20 seconds applies.
- -r l_int
-
minimum interval in seconds between log entries for the check state.
If this option is not specified the default of 3600 seconds applies.
- -s name
-
Stop the running watchdog with the name name.
Users can only stop watchdogs that they started themselves. Be aware
that stopping a watchdog can be delayed by the duration of the check
or bite script run. Do not kill the watchdog with the kill command.
- -l
-
list all running watchdogs.
- -L { [ start ] | stop }
-
list start/stop commands for all running watchdogs. This enables you to
easily restart all currently running watchdogs with the related command options.
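As a sketch of how the -L option might be used before maintenance
work, the start commands of all running watchdogs can be saved to a
file for later review. The function name save_start_commands is
hypothetical, and the sketch assumes that "watchdog -L start" writes
the commands to standard output; the exact output format is not
described here. Exit status 10 (no watchdog running) is taken from
EXIT STATUS.

```sh
# save_start_commands FILE
# Hypothetical helper: stores the start commands of all running
# watchdogs in FILE. Assumes "watchdog -L start" prints them to
# standard output.
save_start_commands() {
    watchdog -L start > "$1"
    rc=$?
    if [ "$rc" -eq 10 ]; then
        # Exit status 10: no watchdog is running.
        echo "no watchdog is running, nothing saved" >&2
    fi
    return "$rc"
}
```

After the maintenance the saved file should be reviewed before any of
the listed commands are run again.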
ENVIRONMENT
-
EXIT STATUS
- 0
-
no error.
- 1
-
the configuration file
edrc/etc/watchdog.cfg
does not exist.
- 2
-
operating system not supported, see
osid(3)
- 3
-
cannot write to lock directory. The ability for all users to write
to the lockdir is mandatory.
- 4
-
usage displayed.
- 5
-
a watchdog with the same name as specified in the
-n
option is already running.
- 6
-
the specified check script does not exist during watchdog startup.
- 7
-
the specified bite script does not exist during watchdog startup.
- 8
-
cannot write to logfile. The ability for all users to write
to the logfile is mandatory.
- 9
-
unsuccessful attempt to stop a watchdog of another user.
- 10
-
no watchdog is running while listing the started watchdogs with
the -l or -L option.
- 11
-
a temporary directory could not be claimed or created in /var/tmp.
If you get this error, check the system temporary directory
/var/tmp, because this error can be an indicator of a system
intrusion.
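A calling script may want to turn these exit codes into readable
messages. The following is a minimal sketch; the function name
watchdog_status_text is hypothetical, and the message texts are
shortened from the list above.

```sh
# watchdog_status_text CODE
# Hypothetical helper: prints a short description of a watchdog
# exit status as listed in EXIT STATUS.
watchdog_status_text() {
    case "$1" in
        0)  echo "no error" ;;
        1)  echo "configuration file edrc/etc/watchdog.cfg does not exist" ;;
        2)  echo "operating system not supported" ;;
        3)  echo "cannot write to lock directory" ;;
        4)  echo "usage displayed" ;;
        5)  echo "a watchdog with this name is already running" ;;
        6)  echo "check script does not exist" ;;
        7)  echo "bite script does not exist" ;;
        8)  echo "cannot write to logfile" ;;
        9)  echo "attempt to stop a watchdog of another user" ;;
        10) echo "no watchdog is running" ;;
        11) echo "temporary directory could not be claimed or created" ;;
        *)  echo "unknown exit status $1" ;;
    esac
}
```

A caller could, for example, run "watchdog -s name" and then pass $?
to watchdog_status_text to report why a stop request failed.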
FILES
- edrc/etc/watchdog.cfg
-
configuration file of watchdog. See watchdog.cfg(4) for more
information.
- edrc/var/watchdog
-
default lockdir. This directory holds the lockfiles of watchdog. Do
not edit them by hand. The lockdir can be configured in
watchdog.cfg, see watchdog.cfg(4) for more information.
EXAMPLES
- 1) using watchdog to guard the AutoSYS event daemon
-
The AutoSYS event daemon is a program that is connected to
a database and initiates job starts on remote systems. In some
versions of AutoSYS the daemon tends to shut down after
a startup. To ensure the permanent availability of the AutoSYS
job scheduler the event daemon is guarded by a watchdog.
1.1) check script
The check script uses the AutoSYS command "chk_auto_up" to
verify if the event daemon is up and running and returns 0
if this is true.
    #!/bin/ksh
    #
    # eventor.check - check that the event_daemon is running
    #
    # [00] 13.08.2004 CWa Initial Version
    #
    /bin/su - sys_asys -c "chk_auto_up"
    # This script treats exit code 11 of chk_auto_up as "up and running".
    if [ $? -eq 11 ]; then
        exit 0
    else
        exit 1
    fi
1.2) bite script
The bite script uses the AutoSYS command "eventor -q" to start
the event daemon and the command "chk_auto_up" to verify if the
start of the event daemon was successful. If the start was
successful, the script returns 0.
    #!/bin/ksh
    #
    # eventor.bite - restart event_daemon if it is not available
    #
    # [00] 13.08.2004 CWa Initial Version
    #
    /bin/su - sys_asys -c "eventor -q"
    # Give the event daemon a few seconds to come up before checking.
    sleep 5
    /bin/su - sys_asys -c "chk_auto_up"
    # This script treats exit code 11 of chk_auto_up as "up and running".
    if [ $? -eq 11 ]; then
        exit 0
    else
        exit 1
    fi
1.3) watchdog startup
This watchdog checks every 5 minutes if the event daemon is up and
bites every 30 seconds if the event daemon start was not successful.
Check states are logged to the logfile approximately once every 30
minutes.
    watchdog -n event_daemon \
        -c /etc/cmcluster/asys_sv1_prod/eventor.check \
        -b /etc/cmcluster/asys_sv1_prod/eventor.bite \
        -i 300,30 \
        -r 1800
SEE ALSO
edrcintro(1), osid(3), sh(1), watchdog.cfg(4)
NOTES
-
BUGS
-
AUTHOR
watchdog was developed by Christian Walther. Send suggestions
and bug reports to wa2l@users.sourceforge.net.
COPYRIGHT
Copyright © 2008
Christian Walther
This is free software; see
edrc/doc/COPYING
for copying conditions. There is ABSOLUTELY NO WARRANTY; not
even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.