watchdog

Package: WA2L/edrc 1.5.57
Section: General Commands (1)
Updated: 09 April 2008

 

NAME

watchdog - a watchdog to check a condition and react on it

 

SYNOPSIS

edrc/bin/watchdog [ -h ]

watchdog -n name -c check_script -b bite_script [ -i c_int [, b_int ] ] [ -r l_int ]

watchdog -s name

watchdog -l | -L { [ start ] | stop }

 

AVAILABILITY

WA2L/edrc

 

DESCRIPTION

The intention of the watchdog command is to easily monitor a certain condition or service and react if the condition is not OK.

The states and terminology of the watchdog are adapted from the behavior of a real dog, which checks, waits (between checks), bites (if something is not OK) and breathes (to recover between bites).

The different possible states of all running watchdogs are logged to a logfile.

The check script must return an exit code of 0 when the check succeeds and a nonzero exit code when it fails. The interval between check script starts (= check interval) can be specified with the -i option.

The bite script (= reaction to an unsuccessful check) must return an exit code of 0 when the bite succeeds. An unsuccessful bite should return a nonzero exit code so that the bite script is restarted directly, without another check in between. The interval between bite script restarts (= bite interval) can also be specified with the -i option.
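The exit code contract described above can be illustrated with a minimal sketch of a check helper. The function name check_pidfile and the idea of probing a pidfile are assumptions made for illustration only; they are not part of WA2L/edrc.

```shell
#!/bin/sh
# Hypothetical check helper (not part of WA2L/edrc).
# Returns 0 if the process whose PID is stored in the given pidfile
# is alive, and a nonzero code otherwise.
check_pidfile() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1      # missing pidfile counts as "not OK"
    pid=$(cat "$pidfile")
    [ -n "$pid" ] || return 1          # empty pidfile counts as "not OK"
    # kill -0 sends no signal; it only tests whether the process exists
    # and may be signalled by the calling user.
    kill -0 "$pid" 2>/dev/null
}

# A check script built on this helper would end with, for example:
#   check_pidfile /path/to/service.pid
#   exit $?
```

A bite script follows the same convention: exit 0 if the reaction restored the service, nonzero otherwise.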

The bite script and the check script must exist on startup of a watchdog.

If the bite or check script disappears during operation, the number of retries is determined by the RETRIES_ON_MISSING_SCRIPT setting in the configuration file watchdog.cfg.

The watchdog has the following states which are also recorded to the logfile:

check
This is also the first state after the start of a watchdog. Execute the check script. If the check script returns an exit code of 0 goto state "wait", else goto state "bite".

To keep the logfile from filling up with consecutive "check" state records, the "check" state is logged at a longer interval. The default is one hour; use the -r option to change it.

bite
Execute the bite script which hopefully can restart the service or recreate the proper condition. If the bite script returns an exit code of 0 goto state "check", else goto state "breath".

wait
Wait for the number of seconds specified as the check interval, then goto state "check".

breath
Wait for the number of seconds specified as the bite interval, then goto state "bite".

stop
The watchdog has been stopped with the -s option.

abort
The watchdog has been killed with the kill watchdog_process_id command (not recommended).
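The transitions between the states "check", "wait", "bite" and "breath" can be sketched as a small loop. This is an illustration of the rules listed above, not the actual watchdog source: the function name run_states is hypothetical, the sleeps are omitted, and the "stop" and "abort" states are left out.

```shell
#!/bin/sh
# Sketch of the state transitions only; NOT the watchdog implementation.
# Usage: run_states CHECK_CMD BITE_CMD STEPS
# Prints the states visited during the first STEPS steps.
run_states() {
    check_cmd=$1
    bite_cmd=$2
    steps=$3
    state=check                    # first state after startup
    trace=""
    while [ "$steps" -gt 0 ]; do
        trace="$trace $state"
        case $state in
            check)  if $check_cmd; then state=wait;  else state=bite;   fi ;;
            bite)   if $bite_cmd;  then state=check; else state=breath; fi ;;
            wait)   state=check ;; # a real watchdog sleeps for the check interval here
            breath) state=bite ;;  # a real watchdog sleeps for the bite interval here
        esac
        steps=$((steps - 1))
    done
    echo "${trace# }"
}
```

For example, run_states false false 5 prints "check bite breath bite breath": a failing check leads to a bite, and a failing bite alternates with "breath" without returning to "check".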

 

OPTIONS

-h
usage message.

-n name
name of the watchdog.

-c check_script
script or single command that performs a check. If this script returns a nonzero exit code, the bite script specified with the -b option is started.

-b bite_script
script or single command that performs the reaction (bite) if the check script specified with the -c option returned a nonzero exit code.

-i c_int [, b_int ]

c_int
check interval in seconds. Every c_int seconds the script specified in option -c is executed. If -i is not specified the default check interval of 500 seconds applies.

b_int
bite interval in seconds. This is the interval between executions of the bite script while it keeps returning a nonzero exit code, that is, between unsuccessful reactions (bites) of the script specified with the -b option. If b_int is not specified, the default bite interval of 20 seconds applies.

-r l_int
minimum interval in seconds between log entries for the check state. If this option is not specified, the default of 3600 seconds applies.
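How c_int and l_int interact can be approximated with a small simulation. The sketch below assumes that the first check is always logged and that a later check is logged only when at least l_int seconds have passed since the last logged one; the real watchdog may differ in detail. The function name count_check_logs is hypothetical.

```shell
#!/bin/sh
# Simulation under stated assumptions; NOT the watchdog implementation.
# Usage: count_check_logs C_INT L_INT DURATION
# Prints how many "check" log entries would be written within DURATION
# seconds if a check runs every C_INT seconds and entries are at least
# L_INT seconds apart.
count_check_logs() {
    c_int=$1
    l_int=$2
    duration=$3
    t=0
    last=$((0 - l_int))            # assumption: the first check is logged
    count=0
    while [ "$t" -lt "$duration" ]; do
        if [ $((t - last)) -ge "$l_int" ]; then
            count=$((count + 1))
            last=$t
        fi
        t=$((t + c_int))
    done
    echo "$count"
}
```

Because entries are only written when a check actually runs, the effective logging interval is l_int rounded up to a multiple of c_int. With the defaults (c_int 500, l_int 3600), the second entry in this model appears after 4000 seconds rather than 3600.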

-s name
Stop the running watchdog with the name name. Users can only stop watchdogs they started themselves. Be aware that stopping a watchdog can be delayed by the duration of a running check or bite script. Do not kill the watchdog with the kill command.

-l
list all running watchdogs.

-L { [ start ] | stop }
list start/stop commands for all running watchdogs. This enables you to easily restart all currently running watchdogs with the related command options.

 

ENVIRONMENT

-

 

EXIT STATUS

0
no error.

1
the configuration file edrc/etc/watchdog.cfg does not exist.

2
operating system not supported, see osid(3)

3
cannot write to lock directory. All users must be able to write to the lockdir.

4
usage displayed.

5
a watchdog with the same name as specified in the -n option is already running.

6
the specified check script does not exist during watchdog startup.

7
the specified bite script does not exist during watchdog startup.

8
cannot write to logfile. All users must be able to write to the logfile.

9
unsuccessful attempt to stop a watchdog of another user.

10
no watchdog is running while listing the started watchdogs with the -l or -L option.

11
temporary directory could not be claimed or created in /var/tmp. If you get this error, check the system temporary directory /var/tmp; this error is an indicator of system intrusion.
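A calling script can translate these exit codes into short messages. The helper below is a sketch; the function name describe_watchdog_status is hypothetical and the message texts are abbreviated from the list above.

```shell
#!/bin/sh
# Hypothetical helper (not part of WA2L/edrc): map a watchdog exit
# status to a short description taken from the EXIT STATUS list.
describe_watchdog_status() {
    case $1 in
        0)  echo "no error" ;;
        1)  echo "configuration file missing" ;;
        2)  echo "operating system not supported" ;;
        3)  echo "cannot write to lock directory" ;;
        4)  echo "usage displayed" ;;
        5)  echo "watchdog name already running" ;;
        6)  echo "check script missing at startup" ;;
        7)  echo "bite script missing at startup" ;;
        8)  echo "cannot write to logfile" ;;
        9)  echo "cannot stop watchdog of another user" ;;
        10) echo "no watchdog running" ;;
        11) echo "temporary directory problem in /var/tmp" ;;
        *)  echo "unknown exit status $1" ;;
    esac
}

# Possible use after a watchdog call, for example:
#   watchdog -s event_daemon
#   describe_watchdog_status $?
```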

 

FILES

edrc/etc/watchdog.cfg
configuration file of watchdog. See watchdog.cfg(4) for more information.

edrc/var/watchdog
default lockdir. This directory holds the lockfiles of watchdog. Do not edit them by hand. The lockdir can be configured in watchdog.cfg, see watchdog.cfg(4) for more information.

 

EXAMPLES

1) using watchdog to guard the AutoSYS event daemon

The AutoSYS event daemon is a program that is connected to a database and initiates job starts on remote systems. In some versions of AutoSYS the daemon tends to shut down after a startup. To ensure the permanent availability of the AutoSYS job scheduler the event daemon is guarded by a watchdog.

1.1) check script

The check script uses the AutoSYS command "chk_auto_up" to verify that the event daemon is up and running. The script treats an exit code of 11 from chk_auto_up as "up" and returns 0 in that case; otherwise it returns 1.

        #!/bin/ksh
        #
        # eventor.check - check the event_daemon is running
        #
        # [00] 13.08.2004 CWa   Initial Version
        #
        # Run chk_auto_up as the AutoSYS user; this script treats
        # exit code 11 as "event daemon is up".
        /bin/su - sys_asys -c "chk_auto_up"
        if [ $? -eq 11 ]; then
                exit 0
        else
                exit 1
        fi

1.2) bite script

The bite script uses the AutoSYS command "eventor -q" to start the event daemon and the command "chk_auto_up" to verify if the start of the event daemon was successful. If the start was successful the script returns 0.

        #!/bin/ksh
        #
        # eventor.bite - restart event_daemon if it is not available
        #
        # [00] 13.08.2004 CWa   Initial Version
        #

        /bin/su - sys_asys -c "eventor -q"
        sleep 5

        # Verify the restart; exit code 11 is treated as "event daemon is up".
        /bin/su - sys_asys -c "chk_auto_up"
        if [ $? -eq 11 ]; then
                exit 0
        else
                exit 1
        fi

1.3) watchdog startup

This watchdog checks every 5 minutes if the event daemon is up and bites every 30 seconds if the event daemon start was not successful. Check states are logged to the logfile once approximately every 30 minutes.

        watchdog -n event_daemon \
                 -c /etc/cmcluster/asys_sv1_prod/eventor.check \
                 -b /etc/cmcluster/asys_sv1_prod/eventor.bite \
                 -i 300,30 \
                 -r 1800

 

SEE ALSO

edrcintro(1), osid(3), sh(1), watchdog.cfg(4)

 

NOTES

-

 

BUGS

-

 

AUTHOR

watchdog was developed by Christian Walther. Send suggestions and bug reports to wa2l@users.sourceforge.net.

 

COPYRIGHT

Copyright © 2008 Christian Walther

This is free software; see edrc/doc/COPYING for copying conditions. There is ABSOLUTELY NO WARRANTY; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


 

