Gentoo Wiki

HOWTO_Watchdog_Timer


This article is part of the HOWTO series.


Introduction

A watchdog automatically reboots a computer when something goes wrong: for example, the kernel hangs or a runaway program consumes 100% of the CPU. The mechanism is a device, /dev/watchdog, to which your computer must write at least once a minute. If your computer fails to do so, it is rebooted.

The main difference between hardware and software watchdogs is what backs /dev/watchdog. If the device is backed by a piece of hardware, the timer runs independently of your kernel and the other software on your computer. If you only set up a software watchdog, the timer is implemented by your kernel, and if the kernel locks up, it can no longer reboot your computer.

The program that writes to the device is called watchdog (sys-apps/watchdog). It can also monitor other parts of your system, such as whether chosen processes are running or whether your network interface is still receiving data.
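
As a rough illustration of the keep-alive mechanism described above, the following shell function does by hand what the watchdog program does for you: it opens the device once, writes to it at a fixed interval, and closes it again. The function name pet_watchdog and its arguments are made up for this sketch. Writing the character V just before closing asks the driver to stop the timer (the kernel's "magic close" convention); whether a given driver honours that is an assumption you should verify, not a guarantee.

```shell
# pet_watchdog DEVICE COUNT INTERVAL
# Hypothetical helper: writes COUNT keep-alives to DEVICE, INTERVAL seconds apart.
pet_watchdog() {
    dev=$1
    count=$2
    interval=$3
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.' >&3      # any write to the device resets the timer
            sleep "$interval"
            i=$((i + 1))
        done
        printf 'V' >&3          # magic close: ask the driver to stop the timer
    } 3>"$dev"                  # the device stays open for the whole loop
}
```

On a test machine, pet_watchdog /dev/watchdog 6 10 would keep the timer fed for about a minute. If the driver ignores the V, the machine reboots once the timer expires after the function returns, so do not try this on a system you care about.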

Set up a software or hardware watchdog device

Warning: This HOWTO may contain many errors and is mainly based on other people's experience. Make sure you understand what you are doing before you change your system.
Note: F28: The problems listed in this article that involve false positives caused by incorrect handling of pings and network interface statistics appear to have been resolved as of watchdog-5.3.1_p2.ebuild.


Tools/Skills needed


Configure the kernel (2.6 series)

First, find out what kind of watchdog card you have; your kernel's configuration menu lists the supported cards. If you don't have a hardware watchdog card, select Software watchdog.

Linux Kernel Configuration: Watchdog
Device Drivers ->
  Character Devices ->
    Watchdog Cards ->
      [*] Watchdog Timer Support
      [ ]   Disable watchdog shutdown on close 
      ---   Watchdog Device Drivers
      <*> Your Watchdog card or chip


To begin with, do not enable Disable watchdog shutdown on close. With that option set, you cannot stop your software watchdog once it has been started: shutting it down results in the kernel rebooting your computer.

Once you have finished the configuration (for hardware watchdogs, see also below), compile your kernel and reboot.
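
Before rebuilding, it can help to confirm what the menu choices above turned into. The sketch below greps a kernel configuration file for three symbols that, to my knowledge, correspond to the menu entries shown: CONFIG_WATCHDOG (Watchdog Timer Support), CONFIG_SOFT_WATCHDOG (Software watchdog) and CONFIG_WATCHDOG_NOWAYOUT (Disable watchdog shutdown on close). The function name is hypothetical, and the symbol for a hardware card depends on your driver, so it is not checked here.

```shell
# check_watchdog_config CONFIG_FILE
# Hypothetical helper: reports whether each watchdog symbol is built in (y) or a module (m).
check_watchdog_config() {
    cfg=$1
    for sym in CONFIG_WATCHDOG CONFIG_SOFT_WATCHDOG CONFIG_WATCHDOG_NOWAYOUT; do
        if grep -q "^${sym}=[ym]" "$cfg"; then
            echo "$sym: enabled"
        else
            echo "$sym: not set"
        fi
    done
}
```

A typical call on Gentoo would be check_watchdog_config /usr/src/linux/.config, assuming your kernel sources live there. For a first test you would expect CONFIG_WATCHDOG_NOWAYOUT to be reported as not set.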


Additional kernel configuration for hardware watchdogs

If you have a watchdog add-on card, you may need PCI or ISA support enabled in your kernel. If your watchdog is part of your motherboard, you need to enable support for controlling it: for example I2C support, SMBus, or perhaps just support for a chipset. You may also need to turn on your SMBus controller in your BIOS.

This HOWTO assumes that you already have this set up, or that you know how. You can find out what you need to do by starting at the websites of your motherboard and/or watchdog manufacturers. You may also need to emerge companion programs, for example, the program i2c which can be used to access on-board chips. Generally, companion programs are not needed to get watchdog functionality.

Check for the watchdog device entry

If your kernel configuration is correct and the watchdog card has been recognized by the kernel (or is simulated by a software watchdog), you should have a watchdog device node (/dev/watchdog):

# ls -l /dev/wa*
crw-rw---- 1 root root 10, 130 /dev/watchdog

Normally this device is created automatically, but if you're using an older device filesystem (devfs, ...) you may need to create the watchdog device manually:


# mknod -m 660 /dev/watchdog c 10 130
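
To double-check that a node really is the watchdog device, you can compare its major and minor numbers with the 10, 130 shown above. This is a minimal sketch that assumes GNU coreutils stat, whose %t and %T formats print the numbers in hexadecimal; the function name device_numbers is made up for the example.

```shell
# device_numbers NODE
# Hypothetical helper: prints "major,minor" in decimal for a character device.
device_numbers() {
    node=$1
    if [ ! -c "$node" ]; then
        echo "$node: not a character device" >&2
        return 1
    fi
    major=$(stat -c '%t' "$node")   # hexadecimal major number (GNU stat)
    minor=$(stat -c '%T' "$node")   # hexadecimal minor number (GNU stat)
    printf '%d,%d\n' "0x$major" "0x$minor"
}
```

For a correctly created node, device_numbers /dev/watchdog should print 10,130.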

Install and configure control software

emerge sys-apps/watchdog

This program writes to the watchdog device to tell it that everything is OK. There may be other tools that do the same, but this HOWTO uses sys-apps/watchdog. It is essential for both software and hardware watchdogs!

# emerge watchdog


edit /etc/watchdog.conf

The default watchdog configuration doesn't monitor your system. If you just want your computer to be rebooted when the system locks up and watchdog can no longer write to /dev/watchdog, this is perfectly fine and you don't need to change the configuration file.

If you want to monitor other system functionality, watchdog can do it, but be aware that there are many bugs that may cause false alarms and reboots. You may want to set up a repair binary for some of these options, because otherwise watchdog resolves every issue by rebooting the system (which is not always the best solution). Also use --no-action and -v as startup options to test everything (see below).


watchdog-device specifies the device name of your watchdog. This should always be /dev/watchdog. If you don't specify the device, your software or hardware watchdog is not activated and cannot reboot your computer. The default is NULL, so you need to set this explicitly.

watchdog-device         = /dev/watchdog


pidfile monitors program pidfiles: watchdog checks whether the corresponding process is still running. For example:

pidfile                 = /var/run/metalog.pid
pidfile                 = /var/run/apache2.pid
pidfile                 = /var/run/authdaemon.pid
pidfile                 = /var/run/imapd.pid
pidfile                 = /var/run/sshd.pid
pidfile                 = /var/run/svscan.pid
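
Before listing a pidfile here, it is worth checking by hand that it exists and points at a live process, since a stale entry would count as a failure. The following sketch shows one way to do that check in the shell; it only illustrates the idea and is not how watchdog implements it. The function name pidfile_alive is made up.

```shell
# pidfile_alive PIDFILE
# Hypothetical helper: succeeds if PIDFILE names a process that currently exists.
pidfile_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(head -n 1 "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 1 ;;    # empty or not a plain number
    esac
    kill -0 "$pid" 2>/dev/null      # signal 0 only tests for existence
}
```

Run it as root, for example pidfile_alive /var/run/sshd.pid && echo running, because kill -0 also fails for processes you are not allowed to signal.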


interface checks whether there was traffic on the interface between two watchdog intervals. If there was none, the watchdog software assumes that the network is unreachable and calls a repair binary (or just reboots the computer). ATTENTION: This function is broken and cannot handle interfaces that have had more than 2.1 GB of traffic (see the Known Bugs and Patches section below).

interface               = eth0
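
To get a feeling for what "traffic between two intervals" means, you can sample the receive counter of an interface yourself. The sketch below reads the counter from a file in /proc/net/dev format and compares two samples; both function names are hypothetical, and this illustrates the idea rather than watchdog's own code. The samples are compared as strings, which avoids any integer-size limit in the shell.

```shell
# rx_bytes IFACE [FILE]
# Hypothetical helper: prints the received-bytes counter of IFACE.
rx_bytes() {
    awk -v iface="$1" '
        { sub(/:/, " ") }                    # accept "eth0:123" and "eth0: 123"
        $1 == iface { print $2; found = 1 }
        END { exit !found }                  # fail if the interface is not listed
    ' "${2:-/proc/net/dev}"
}

# had_traffic BEFORE AFTER
# Succeeds if the counter changed between the two samples.
had_traffic() {
    [ "$1" != "$2" ]
}
```

For example, take before=$(rx_bytes eth0), wait ten seconds, take after=$(rx_bytes eth0), and then call had_traffic "$before" "$after". On a quiet interface this check fails, which is exactly the situation in which watchdog would act.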


min-memory checks whether enough memory is available. The value is measured in pages, not bytes.

min-memory              = 1
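
Because the value is in pages, a threshold you think of in megabytes has to be converted first. A minimal sketch, assuming the page size reported by getconf PAGESIZE is the one that applies; the function name mb_to_pages is made up:

```shell
# mb_to_pages MEGABYTES [PAGE_SIZE_BYTES]
# Hypothetical helper: converts megabytes to memory pages.
mb_to_pages() {
    mb=$1
    page_size=${2:-$(getconf PAGESIZE)}    # usually 4096 on x86
    echo $(( mb * 1024 * 1024 / page_size ))
}
```

With 4096-byte pages, mb_to_pages 16 prints 4096, so min-memory = 4096 would correspond to 16 MB.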


max-load-1, max-load-5 and max-load-15 monitor the 1-, 5- and 15-minute load averages. If the load is too high, watchdog reboots your computer.

max-load-1              = 24
max-load-5              = 18
max-load-15             = 12
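
To judge whether thresholds like these are sensible for your machine, you can compare them with the current load averages in /proc/loadavg. The function below is a hypothetical helper for that purpose, and it assumes that exceeding any one of the three limits counts as too high.

```shell
# load_too_high MAX1 MAX5 MAX15 [FILE]
# Hypothetical helper: succeeds if any load average exceeds its limit.
load_too_high() {
    awk -v m1="$1" -v m5="$2" -v m15="$3" '
        NR == 1 { high = ($1 + 0 > m1 + 0 || $2 + 0 > m5 + 0 || $3 + 0 > m15 + 0) }
        END { exit !high }
    ' "${4:-/proc/loadavg}"
}
```

Calling load_too_high 24 18 12 && echo "would reboot" during your normal peak load shows whether the limits above leave enough headroom.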


ping pings a host and assumes the network is unreachable if the host doesn't reply. ATTENTION: The implementation of ping is broken and you will get a lot of false alarms (see the Known Bugs and Patches section below). For now it is better not to use ping at all!

ping                   = 172.26.1.255


file monitors a file for changes. If, for example, a process is supposed to write to a logfile and stops doing so, you can let watchdog repair this or reboot the system. Use change to specify how often the file has to change. The value is counted in watchdog intervals (normally 10 seconds each).

file                   = /var/log/everything/syslog
change                 = 20
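
After editing the file, it is easy to lose track of which checks are actually switched on. The following sketch prints only the active settings of a configuration file in the key = value format shown above, skipping blank lines and lines that start with #. The function name active_settings is made up, and the default path is the /etc/watchdog.conf used in this HOWTO.

```shell
# active_settings [CONF_FILE]
# Hypothetical helper: lists the uncommented key = value lines of a watchdog.conf.
active_settings() {
    awk '
        /^[ \t]*#/ { next }                  # skip comment lines
        {
            i = index($0, "=")
            if (i == 0) next                 # skip blank and malformed lines
            key = substr($0, 1, i - 1)
            val = substr($0, i + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key) # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key != "") print key " = " val
        }
    ' "${1:-/etc/watchdog.conf}"
}
```

If the output contains interface or ping lines, remember the warnings above before you start watchdog without --no-action.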

edit startup options in /etc/conf.d/watchdog

Now you can test your watchdog configuration. If you use watchdog to monitor system functionality (traffic on interfaces, processes, ...), you should start with:


WATCHDOG_OPTS="-v --no-action"


Using -v activates verbose output to your syslog. As you will notice, this is not very useful for long-term use because you get a lot of syslog messages (watchdog writes its status every 10 seconds by default). In the beginning you should also use --no-action, as some of the watchdog monitoring functions are broken and trigger false alarms. You don't want your computer to be rebooted without a reason (e.g. because you had more than 2.1 GB of traffic on eth0).

For long-term testing, add -f and choose a higher value for interval in watchdog.conf (see above). This lets you extend the interval to e.g. 300 seconds, which means far fewer messages in your syslog. But NEVER use -f without --no-action, because otherwise your watchdog will reboot your computer after 60 seconds. (By the way: don't expect logtick to work. It's broken.)


If you're using metalog you can monitor your logfiles using

# tail -f /var/log/everything/current | grep watchdog


Once you see that the basic system monitoring works as you expect, you can and should remove all startup options again. I recommend also removing -v, because the logtick option (see above) is broken and you don't want to see the watchdog status in your logfiles every 10 seconds. As soon as logtick is fixed (e.g. you apply the unofficial patch), -v will be fine.

WATCHDOG_OPTS=""


Please note that as long as you use --no-action, you are not really testing your watchdog device. With --no-action, watchdog doesn't open /dev/watchdog and doesn't write to it at all.

Set up a repair binary

Watchdog ships with some repair binaries, mainly shell scripts. You can modify them to handle some issues without rebooting the system. You need to tell watchdog which repair binary to use; in watchdog's own configuration file, /etc/watchdog.conf, this is the repair-binary option.
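
The core of such a script can be sketched as follows. As far as I understand the watchdog documentation, the repair binary is called with the error code as its first argument and should exit with 0 if the problem was fixed, while any other exit status lets watchdog go ahead with the reboot; please check this against the man page of your version. The sketch only handles the two Linux error codes for network trouble, 100 (ENETDOWN) and 101 (ENETUNREACH). The RESTART_NET variable and its default command are assumptions made for this example.

```shell
# Command used to bring the network back; an assumption for this sketch.
RESTART_NET=${RESTART_NET:-"/etc/init.d/net.eth0 restart"}

# repair ERROR_CODE
# Hypothetical repair logic: returns 0 if the problem was fixed, non-zero otherwise.
repair() {
    case "$1" in
        100|101)                # ENETDOWN / ENETUNREACH on Linux
            $RESTART_NET        # its exit status becomes the result
            ;;
        *)
            return 1            # anything else: let watchdog reboot
            ;;
    esac
}
```

A standalone repair script would define this function and end with repair "$1" as its last command, so that the function's result becomes the script's exit status.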

You can skip this part for now if you want and set it up later.

Start Watchdog

Run watchdog using startup script

The recommended way to manually start watchdog is

# /etc/init.d/watchdog start

Make sure you have configured it correctly, and don't add it to your boot runlevel before you know that everything works!


Add it to your boot runlevel

Once you have tested your installation for a while and are sure that no false alarms will trigger a reboot, you may add it to your startup scripts.

# rc-update add watchdog boot


Known Bugs and Patches

Comments, my experience with this procedure

On my machine, a dual Opteron, I was not able to get the program to work correctly when added to the boot runlevel. I ended up running it in /etc/conf.d/local.start. Not the best, but it works for me now. I note that the program didn't seem to log at the correct intervals at first. Now, however, it does. I don't know why. Maybe it is just not happy in the amd64 arch. I note that if I set it to ping my default gateway, it sometimes reboots, for no reason. I don't know why it couldn't get a ping thru. I have since guessed that my provider's gateway may not have responded to a ping, although I can't imagine why.

I also have the same problem with my dual opteron.


Retrieved from "http://www.gentoo-wiki.info/Watchdog"

Last modified: Fri, 05 Sep 2008 08:21:00 +0000