.. default-role:: literal


Svmonitor
=========

.. contents:: Contents
   :local:

What is svmonitor ?
-------------------

`Svmonitor` is simple helper that lets you run some tests on your systems and
set up actions if they fail. 

For example, you could use it to check that some of your important programs are
still running, or to regularly check that you haven't run out of disk space...

Download
--------

Svmonitor is available both as a Pacman_ package (x86_64 only), or in source
form. For source or Pacman installation, please refer to the
`Generic download and install instructions`_.

* `Latest source tarball`_
* `PGP signature`_

.. _Pacman: http://www.archlinux.org/pacman/
.. _Generic download and install instructions: ../download-install-doc_en.html
.. _Latest source tarball: ../../public-repo.svasey.org/src/svmonitor.tar.gz
.. _PGP signature: ../../public-repo.svasey.org/src/svmonitor.tar.gz.sig


The big picture
---------------

`Svmonitor` runs *tests*. Tests are just commands that can fail or succeed. Upon
failure or success, the program takes *actions*, like sending email or shutting
down your server.

Tests and actions are setup in a global configuration file at
`/etc/svmonitor-config/svmonitor.conf`. Each test and action gets referred to
using a simple name.

Suppose you have set up a test named `check-disk-usage` that checks if you have
enough disk space. Then to run it you would just run::

    $ runtest check-disk-usage

This will take care of running the actual commands and taking the configured
actions in case of success or failure. 

Actions can be configured to run everytime the test fail, or e.g only once after
the first failure.

Note that `svmonitor` does *not* schedule the tests: this can be done using
another tool such as fcron_.

.. _fcron: http://fcron.free.fr/

The configuration file
----------------------

The configuration file is in `/etc/svmonitor-config/svmonitor.conf`. It
describes two things: the *tests* and the *actions* mapped to those tests.

The file's syntax is very simple (INI_ like). Sections, whose names are between
brackets, contain variables, each of which is given line by line using the
`varname = value` format. Any line beginning with a `#` is considered a comment
and isn't parsed.

.. _INI: http://en.wikipedia.org/wiki/INI_file

An example configuration
########################

Here is an example file::

    [action_mail]

    action_command = testaction-sendmail root@localhost

    [test_fail]

    test_command = false
    failure_actions = mail

    [test_fail-once]

    test_command = false
    failure_actions_once = mail

    [test_fail-once-2]

    test_command = cat /home/sebastien/tmp/svmonitor.test
    failure_actions_once = mail

    [test_fail-only-once]

    test_command = false
    failure_actions_once_only = mail

    [test_success]

    test_command = true
    success_actions = mail

This file defines five tests, named `fail`, `fail-once`, `fail-once-2`,
`fail-only-once` and `success`, and one action, named `mail`.

Formally, the file can describe any number of actions and tests, each of which
in its own section. 

Actions
#######

Each action should be in its own section. If the action's name is `$name`, it
should be in a section named `action_$name`. 

An action can be seen as a command taking three arguments: the error code
returned by the test, the test's name, and a path to a file providing more
details on the test's result. This command is specified by the `action_command`
variable. 

The `testaction-sendmail` command from the example is provided by the
`svmonitor` package for convenience. Such actions are documented in
`Implemented actions`_.

Tests
#####

Each test should be in its own section. If the test's name is `$name`, it should
be in a section named `test_$name`.

Each test is described by seven variables:

* `test_command`: The command to run to execute the test. If the command returns
  0, the test is considered as successfull, otherwise as failed. This variable
  *must* be given for each test.
* `failure_actions`: Names of actions to take in case of failure, separated by
  spaces. The actions will be executed in the order in which they are given. If
  this variable is not given, nothing will be done.
* `failure_actions_once`: Actions to take in case of failure *just after a
  success*. Those actions will be run only when the test failed, and succeeded
  the last time. For example, assume your test pings your server and you run the
  test every 5 minutes. You only want a mail the first time the server is
  noticed to be down, not every 5 minutes afterward !
* `failure_actions_once_only`: This is even more restrictive than
  `failure_actions_once`: those actions are run the first time the test fails,
  and *are never run again*, except after you use a special `--reset` option.
* `success_actions`, `success_actions_once`, `success_actions_once_only`: Like
  `failure_actions`, but in case the test succeeeds.

Usage
-----

To run a test, simply call `runtest $testname`, where `$testname` is the name of
your test. Two other modes of operation are possible:

* `runtest --info [$testname]`: Print information on the last result of the test
  and whether it is "reset", i.e whether the `once_only` actions will run. If no
  test name is given, this is done for all tests.
* `runtest --reset [$testname]`: Reset the test, so that `once_only` action can
  run again. If no test name is given, this is done for all tests.

The test directory
------------------

`svmonitor` must keep track of e.g the previous results of the tests it
ran. Those results are stored in `/var/lib/svmonitor`. This directory should be
owned by `root:svmonitor`, and its permission bits should be `1770`. Users that
want to use `svmonitor` must be members of the `svmonitor` group.

Implemented actions
-------------------

For convenience, some common actions commands are implemented in the `svmonitor`
package. They are documented below.

testaction-sendmail
###################

This sends a mail to the email addresses given as first argument,
comma-separated. 

Implemented tests
-----------------

For convenience, some common test commands are implemented in the `svmonitor`
package. They are documented below.

diskspace-check
###############

This can be used to check that enough disk space is available. The command exits
with 0 status if and only if all partitions have enough disk space.

The definition of enough disk space must be provided by the user. Three methods
can be used:

Minimum percentage
******************

This is the simplest one: `diskspace-check` exits with an error if the
percentage of available disk space (with respect to the total space) is lower
than some threshold. To use that method, just do::

    $ diskspace-check $minpercent

for example, for a threshold of 5%, you will run `diskspace-check 5`.

Minimum percentage with minimum size
************************************

This is less restrictive than "pure" minimum percentage: the program will exit
with an error if the minimum percentage criteria is satisfied *and* the
available space is lower than some threshold. This is useful when working with
large drives, e.g assume you have a 1TB drive: if it is 95% full, there is still
50GB available, which is a lot.

To use that method, run::

    $ diskspace-check $minpercent $minsize

The size is assumed to be in bytes by default, but you can use all the standard
multipliers, e.g 1KB is 1000 *bytes*, 1Kb is 1000 *bits* and 1KiB is 1024 bytes.

Custom method
*************

You can also specify your own program to do the tests: the program should take
as argument the used space and total space, in *bits* and return with 0 status
if and only if enough disk space is available. To use that method, run::

    $ diskspace-check $progpath

Where `$progpath` is the path to your program.

Options
*******

Some partitions can of course be excluded from the tests. There are two useful
options to do that:

* `--exclude-dev=regexp`: if the device path (first column in `df` output)
  matches the given `regular expression`_, the device is excluded from the
  tests. Note that invalid device paths (like "none") are also excluded by
  default.
* `--exclude-mount=regexp`: if the device mountpoint (last column in `df`
  output) matches the given regular expression, the device is excluded from the
  tests. 

.. _regular expression: http://docs.python.org/library/re.html#regular-expression-syntax

webpage-check
#############

This can be used to check that a given webpage is responding correctly, and
hasn't changed. The script uses `wget`_ for downloading the page. It takes two
arguments:

* The URL to download
* A local version of the content: if the downloaded file is not exactly the same
  as the local version, this is an error. This argument is optionnal.

In addition, one option, `--wget-opts`, can be used to specify additionnal
space-separated options to pass to `wget`.

.. _wget: http://www.gnu.org/software/wget/

daemon-check
############

This is used to check that some programs are running on your system. The script
takes as arguments the name of the programs that should be running, and compare
each of them to the last column of a `ps -ef` output. If the two begin the same
way, the program is considered as running, otherwise the test exits with an
error.