State

A Steamy Mystery

Jacob Lewallen

1.00.00


Table of Contents

1. Introduction
What is State?
2. Installation
Overview
Monitoring Server
State
PostgreSQL
Statistics Server
3. Configuration
Basics/Definitions
Nodes
Tasks
Vitals
Queries
Notifications
Walkthrough
Global/Main
Modules
Nodes
Tasks
Notifications
Disk
RRD
Soft States
4. Source
Outline
Apache Portable Runtime
Memory Pools
Errors
Object Relational System
Peers
SQL Backend
5. Modules
Introduction
mod_clean
mod_disk
Configuration
mod_dynamic
mod_exec
mod_flap
mod_httpd
mod_linux
mod_lua
mod_mail
mod_notifs
mod_ping
mod_response
mod_rrd
Configuration
Reporting
mod_snmp
mod_soft
mod_solaris
mod_services
mod_tasks
mod_wkitems
mod_xml
6. Using State
Statuses
7. Frequently Asked Questions
FAQs
General
Installation
Configuring
Running
8. State Internals
Flow Traces

List of Examples

8.1. Flow: mod_tasks
8.2. Flow: mod_rrd
8.3. Flow: mod_lua
8.4. Flow: mod_xml

Chapter 1. Introduction

Table of Contents

What is State?

So there you are. Everything has been running just fine all day. You're sort of nervous too, because the phone hasn't rung in six hours. The clock strikes five, so you grab your stuff and head for the door, already dreaming of the cold beer that's waiting for you. You swing open your office door and there are three or four people standing around outside. All at once, three of them make it desperately clear that they've been unable to check their mail for hours. Yes, something's broken, and there will be no prime rib for you.

System administration is very complex. A network has many individual pieces, ranging from printers to expensive routers, and many of those systems rely on others to operate. Part of the art of administering a large network is being aware of what's going on and why. If something's broken now, then you need to know now. Few things are as satisfying as knowing why the phone is ringing before you pick it up. On the other hand, it's frustrating to learn that something's been wrong for an hour when you've been playing video games since noon.

What is State?

State is a tool for monitoring the status of a network's connectivity, services, resources, and other general health-related properties. You, as the System Administrator, tell State all you'd like about your network and its internals, and State will keep an eye on things so you can descend deeper into laziness and maintain a higher BAC during business hours. Here are some features of State, just in case the boss asks and so I can justify my many hours of work:

  • Modular design, much like Apache. All functionality is implemented in extension modules that are loaded into the main State application. You can add and remove functionality by adding and removing these modules. This also allows for easier maintenance, security, debugging, and extension.

  • State uses standard protocols such as HTTP and storage methods such as SQL databases and XML. Yeah, who cares, I know. This is nice though, trust me. It means everything is easy to read and understand. There is no learning curve for new protocols or formats; you can just jump on in.

  • A large number of already written modules for checking many standard network health properties. State can check all of the most popular services out of the box, as well as disk space and other non-simple statistics.

  • Easily configurable alarms and warnings so that administrators are informed when the health of their network changes. Notifications can be dispatched over e-mail at configurable intervals. They may also be escalated to other notification types if they go unresolved for a certain period of time.

  • Statistics gathering and plotting through integration with the infamous rrdtool. Any number that comes in through State's monitoring of the network may be placed into an RRD database and examined later.

  • A flexible API and layer upon which to build and extend the monitoring system far into the future.

  • New checks and statistics gathering scripts can be written in your language of choice and executed through State, just like simple CGI for the web.

  • Really slick graphing of nearly anything State is observing or handling as it passes through the system. Graphing uses RRDtool as its backend. Anyone can define which State values are stored in which databases using a flexible database template definition.

State borrows many concepts from other, similar solutions such as Big Brother or Nagios. Which solution you choose is up to you. State's goal is to be the best piece of software State can be.

Chapter 2. Installation

Overview

State is a large piece of software, so there are a few dependencies that should be installed before you attempt to compile things. If you do try to compile before fulfilling all of the dependencies, the configure stage will, hopefully, notice that you are missing some necessary pieces. Dependencies change depending on your needs for a specific installation: a State monitoring server needs more than a machine that is only going to be monitored. The steps necessary for installation also change depending on your goals for that particular installation. If you are in any doubt, you can always check which configure options State will accept. This may also be necessary to tell State where some of its dependencies are located if you've installed them in alternate locations.

$ ./configure --help
    

Now all you really need to do is follow the appropriate installation procedure, depending on whether you're installing a Monitoring Server or a Statistics Server. Generally, you'll want to place a Statistics Server on every machine from which you want to gather disk space, load average, and other such information. Nearly all the other types of checks, such as service checks and availability, can be done remotely.

Monitoring Server

Generally there will only be one monitoring server on your network. It is the server responsible for checking all of the other servers, pulling data off the network in various ways, and issuing all of your notifications and alerts. It is also the most complicated type of State server to install.

State

If you have all of the proper dependencies, then running the standard autoconf procedure should work fine for installing State as a monitoring server. Note that you need to make sure PostgreSQL is installed, otherwise State will assume that it is running as a statistics server with no database support. Otherwise, just go through the standard steps:

$ cd STATE_SOURCE_DIRECTORY
$ ./configure
$ make
$ make install
      

By default, State will create and use a database with the name state. If you'd like to change this, use the --with-database-name configure option. If, at any time, you would like to avoid rebuilding any libraries you may have placed in the State source directory, such as the Apache Portable Runtime, you can use the state target in the Makefile to rebuild:

$ make state
      

This only builds State-specific targets, no third party libraries or tools. After State is installed, you can set up the database State will be using.

PostgreSQL

Now that State is installed you'll need to set up your PostgreSQL database for storing the runtime information and running the PHP website. You can't run State until this is done. You'll need to choose your database name now, and the machine that you'd like to place the database on. It's suggested that you run PostgreSQL on the same machine as State for performance reasons, although that isn't strictly necessary.

$ cd STATE_SOURCE_DIRECTORY
$ make database
      

After running this, there will be a PostgreSQL database with the name given during the configuration step and populated with the base schema. Now you can move on to the configuration step.

Statistics Server

Statistics servers are greatly simplified when compared to the monitoring server. No database support is required, and the configuration files themselves, as you'll soon see, are nearly identical for all statistics servers on the network. All that is really required is that you compile, install, and ensure connections can be made to the service from your monitoring server. Download and unpack the State source code and configure it like so:

$ ./configure --enable-pvd-only
    

This will instruct State to build only the pieces necessary for a statistics server and to ignore the other, more monitoring-specific parts. No web-related executables will be compiled and linked either, only the essentials. If you need to fulfill any dependencies, libxml for example, then go ahead and do that and come back to this step. If everything goes well, follow this step up with the standard autoconf installation procedure:

$ make
$ make install
    

Now you'll need to make State available to the outside world. There are two popular ways of doing this. You can set up State to run as a daemon, which means it'll always be running. Or you can run State from an INET type super server, which is the preferred solution. State, by default, is designed to run using the ucspi-tcp package available at http://cr.yp.to/ucspi-tcp.html, but should operate with any other similar solution. All you really need to do is ensure that runpvd.sh is executed when your machine starts up, and everything should go as planned. You can, optionally, install the SysV startup scripts to make managing the server easier and consistent with the other servers on your machine.

Chapter 3. Configuration

Basics/Definitions

Configuring State means writing an XML configuration file to inform State of its jobs and the properties that it should be using when getting things going. In order to do this, you'll need some definitions and some commentary on how things are organized; then we'll hit upon the actual configuration file and its internals. Note that some example template configuration files are provided, and these should provide a good foundation for you to get going.

Nodes

Node is just a general term for machines that have IP addresses and live on the network. Printers, routers, computers, and special purpose servers are all examples of nodes. State needs to know about a node if you plan to monitor anything related to it. It is safe to think of nodes as appliances on the network: if it's got an IP, then it's a node. It is important to realize that nodes are not just computers, although they almost always are. I think I've bashed that point to death, so we move on.

In order to simplify other areas of configuration that use nodes they can be grouped into Node Groups. Each node group has a unique name that is used to refer to it in other areas of the system. How nodes are grouped is up to you. Some possible groups in an academic installation would be:

  • Servers

  • Administrator Machines

  • User/Lab Machines

  • Research Machines

  • Printers

  • Routers

  • Room/Location

Nodes can belong to more than one node group. It is a good idea to keep a set of node groups for nodes that share a common set of services that need to be checked. Creating good node groups will come up a few more times in the coming sections, and after some experience checking nodes you'll gain a better understanding of how best to group them. Better organized node groups mean that your tasks can be simplified and combined.

Note

Internally, State always uses IPs to refer to nodes. The resolved names of machines are mostly stored for display purposes. If State is referring to a node, it's always going to use the IP. Now, this doesn't mean the names aren't or won't be used. For example, some day in the near future I intend to write a mod_dns which will allow administrators to ensure names are resolving to the proper IPs and such.

Tasks

Having your nodes grouped into useful sets now gives you the chance to operate on them; checking their status and gathering information. All of this is done with Tasks. A task is just that, a job that State carries out at a specified interval. Tasks can be very complicated and it is with tasks that much of the setup and configuration of State lies.

A task is always associated with at least one node. Tasks can also be associated with multiple node groups, which is why it pays to group similar nodes together. The nodes associated with a task are the nodes that the task operates on.

Tasks are executed by State at defined intervals. They are also scheduled internally so that the load on the monitoring server is minimal, since many tasks will operate at similar intervals. Inside each task is a collection of Queries, which are executed to perform the bulk of the processing related to the task. Tasks can also have properties/attributes that are used by extension modules for things not completely related to executing the tasks. These are explained in the extension module documentation and in the section walking through the configuration file.

Vitals

State revolves around the concept of Vitals. A vital is the object that changes to reflect changes in the health of the network. Every vital is associated with a node. It's the history and evolution of the vitals that we as system administrators are interested in. Vitals have a number of properties that are interesting. They have a status which describes how healthy they are, such as NORMAL, ALIVE, or WARNING. Each one has a simple text message called a note, which describes its status in human terms and gives details we can use to discern problems. Vitals also fall into categories or types, which make grouping them easier for certain jobs and purposes. Each vital also has a name, which identifies the vital. Names are unique on a node: no node can have two vitals with the same name. Finally, there are a number of other details that track when and how often the vital is being updated and when the vital changed.

Every vital has a history, a recorded series of changes in status. Using this information we can determine how often the vital is healthy and how often it is unhealthy. In addition, we can get a glimpse of how things were at a specific instant in time; all we need is a little creative SQL and a little one on one with the PostgreSQL database.
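To make the idea of a vital's history concrete, here is a minimal sketch in C of how the fraction of time spent healthy could be computed from a list of status changes. It is not taken from the State source: the demo_ types, the two-valued health flag, and the function name are all assumptions made for illustration, and in a real installation you would more likely ask PostgreSQL directly.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified view of one row of a vital's history. */
typedef enum { DEMO_HEALTHY, DEMO_UNHEALTHY } demo_health_t;

typedef struct {
    time_t        changed_at;  /* when the vital entered this status */
    demo_health_t health;
} demo_history_entry_t;

/* Fraction of the window [history[0].changed_at, until) spent healthy.
 * Entries must be sorted by changed_at, oldest first. */
double demo_healthy_fraction(const demo_history_entry_t *history,
                             size_t count, time_t until)
{
    double healthy = 0.0;
    size_t i;

    assert(history != NULL || count == 0);
    if (count == 0 || until <= history[0].changed_at)
        return 0.0;

    for (i = 0; i < count; i++) {
        /* Each status lasts until the next change, or until the window ends. */
        time_t end = (i + 1 < count) ? history[i + 1].changed_at : until;
        if (end > until)
            end = until;
        if (end > history[i].changed_at && history[i].health == DEMO_HEALTHY)
            healthy += difftime(end, history[i].changed_at);
    }
    return healthy / difftime(until, history[0].changed_at);
}
```

Given changes at 0 (healthy), 60 (unhealthy), and 90 (healthy) seconds and a window ending at 120 seconds, the vital was healthy for 90 of 120 seconds.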

Queries

Each Task contains one or more Queries which are executed to actually gather information. Queries look very much like simple URIs, much like web URLs, only they are much, much more. There are two types of Query URIs, local and remote. Local Query URIs are executed on the local State server; remote Query URIs are executed on remote State instances using HTTP. Here are some simple examples of Query URIs:

/state/kernel/loadavg
state://192.168.0.223:3434/state/network/ping
/state/network/ping
state://192.168.0.25/state/filesys/df
    

Pretty simple, actually. The path portion of the Query URI is the only required part; everything else is either assumed or derived. The path decides which handler is invoked on the State server. Handlers are provided by extension modules. By default, the handler tree, which can be thought of as a simple file system like /proc, is completely empty. Only when you begin loading modules into a State instance are valid paths created and handlers installed. You can also write and install your own custom handlers. All State URI paths begin with /state, the root of the hierarchy.
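As an illustration of how a Query URI breaks down, the following C sketch splits the example URIs above into host, port, and path. It is not State's actual parser. The demo_ names are hypothetical, and falling back to port 8021 when a remote URI omits the port is an assumption based on mod_httpd's documented default.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_DEFAULT_PORT 8021  /* assumption: mod_httpd's default port */

typedef struct {
    int  is_remote;   /* 0 = local handler, 1 = remote State instance */
    char host[64];    /* empty for local queries */
    int  port;        /* 0 for local queries */
    char path[128];   /* always under the /state root */
} demo_query_uri_t;

/* Returns 0 on success, -1 if the URI is malformed. */
int demo_parse_query_uri(const char *uri, demo_query_uri_t *out)
{
    static const char scheme[] = "state://";
    const char *path = uri;

    assert(uri != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    if (strncmp(uri, scheme, sizeof(scheme) - 1) == 0) {
        const char *authority = uri + sizeof(scheme) - 1;
        const char *colon;
        size_t host_len;

        path = strchr(authority, '/');
        if (path == NULL)
            return -1;  /* a remote URI still needs a path */

        /* Look for an optional ":port" inside the host part only. */
        colon = memchr(authority, ':', (size_t)(path - authority));
        host_len = (size_t)((colon ? colon : path) - authority);
        if (host_len == 0 || host_len >= sizeof(out->host))
            return -1;

        memcpy(out->host, authority, host_len);
        out->host[host_len] = '\0';
        out->port = colon ? atoi(colon + 1) : DEMO_DEFAULT_PORT;
        if (out->port <= 0 || out->port > 65535)
            return -1;
        out->is_remote = 1;
    }

    /* Every path must begin with the /state root. */
    if (strncmp(path, "/state", 6) != 0 ||
        (path[6] != '/' && path[6] != '\0'))
        return -1;
    if (strlen(path) >= sizeof(out->path))
        return -1;
    strcpy(out->path, path);
    return 0;
}
```

The path is the only part that every URI must carry, which is why the local form /state/network/ping and the remote form state://192.168.0.223:3434/state/network/ping end up naming the same handler.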

Notifications

Notifications are your eyes and ears into what's going on with your State installation. Typically, you'll be receiving notifications in the form of e-mails from mod_mail's interaction with mod_notifs. Because e-mail is a common gateway to paging, it's pretty safe to assume that it's a fairly common module to load and let handle your notifications. Each task is placed into a Notification Type, which determines how notifications for problems arising from that task are handled. Notifications are quirky, so it's necessary that you understand how things work in order to design them to notify you in the best way. For a description of notifications, see the documentation on mod_notifs; it's all in there.

Walkthrough

A State configuration file can be quite large and complicated. There are a number of different areas that you have to tweak, adding and removing things in order to get it all worked out. There's a reason for this though - State is extremely flexible and is designed to be easily extendable in the future. This design axiom is reflected in its configuration. Oftentimes you'll find yourself giving what you may consider too much information, or giving simple information in a complicated manner. This is just your imagination, it's fine. Perhaps in the future we'll provide a simpler file format and a conversion tool. Until then, just go with this.

Global/Main

All State configurations begin with a very straightforward collection of tags and instructions. It is in this main area where the database is initialized, modules are loaded, and some global properties are assigned.

        

Modules

Nodes

Tasks

Notifications

Disk

RRD

Soft States

Chapter 4. Source

Outline

Hopefully you'll be filled with a deep urge to extend and contribute to State, fixing bugs and adding new features. In order to do that best though, you'll need a grasp on the source. So, here's some information on a few things that may be a little curious to newcomers. Before you start hacking on things, please read this section, it'll help iron some details out for you. Any other questions can just be redirected to the mailing list, that's what it's there for.

Apache Portable Runtime

State makes extensive use of the Apache Portable Runtime. APR is a portability layer developed for use by the Apache web server to hide non-portable code away from the server's core. It provides a very useful API for developing C based server software. State uses every feature that APR provides so that the software remains as portable as possible, or as portable as Apache anyway. For more information on APR, see the APR web site at http://apr.apache.org/.

Memory Pools

Because the software is written exclusively in C we found it very important to keep things as fast and robust as possible. Memory leaks are a common destroyer of robustness, so great care was taken to eliminate them. One approach we used was the memory pool API included in the APR library. Memory pools are simple region-based memory allocators. Nearly every method in the API takes an apr_pool_t from which it will allocate any necessary memory or resources. When all the work is done, the memory pool is destroyed. Memory pools can also be cleared, which allows the memory already allocated to the pool to be reused and is necessary in unbounded loops and other similar situations.

Pools are also used to govern the lifetimes of various objects. Because pools are all subpools of other pools, they form a hierarchy. As an example, each task has its own pool that it uses during execution, and any objects allocated for use over the entire task are allocated from that pool. It's important to realize that when writing functions that take pools, it is up to the caller how long the objects allocated by the function live, not the function. If you find yourself creating pools often, then chances are something is being done wrong. For tips on how to use pools, the Subversion team has come up with a very nice list of pointers.
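For readers who have not met region-based allocation before, here is a toy pool in plain C. It is a stand-in written for this guide, not APR's implementation: the demo_ names are hypothetical, it uses one fixed-size region, and it has no subpools or cleanups. In real State code you would use APR's own apr_pool_create, apr_palloc, apr_pool_clear, and apr_pool_destroy instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A toy fixed-size region: allocations are carved off the front and are
 * only ever released all at once. */
typedef struct {
    unsigned char *base;
    size_t         capacity;
    size_t         used;
} demo_pool_t;

demo_pool_t *demo_pool_create(size_t capacity)
{
    demo_pool_t *pool = malloc(sizeof(*pool));
    if (pool == NULL)
        return NULL;
    pool->base = malloc(capacity);
    if (pool->base == NULL) {
        free(pool);
        return NULL;
    }
    pool->capacity = capacity;
    pool->used = 0;
    return pool;
}

/* Returns NULL when the region is exhausted. There is no per-object free. */
void *demo_pool_alloc(demo_pool_t *pool, size_t size)
{
    size_t aligned;
    void *p;

    assert(pool != NULL);
    aligned = (size + 15u) & ~(size_t)15u;  /* round up to 16 bytes */
    if (aligned < size || aligned > pool->capacity - pool->used)
        return NULL;
    p = pool->base + pool->used;
    pool->used += aligned;
    return p;
}

/* Forget every allocation but keep the memory for reuse, as you would at
 * the top of each iteration of an unbounded loop. */
void demo_pool_clear(demo_pool_t *pool)
{
    assert(pool != NULL);
    pool->used = 0;
}

/* Release the region itself; everything allocated from it dies with it. */
void demo_pool_destroy(demo_pool_t *pool)
{
    if (pool != NULL) {
        free(pool->base);
        free(pool);
    }
}
```

The point to notice is that nothing allocated from the pool is freed individually. Whoever owns the pool decides when everything in it goes away, which is what makes the caller responsible for object lifetimes.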

Errors

Anyone who's written any C knows that it lacks a serious form of error tracking and control. Nothing like C++'s exception handling system is available. This makes errors especially tedious to handle. Having looked at quite a lot of source code in my time, I was quite pleased to stumble upon a very elegant solution to this problem. It's common practice to return numeric error codes from all functions in a specific API; for an example, look at nearly every function in the APR public interface. State uses a system, first seen by me in the Subversion project, where a more complex error object is, potentially, returned from each public function. Such functions in State are easy to spot: they all return a pointer to an sul_error_t object. If no error occurs, then the function returns SUL_NO_ERROR, which is simply NULL. Things become more complicated when an error does occur, which is where the following macros and functions come in:

sul_error_create

All errors are initially created with this function. Each error is given a descriptive message, and its source file and line number are recorded. In addition, if the error is the result of an APR error, the APR status code will be given and included in messages. All errors are created in their own pools; this avoids strange problems with lifetimes, since errors will often be passed up a call stack, through pool and lifetime boundaries.

SUL_ERR

Any call to a function that returns errors should be wrapped by this macro. It will check the return of the function for an error, and if one is found, return it from the current function, passing the error up the call stack very much like an exception.

SUL_ERR_WRAP

This macro is very similar to the SUL_ERR macro, except it wraps the returned error in a new one that gives more information on the problem. This way, very general errors such as Premature end of file! can be further qualified with the operation that failed, so that the user can see that operation X failed because Y happened and made Z fail. Otherwise, simple errors such as File Not Found are simply too broad to narrow the problem down, especially if multiple areas of the code base use that specific file.
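The pattern described above can be sketched in a few lines of C. What follows is an illustration of the technique only, not the State source: every name carries a demo_ or DEMO_ prefix precisely because the real signatures of sul_error_create, SUL_ERR, and SUL_ERR_WRAP are not reproduced here, and this sketch uses malloc-style allocation where State creates each error in its own pool.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical error object: a message, where it was raised, and the more
 * general error it wraps, if any. */
typedef struct demo_error {
    char               message[128];
    const char        *file;
    int                line;
    struct demo_error *child;
} demo_error_t;

#define DEMO_NO_ERROR ((demo_error_t *)NULL)

demo_error_t *demo_error_make(const char *file, int line,
                              demo_error_t *child, const char *message)
{
    demo_error_t *err;

    assert(message != NULL);
    err = calloc(1, sizeof(*err));
    if (err == NULL)
        abort();
    strncpy(err->message, message, sizeof(err->message) - 1);
    err->file = file;
    err->line = line;
    err->child = child;
    return err;
}

/* Create a fresh error, recording the current source location. */
#define demo_error_create(msg) \
    demo_error_make(__FILE__, __LINE__, NULL, (msg))

/* Pass any error straight up the call stack, much like an exception. */
#define DEMO_ERR(expr)                                     \
    do {                                                   \
        demo_error_t *demo_err_ = (expr);                  \
        if (demo_err_ != DEMO_NO_ERROR)                    \
            return demo_err_;                              \
    } while (0)

/* Same, but wrap the error in one that names the failed operation. */
#define DEMO_ERR_WRAP(expr, msg)                           \
    do {                                                   \
        demo_error_t *demo_err_ = (expr);                  \
        if (demo_err_ != DEMO_NO_ERROR)                    \
            return demo_error_make(__FILE__, __LINE__,     \
                                   demo_err_, (msg));      \
    } while (0)

void demo_error_free(demo_error_t *err)
{
    while (err != NULL) {
        demo_error_t *child = err->child;
        free(err);
        err = child;
    }
}

/* Three made-up layers of a call stack. */
static demo_error_t *demo_read_header(int truncated)
{
    if (truncated)
        return demo_error_create("Premature end of file!");
    return DEMO_NO_ERROR;
}

static demo_error_t *demo_load_config(int truncated)
{
    DEMO_ERR_WRAP(demo_read_header(truncated),
                  "Unable to load configuration");
    return DEMO_NO_ERROR;
}

demo_error_t *demo_start(int truncated)
{
    DEMO_ERR(demo_load_config(truncated));
    return DEMO_NO_ERROR;
}
```

Calling demo_start(1) returns an error whose message names the failed operation and whose child carries the original, more general message, so the chain can be printed from the outermost cause inward.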

Object Relational System

A large portion of the information State manipulates lies inside of a PostgreSQL database. Anybody that's used a database knows that saving and pulling data is largely a repetitive process. In State, this task is much easier because much of the time we're only pulling data from a single row of information, rarely using data from joins in our actual work. In order to make State easier to debug and operate I've created a simple Object Relational System to handle saving and loading data from the database. For each object type we define a field mapping with all of the fields we're pulling into the object from our queries and their various properties. There are then a few methods which take queries and these field mappings and create objects from, or save objects to, the database.

Peers

A Peer is just a table in the database. If you've studied any Object Relational Systems you'll recognize the term. Peers manage objects of a specific type. It's the peer that stores the field mapping and other details that help us manage the objects in a specific table.

SQL Backend

In a few areas we allow the ORS to generate the SQL we need, but only in simple situations where the SQL is very simple. Most of the time we have to write the SQL ourselves. Fortunately, the ORS helps us in the larger, more common case. Maybe someday we'll expand this, but right now things seem fine.
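To give a feel for what a peer's field mapping and a generated query might look like, here is a small C sketch. It is a guess at the general shape, not the ORS itself: the demo_ types, the vitals table, its column names, and the id key column are all hypothetical, and the $1 placeholder is simply PostgreSQL's positional parameter syntax.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef enum { DEMO_FIELD_INT, DEMO_FIELD_TEXT } demo_field_type_t;

/* One entry of a field mapping: a column and where it lands in the object. */
typedef struct {
    const char        *column;
    demo_field_type_t  type;
    size_t             offset;
} demo_field_t;

/* A peer ties a table name to the field mapping for its objects. */
typedef struct {
    const char         *table;
    const demo_field_t *fields;
    size_t              field_count;
} demo_peer_t;

/* Hypothetical object type and its peer. */
typedef struct {
    int  id;
    char name[64];
    char status[16];
} demo_vital_t;

const demo_field_t demo_vital_fields[] = {
    { "id",     DEMO_FIELD_INT,  offsetof(demo_vital_t, id) },
    { "name",   DEMO_FIELD_TEXT, offsetof(demo_vital_t, name) },
    { "status", DEMO_FIELD_TEXT, offsetof(demo_vital_t, status) }
};

const demo_peer_t demo_vital_peer = { "vitals", demo_vital_fields, 3 };

/* Builds a single-row SELECT from the mapping.
 * Returns 0 on success, -1 if the buffer is too small. */
int demo_peer_select_sql(const demo_peer_t *peer, char *out, size_t out_size)
{
    size_t used, i;
    int n;

    assert(peer != NULL && out != NULL && peer->field_count > 0);

    n = snprintf(out, out_size, "SELECT ");
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    used = (size_t)n;

    for (i = 0; i < peer->field_count; i++) {
        n = snprintf(out + used, out_size - used, "%s%s",
                     i ? ", " : "", peer->fields[i].column);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, out_size - used,
                 " FROM %s WHERE id = $1", peer->table);
    if (n < 0 || (size_t)n >= out_size - used)
        return -1;
    return 0;
}
```

Anything beyond a single-table, single-row lookup like this is the kind of query that still has to be written by hand.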

You may have noticed that there is a single library, libsos, where all of the SQL lies. This is by design, to make future changes and maintenance of the code easier. Any changes to the database will require changes to this library, and everything else should work just fine. Unless, of course, you change a field name or add new fields; then you'll have to use those fields, and that will almost always happen outside of the sos library.

Chapter 5. Modules

Introduction

The stated daemon itself is merely a framework that loads customizable modules from a large collection. It is these modules that control exactly what role that daemon instance plays. The monitoring server instance of stated will load a different collection of modules than the statistics server. Modules give functionality, everything else is just supporting infrastructure.

mod_clean

Keeps the PostgreSQL database clean and vacuumed. A very simple module that is intended for database capable servers, where the stated instance will be using a PostgreSQL database regularly. Anyone that has managed a PostgreSQL database for intense operations knows that if you don't vacuum regularly your performance goes down the drain.

Load this on any State installation that has connections to a database that isn't being vacuumed some other way.

mod_disk

In order to monitor disk space in a flexible manner State will execute programs similar to the UNIX df and parse the information given for each partition. It will turn that information into a collection of State values which are then passed through the pipeline. Anyone can write df-like programs that State can run. For example, to facilitate monitoring our Network Appliance we use a custom df-like program that is run from the State server.

Note

mod_disk requires a GNU version of df in order to operate correctly!

Configuration

Server Side df

Vitals

mod_dynamic

Handles loading extension modules from shared libraries. As of now, all the cool modules are compiled in statically, so mod_dynamic has no real use until later; it's also untested. Special care has been taken to prepare for its use though, so there's no real reason why anything would go wrong. Please note that extension modules can never be unloaded once they are loaded. That's just the way things are.

mod_exec

Note

Incomplete!

mod_flap

Note

Incomplete!

mod_httpd

Provides an HTTP daemon on a specific port for serving remote queries. This is absolutely necessary on statistics servers, otherwise there's no way to move information around. mod_httpd will also be invoked if you configure State to run under an INETD setting. It will understand any valid HTTP request, but only specific ones will be honored with a useful response. It's generally intended to be used as a transport for other tools and such, rather than a browsable resource.

Note

mod_httpd's default port is 8021.

mod_linux

Under Linux, mod_linux provides several statistics-related handlers for giving away information like load average, system load, network traffic, and system identification. It's up to you what exactly you do with this information; we don't find it very popular to monitor them. Please, only load this module under Linux. Chances are things will fall apart if you use it under another operating system.

  • /state/system/loadavg

  • /state/system/who

  • /state/system/uname

  • /state/network/interfaces

mod_lua

Lua is a very simple, C like language that was designed to be embedded in larger programs. It's very easy to learn when you have programming experience. mod_lua provides a very simple Lua gateway, so that incoming information can be processed and interpreted in a simple way, making writing simple checks easier and lowering their overhead.

Note

Incomplete!

mod_mail

E-mail notifications and messages are dispatched by mod_mail. It installs hooks that are invoked whenever an appropriate time to send an e-mail comes up. mod_mail only really sends e-mails for notifications triggered from mod_notifs. See the example notification types for examples of how to use mod_mail; you simply add EMail tags to your notification script.

mod_notifs

This module listens in on the incoming vital updates and watches for changes in status. When a vital does change state, then the first thing mod_notifs does is grab the configuration for the task that the vital is being updated under. From this configuration, the Notification Type's configuration is retrieved, and the fun begins. These are the steps that mod_notifs takes when first noticing a vital changed status:

  1. Retrieve the notification type configuration from primary configuration file using the NotifType attribute associated with the running task.

  2. Check to see if the Ignore attribute is set to a boolean true value, and if so, the notification is discarded and processing stops.

  3. Record the notification in the pending notifications table, along with its timing information and the notification type. Execution of the vitals update then continues, leaving the notification waiting until the notification thread can take care of it.

Every few seconds, the notification thread wakes up and checks for pending notifications. Then, for every pending notification the following series of steps is executed:

  1. See if this notification is new and has yet to be handled, if so we continue on. If it's been handled, we check to see if it's time for processing again, because its update interval has passed. If the update interval hasn't passed, then we move on to the next pending notification.

  2. Look for this pending notification in our collection of secondary notifications. This collection stores all notifications that are of lower priority than others because they are less severe. A node going down and its PING vital failing is of higher priority than a service on that node becoming less healthy. If the notification is found there, it's ignored.

  3. If we're a recovery notification, then delete any older, pending notifications for the vital that is recovering. This keeps those failure messages from continuing after the vital has recovered.

  4. See if the node that this notification refers to is suspended; if so, delete and ignore the notification.

  5. Grab all of the pending notifications that are above this notification in our dependency tree and all of them that are below us. Check to see if all of the vitals above us are updated and newer than the vital for the notification being investigated. This means we know it's the extent of the problem. Otherwise, we stop processing and wait for the vitals we depend on to be updated.

  6. If we don't depend on any vitals, or all the vitals we do depend on have been updated since the notification was created, then we execute our notification type's script.
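The six steps above amount to a small decision procedure, which the following C sketch walks through in order. It is only a reading of the list, not mod_notifs itself: the demo_ types, the flag fields, and the action names are hypothetical, and the real module looks these facts up in the database and the dependency tree rather than receiving them precomputed.

```c
#include <assert.h>
#include <time.h>

typedef enum {
    DEMO_SKIP,     /* not due yet, move on to the next notification */
    DEMO_IGNORE,   /* a lower priority, secondary notification */
    DEMO_DELETE,   /* the node is suspended */
    DEMO_WAIT,     /* vitals we depend on have not been updated yet */
    DEMO_EXECUTE   /* run the notification type's script */
} demo_action_t;

/* Hypothetical snapshot of one pending notification. */
typedef struct {
    int    handled;               /* processed at least once before */
    time_t last_handled;
    int    update_interval;       /* seconds between repeat processing */
    int    is_secondary;
    int    is_recovery;
    int    node_suspended;
    int    dependencies;          /* vitals above us in the tree */
    int    dependencies_updated;  /* of those, updated since creation */
} demo_pending_t;

/* Sets *purge_older when older pending notifications for the same vital
 * should be deleted (step 3). */
demo_action_t demo_process_pending(const demo_pending_t *n, time_t now,
                                   int *purge_older)
{
    assert(n != NULL && purge_older != NULL);
    *purge_older = 0;

    /* 1. Already handled and the update interval has not passed. */
    if (n->handled && difftime(now, n->last_handled) < n->update_interval)
        return DEMO_SKIP;

    /* 2. Secondary notifications are ignored. */
    if (n->is_secondary)
        return DEMO_IGNORE;

    /* 3. A recovery supersedes older failures for the same vital. */
    if (n->is_recovery)
        *purge_older = 1;

    /* 4. Suspended nodes produce no notifications. */
    if (n->node_suspended)
        return DEMO_DELETE;

    /* 5. Wait until everything we depend on has been updated. */
    if (n->dependencies_updated < n->dependencies)
        return DEMO_WAIT;

    /* 6. No dependencies, or all of them are current. */
    return DEMO_EXECUTE;
}
```

Because each step can end processing early, the order of the checks matters: a suspended node is still purged of stale failures on recovery, but never reaches the script.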

As you can see, notifications are pretty hairy creatures. Once all this is done, then the cool, fun stuff happens. Each Notification Type has a simple script that determines what happens when we're executing the notification. It is here where things get pretty interesting. A script is just a simple collection of XML tags that have specific meanings to mod_notifs:

EMail

Invokes a special signal that is created and hooked by mod_mail that sends an e-mail to the specified individuals. This is actually a special tag because it's never actually handled by mod_notifs. Because no matching tag can be found, it's treated as something implemented in another module and so a signal is emitted with the same name as the tag. This means it is very simple for people to add more actions and tags to the notification process.

After

A specially defined condition that causes the insides of the After block to be executed when the given duration of time has elapsed. This way notifications can do different things after a specified period of time. You can have any number of these conditional blocks, giving you a great deal of flexibility.

Delete

Deletes the notification. This is necessary, otherwise notifications would continue on and on forever until they recovered, unless of course that is what you'd like. It's most useful when used in an After block.

There you have your friendly household notification system. Please see the example notification configurations provided for more details and examples of how to construct useful notification types and use them in practice.

mod_ping

Node connectivity is one of the most important health-related properties that State observes. To check it, mod_ping uses the absolutely beautiful tool fping. In order to use mod_ping, fping will need to be installed and executable by the user that State is running as. You can grab fping at http://www.fping.com/. When you configure mod_ping, you can set up the command line options how you like them, which allows for some level of customization.

mod_response

Whenever a vital changes status, mod_response executes response handlers provided by the administrator to handle the change and take any necessary, predefined action.

Handlers for the network are arranged in a specially designed file tree. Its basic structure involves a directory for each node, with subdirectories for each vital, which in turn hold the response handler executables. As an example, here is a simple response handler tree:

/192.168.0.2/http/normal
/192.168.0.2/http/warning
/192.168.0.2/http/fatal
/192.168.0.2/ping/alive
/192.168.0.2/ping/dead
/192.168.0.7/ping/alive
/192.168.0.7/ping/dead
    

Each handler is executed whenever its vital enters the status with the handler's name. All names in the tree are lower case, to avoid any ambiguities. It's up to you, as the administrator, what goes on inside of these. Oh, and please don't send e-mails from them.
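The lookup implied by this tree can be sketched in a few lines of Python. The function names and the root argument are assumptions for illustration; the only things taken from the text are the node/vital/status layout and the lower-case rule.

```python
import os

def handler_path(root, node, vital, status):
    """Build the path of the handler for a vital entering a status."""
    # Vital and status names are lower case in the tree.
    return os.path.join(root, node, vital.lower(), status.lower())

def find_handler(root, node, vital, status):
    """Return the handler path if it exists and is executable, else None."""
    path = handler_path(root, node, vital, status)
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return path
    return None
```

With a tree rooted at some directory, find_handler(root, "192.168.0.2", "HTTP", "FATAL") would point at the 192.168.0.2/http/fatal executable if one has been provided, and return None otherwise.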

Warning

mod_response is a major work in progress and an ideological kink in my neck. While developing State, I always intended for her to be passive. I never imagined or intended State to go around trying to fix things on its own. The whole scenario is just too Skynet-like for me. It is useful for some people though, so I digress.

mod_rrd

Configuration

Reporting

mod_snmp

Every system administrator has stumbled upon SNMP, or will sooner or later. I personally despise SNMP, which is very unfortunate considering how popular she is. With mod_snmp you can map SNMP OIDs into State's value tree. You define the basic structure of the State value tree and, for each node, you place a simple OID inside. It's pretty simple, although we plan on improving things and making them more flexible. Our goal is to make SNMP values appear the same as standard State values, which they do, but there's still more to be done in the area of configuration.

mod_soft

Some Vitals are a little unstable and inconsistent with their checks, sometimes returning failure during network hiccups or other stressful conditions. In these cases it's useful to allow them so many chances to recover before they actually become a problem. mod_soft keeps track of the number of times a Vital has changed from its active status and will only change its status if things stay changed, and don't recover immediately.

As an example, assume you have a vital FOO on some random machine. FOO is in the NORMAL status and gets updated to the WARNING status. FOO will remain NORMAL until one more WARNING update comes in, finally pushing FOO to WARNING. If, after that first update, a new one comes in for NORMAL, signifying some kind of hiccup or flapping situation, then FOO will remain NORMAL, just as if nothing had happened. Sometimes this behavior isn't desired, since such changes can mean there are significant problems around.
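The FOO example can be written down as a tiny Python sketch. The class name and the required parameter are inventions for this illustration; it assumes, as in the example, that two consecutive updates are enough to make a change stick, and it is not the actual mod_soft code.

```python
class SoftVital:
    """Track a vital whose status only changes after repeated updates."""

    def __init__(self, status, required=2):
        self.status = status      # the active status
        self.required = required  # consecutive updates needed (assumed to be 2)
        self._pending = None      # status waiting to take hold
        self._count = 0           # how many times in a row it has been seen

    def update(self, new_status):
        if new_status == self.status:
            # Recovered before the change took hold: forget the pending change.
            self._pending, self._count = None, 0
        elif new_status == self._pending:
            self._count += 1
        else:
            self._pending, self._count = new_status, 1
        if self._pending is not None and self._count >= self.required:
            self.status = self._pending
            self._pending, self._count = None, 0
        return self.status
```

Starting from SoftVital("NORMAL"), a single WARNING update leaves the status at NORMAL, a second WARNING in a row pushes it to WARNING, and a NORMAL update in between resets the count as if nothing had happened.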

mod_solaris

Under the Solaris SunOS operating system, many of the Linux specific checks just don't work very well, so it's necessary to use this module when gathering statistics under the Solaris operating system. Only try and load/use this module under Solaris.

mod_services

Services are one of the prime reasons to install State; they are the whole reason for networks of computers. If machines are unable to access services such as the web or mail, then people might as well go on vacation or something. Each service can be in one of three states: NORMAL, WARNING, or FATAL. Of course, you'd rather they all be NORMAL. It is up to you which services are checked and which properties to use when checking them. Checking a service yields a bit of information about it, depending on its status:

NORMAL

Things are just fine with this service! Typically the time the service took to respond is provided, along with any other identifying information the service provided during the check, such as version numbers. We all wouldn't mind having our services like this all the time.

WARNING

The service could definitely be better off. Typically this means things are unhealthy, and that the service is very nearly FATAL. It's useful for non-serious things that could quickly become serious. Each Vital that is in this status will have a descriptive message giving some ideas concerning why the service is not up to par, but isn't that a good thing?

FATAL

Drop everything and check on this service! This service is pretty much unavailable and people will begin complaining soon. Just as with the WARNING status you should check the Vital's note for information on the problem.

If you see a service check yield statuses other than these, then something may be amiss and you should definitely check on things, or at least the service check's documentation. State includes a large number of built-in service checks that can be used to ensure things are running smoothly and to keep you from having to write them yourself.

  • HTTP
  • FTP
  • SSH
  • SMTP
  • POP
  • IMAP
  • TCP
  • NIS
  • NFS

Many of the service checks can be executed using SSL, which allows you to check the secure forms of the services, such as HTTPS or SIMAP. You may also specify the port number the service is on, in case it is nonstandard. In other situations, like NIS/NFS, all you do is specify the node and off the test goes.
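To make the idea of a service check concrete, here is a minimal Python sketch of a plain TCP check that reports one of the three statuses along with a note. It is only an illustration: the function name, the timeout, and the slow_after threshold used to decide on WARNING are assumptions, not State's built-in TCP check.

```python
import socket
import time

def check_tcp(node, port, timeout=5.0, slow_after=1.0):
    """Return (status, note) for a plain TCP connect to node:port."""
    started = time.monotonic()
    try:
        with socket.create_connection((node, port), timeout=timeout):
            elapsed = time.monotonic() - started
    except OSError as error:
        # Refused, unreachable, or timed out: the service is unavailable.
        return "FATAL", f"connection failed: {error}"
    if elapsed > slow_after:
        # Assumed policy: a slow answer is unhealthy but not yet fatal.
        return "WARNING", f"slow response: {elapsed:.3f}s"
    return "NORMAL", f"responded in {elapsed:.3f}s"
```

A call such as check_tcp("192.168.0.2", 80) returns the status together with a note, which is where the response time or the reason for the failure ends up.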

mod_tasks

Monitoring servers will need to load mod_tasks. Its job is to load the tasks from the configuration and execute them at the proper intervals. It uses a pool of threads to handle its tasks, handing them off to idle threads for processing. For more information on the tasks, see the section on the task configuration.
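The general pattern of handing due tasks to a pool of worker threads looks something like the following Python sketch. The Task class, its fields, and spawn_due_tasks are hypothetical names for this example, and the scheduling rule (run when due, then wait one interval) is an assumption rather than a description of mod_tasks itself.

```python
from concurrent.futures import ThreadPoolExecutor

class Task:
    """A unit of work that should run once every `interval` seconds."""

    def __init__(self, name, interval, action):
        self.name = name
        self.interval = interval
        self.action = action
        self.next_run = 0.0  # due immediately on the first pass

def spawn_due_tasks(tasks, pool, now):
    """Hand every task that is due at time `now` to the thread pool."""
    futures = []
    for task in tasks:
        if now >= task.next_run:
            task.next_run = now + task.interval
            # An idle worker thread in the pool picks the task up.
            futures.append(pool.submit(task.action))
    return futures
```

A main loop would create the pool once, for example with ThreadPoolExecutor(max_workers=4), and call spawn_due_tasks with the current time on every tick.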

mod_wkitems

mod_xml

State uses XML in quite a few places. With mod_xml, State will dump XML files containing useful health related information that other tools can use to display information. As an example, mod_xml will write a nodes.xml file whenever node health changes containing the status of all network nodes. The mapping system then uses this file to determine which nodes are up and which are down. Third party tools that dump Vital tags also use mod_xml, as it's where that tag gets turned into an actual vital update.

Chapter 6. Using State

Table of Contents

Statuses

Statuses

State uses a number of statuses to denote the various health levels of the vitals that it's monitoring. Each status has a label and a degree. Its label is used for identification purposes as well as for display. A status's degree is more interesting. Degree is a real number that ranges from 0 to 1 and numerically quantifies health. A degree of 1.0 is perfectly healthy and a degree of 0.0 is totally and completely dead. Some statuses have degrees that lie in between, such as WARNING, with a degree of 0.5. It's the degree that is used to determine if a vital has recovered by increasing in health. Degrees are also used to calculate the health of a large number of vitals to great accuracy. Many statuses have differing labels but the same degree, such as ALIVE and NORMAL. This is only because one status makes more sense than the other in a specific situation. Later, there may be ways of classifying and using these differing statuses to provide more flexibility.
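Here is a small Python sketch of how degrees might be put to work. Only WARNING's degree of 0.5 and the fact that ALIVE and NORMAL share a degree come from the text above; the other values in the table, the function names, and the use of a simple average for overall health are assumptions made for the example.

```python
# Label -> degree. WARNING = 0.5 is from the text; the rest are assumed.
DEGREES = {
    "NORMAL": 1.0,
    "ALIVE": 1.0,
    "WARNING": 0.5,
    "FATAL": 0.0,
    "DEAD": 0.0,
}

def has_recovered(old_status, new_status):
    """A vital has recovered if its degree (its health) went up."""
    return DEGREES[new_status] > DEGREES[old_status]

def overall_health(statuses):
    """Average degree of a group of vitals (assumed aggregation rule)."""
    return sum(DEGREES[status] for status in statuses) / len(statuses)
```

Under these assumptions, going from WARNING to NORMAL counts as a recovery, going from ALIVE to NORMAL does not, and a group of three vitals in NORMAL, WARNING, and FATAL has an overall health of 0.5.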

Chapter 7. Frequently Asked Questions

FAQs

State hasn't been released for very long, not nearly long enough for anything to be asked frequently.

General

Who do I complain to? Who's responsible for this?

You can reach me, Jacob Lewallen, by e-mailing me at . Please try and be specific, and calm. State also has a mailing list that you can subscribe to and discuss things on. You may subscribe to the mailing list at https://www.cs.ucr.edu/mailman/listinfo/state-dev/

How do I contribute?

If you have an idea for a feature or have found a bug, you can begin by contributing a patch either to Jacob or the State mailing list. Otherwise, just join the mailing list and ask for something to do! I'm sure we can find something to put you to work on. Even if you can't program there are things to do. As you can see, even this documentation is lacking in attention.

Installation

Where do I get all of the dependencies?

Here is a list of packages that people usually need to find in order to compile, link, and run State:

Configuring

Can I make State tell me when my steak is medium rare?

Nobody has written mod_steak just yet.

Running

State just flips out and goes crazy on me!

Oh no.

Chapter 8. State Internals

Table of Contents

Flow Traces

Flow Traces

Example 8.1. Flow: mod_tasks

    mod_tasks:
      smd_worker_spawn_tasks
        "pushes tasks into g_threadp_tasks thread pool"

      smd_threadpfn_tasks 
        "executes the tasks as they arrive"
    
      task_execute_queries
        smd_execute_query
          smd_run_query_sending
          smd_realm_server_query
            if local query:
              scd_realm_server_query
                scd_reply_create
                scd_tree_node_search
                scd_tree_resolve
                scd_query_setup_paths
                INVOKE: HOOK SERVE FUNCTION
                RETURN REPLY
            else:
              scd_server_query_remote
                sul_httpd_req_create
                sul_httpd_req_open
                sul_httpd_req_retrieve
                sul_configuration_parse
                scd_reply_parse_cfg
                sul_httpd_req_close
                RETURN REPLY
      

Example 8.2. Flow: mod_rrd

    mod_rrd:
      sul_configuration_load
      check_definitions
        if check_source_filter
          check_paths
            rrd_database_create
            incorporate_values
              rrd_database_series_push
            rrd_create_database
              rrd_create_directory
            rrd_update_database
            write_manifest_file
      

Example 8.3. Flow: mod_lua

    mod_lua:
      modula_instance_create
        lua: smd_create
      reply_walk
      if !interested:
        RETURN
      lua: smd_reply_received
      

Example 8.4. Flow: mod_xml

    mod_xml:
      sul_configuration_load
      mnfst_check_definitions
        mnfst_check_paths
          mnfst_create_directory
          mnfst_write