Hacking NOCtool

The NOCtool project

The scheduler

The scheduler class points to a doubly-linked queue of time-slots. Each time-slot in turn knows when it is supposed to run and which events need to be run at that time.

Adding an event to a scheduler is a two-step process: first find (or create) the time-slot that corresponds to the time at which we want the event to run, then push the event onto that time-slot.

An event is run by calling the generic function PROCESS with the event as its argument. In general, events are monitor objects.

The scheduler API is fairly limited, consisting of the following functions:

schedule object time &optional scheduler
Schedule an event at a given time. Unless specified, this will be added to the default scheduler.
next-time &optional scheduler
Get the time for the next time-slot. It is unspecified what happens if this is called with a scheduler that has no events scheduled. [ probably worth making it return NIL ]
next-timeslot &optional scheduler
Return the next time-slot and remove it from the scheduler.
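
The scheduler behaviour described above can be modelled in a few lines. This is a Python sketch for illustration only, not NOCtool's Lisp code: the class names are hypothetical, a sorted list stands in for the doubly-linked queue, and returning None from an empty scheduler is an assumption (the original leaves that case unspecified).

```python
class TimeSlot:
    """One slot: the time it should run and the events due at that time."""
    def __init__(self, time):
        self.time = time
        self.events = []


class Scheduler:
    """Keeps time-slots ordered by time (a sorted list models the queue)."""
    def __init__(self):
        self.slots = []

    def schedule(self, event, time):
        # Step 1: find the slot for this time, or create it in sorted position.
        for index, slot in enumerate(self.slots):
            if slot.time == time:
                break
            if slot.time > time:
                slot = TimeSlot(time)
                self.slots.insert(index, slot)
                break
        else:
            # No slot at or after this time: add a new one at the end.
            slot = TimeSlot(time)
            self.slots.append(slot)
        # Step 2: push the event onto that slot.
        slot.events.append(event)

    def next_time(self):
        # Assumption: None when nothing is scheduled.
        return self.slots[0].time if self.slots else None

    def next_timeslot(self):
        # Return the earliest slot and remove it from the scheduler.
        return self.slots.pop(0) if self.slots else None
```

Two events scheduled for the same time end up in the same slot, so a single next_timeslot call hands back both of them.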

Equipment

Two important concepts in NOCtool are "equipment" and "monitors". Essentially, each "equipment" represents a physical network element (a server, switch, router, hub or any other piece of physical kit we want to monitor) and each "monitor" is one specific thing on the equipment we want to check (network interface status, network traffic, certain processes running/not running, TCP-based services responding, disk utilisation...).

The current severity-of-fault for any equipment is the maximum of the severity status of its monitors, so in that respect all monitors are treated equally. The alert-level is a number in the 0-255 range. If it is ever set to a lower-than-current value, it moves slowly towards the lower value rather than dropping at once. At the moment the decay is linear: every time a new value is set, the alert level becomes the maximum of old-5 and new.
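
As a worked illustration of the decay rule, here is a small Python model (not NOCtool's own code). The function names are hypothetical, and clamping the result to the 0-255 range is an assumption.

```python
DECAY_STEP = 5  # the linear decay per update described in the text


def update_alert_level(old, new):
    """Return the alert level after a new value is set."""
    level = max(old - DECAY_STEP, new)
    # Assumption: the result is kept inside the documented 0-255 range.
    return max(0, min(255, level))


def equipment_severity(monitor_levels):
    """Severity of an equipment: the maximum over its monitors (0 if none)."""
    return max(monitor_levels, default=0)
```

A monitor at level 200 that starts reporting 0 therefore reads 195 after one update and needs 40 updates to reach 0, while a rise from 100 to 180 takes effect immediately.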

Each equipment class can have 0 or more default monitors associated with it. There's a mapping from the equipment class to a list of monitors that should always exist for that equipment class.

Monitors

Monitors can be either graphing or non-graphing. Only graphing monitors keep any historical state (beyond "what is my current alert level"). There are several built-in graphing classes, depending on the expected characteristics of the data the monitor retrieves.

Graphs

The graph classes come in several sub-classes, depending on how they transfer data from shorter-term storage to longer-term storage. Each storage class keeps 300 records and every 12th value added to a storage-class initiates a transfer of data to longer-term storage.

If we have a graphing monitor that stores data every 5 minutes, the short-term storage holds data with 5-minute granularity, for a maximum of 25 hours of detailed data. Every hour, this data also updates the medium-term storage, giving roughly 12 days' worth of data with hourly granularity. Twice a day, the medium-term data is used to populate the long-term storage, which holds 150 days' worth of data with a rather coarse granularity. This method is obviously inspired by MRTG and RRDTool.
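
The retention figures follow directly from the two constants (300 records per store, a transfer on every 12th value). The Python sketch below just redoes that arithmetic; the function name and the three-level default are assumptions made for the example.

```python
from datetime import timedelta

RECORDS_PER_STORE = 300   # records kept by each storage class
TRANSFER_EVERY = 12       # every 12th added value feeds the next store


def retention(sample_interval, levels=3):
    """Return a (granularity, time span covered) pair per storage level."""
    result = []
    step = sample_interval
    for _ in range(levels):
        result.append((step, step * RECORDS_PER_STORE))
        # The next store receives one value per 12 values of this one.
        step = step * TRANSFER_EVERY
    return result
```

For a 5-minute monitor, retention(timedelta(minutes=5)) gives granularities of 5 minutes, 1 hour and 12 hours, covering 25 hours, 300 hours (12.5 days) and 150 days respectively.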

The graph subclasses determine how data is treated as it is shifted from one storage to another. Looking at this, what's actually being shifted isn't the "latest 12", it's "the 12 that are about to be over-written".

Graph subclasses and their behaviour:

gauge-graph
The transferred value is the median value in the last 12 time units (actually the mean of the 6th and 7th values, when the last 12 are sorted in order).
meter-graph
The value about to be over-written in the shorter-term storage is transferred over. This is intended for data sources that are increasing (like, say, an output byte counter).
max-graph
This adds the maximum of the last 12 shorter-term units to the longer-term storage.
avg-graph
This averages the 12 oldest records and shifts that value into the longer-term storage.
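
The four transfer rules can be written as functions over a window of 12 records. This is a Python sketch rather than the NOCtool implementation: the function names are hypothetical, and it assumes the window is ordered oldest first, so the record about to be over-written is the first one.

```python
from statistics import mean, median


def gauge_transfer(window):
    # Median of the window; for 12 values this is the mean of the
    # 6th and 7th values once they are sorted.
    return median(window)


def meter_transfer(window):
    # The value about to be over-written (assumed to be the oldest).
    return window[0]


def max_transfer(window):
    # The largest value in the window.
    return max(window)


def avg_transfer(window):
    # The arithmetic mean of the window.
    return mean(window)
```

The meter rule keeps a single raw sample because, for an ever-increasing counter, summarising 12 readings would not give a meaningful counter value.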

Config files

The config files for NOCtool are intended to be written in pseudo-lisp, which is implemented as macros. Each top-level macro should have a logical name for the equipment type. It should create an object of a suitable class, bind *CONFIG-OBJECT* to it, and re-bind *MACRO-NESTING* to a cons of *MACRO-NESTING* and a suitable keyword symbol for the top-level object.

This is then used to make sure that sub-macros can check that they're in the right context for whatever configuration they provide. Sub-macros can be defined with the DEFNESTED macro.
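
The nesting check can be modelled outside Lisp as a pair of dynamically scoped variables. The Python sketch below is only an analogy for the macro mechanism: top_level and nested are hypothetical names, the "machine" keyword and "address" setting in the usage note are made up, and a dictionary stands in for the configuration object.

```python
from contextlib import contextmanager

macro_nesting = []      # stands in for *MACRO-NESTING*
config_object = None    # stands in for *CONFIG-OBJECT*


@contextmanager
def top_level(keyword, obj):
    """Bind the config object and extend the nesting while the body runs."""
    global macro_nesting, config_object
    saved = (macro_nesting, config_object)
    macro_nesting, config_object = [keyword] + macro_nesting, obj
    try:
        yield obj
    finally:
        # Restore the outer bindings, as a dynamic rebinding would.
        macro_nesting, config_object = saved


def nested(required_keyword, setting, value):
    """A sub-form: refuse to run outside the context it was written for."""
    if required_keyword not in macro_nesting:
        raise RuntimeError(f"{setting} is only valid inside {required_keyword}")
    config_object[setting] = value
```

Inside "with top_level("machine", {}):" a call such as nested("machine", "address", "192.0.2.1") succeeds, while the same call at top level raises an error because the required keyword is not in the nesting.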

All config files are loaded in a scrap package.

Default monitors

Each equipment class has a set of default monitors. The mapping from equipment class to monitor classes is held in *MONITOR-MAP* and the monitors are instantiated by a call to DEFAULT-MONITORS. All default monitors that have been identified for the equipment class and its super-class list are created and pushed onto the MONITORS slot of the equipment object (UNLESS there's already a monitor of that class in the list held in that slot). The mapping from equipment class to default monitors is done in default-settings.lisp.
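
The lookup along the super-class list can be sketched as follows. This is a Python model, not the code in default-settings.lisp: every class name here is made up for the example, and Python's method resolution order stands in for the Lisp class precedence list.

```python
class Monitor:
    pass

class PingMonitor(Monitor):      # hypothetical monitor class
    pass

class LoadMonitor(Monitor):      # hypothetical monitor class
    pass

class Equipment:
    def __init__(self):
        self.monitors = []

class LinuxHost(Equipment):      # hypothetical equipment class
    pass


# Stand-in for *MONITOR-MAP*: equipment class -> default monitor classes.
monitor_map = {Equipment: [PingMonitor], LinuxHost: [LoadMonitor]}


def default_monitors(equipment):
    """Create the missing default monitors for an equipment object."""
    for cls in type(equipment).__mro__:          # the class and its super-classes
        for monitor_class in monitor_map.get(cls, []):
            already_there = any(type(m) is monitor_class
                                for m in equipment.monitors)
            if not already_there:
                equipment.monitors.insert(0, monitor_class())  # like PUSH
    return equipment.monitors
```

Because of the "unless already present" check, calling default_monitors twice on the same object does not create duplicates, and a monitor configured by hand is left alone.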

Network layer

The network layer is implemented as a reader/parser/protocol dispatcher combo, where the reader is aware of the rough syntactic rules and hands over a complete read to the parser. It is done this way so incomplete syntactic units can be sent across the network, to be completed in a following transmission.
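
A reader of this kind only needs to know enough syntax to tell when a form is complete. The Python sketch below is a hypothetical model, not NOCtool's reader: it assumes a message is a parenthesised form and that double-quoted strings (with backslash escapes) may contain parentheses.

```python
class Reader:
    """Accumulates network chunks and returns complete parenthesised forms."""
    def __init__(self):
        self.buffer = ""

    def feed(self, chunk):
        self.buffer += chunk
        forms = []
        depth, start, consumed = 0, None, 0
        in_string, escaped = False, False
        for i, ch in enumerate(self.buffer):
            if in_string:
                # Inside a string only the closing quote matters.
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "(":
                if depth == 0:
                    start = i
                depth += 1
            elif ch == ")" and depth > 0:
                depth -= 1
                if depth == 0:
                    # A complete top-level form: hand it to the parser.
                    forms.append(self.buffer[start:i + 1])
                    consumed = i + 1
        # Keep the incomplete tail for the next transmission.
        self.buffer = self.buffer[consumed:]
        return forms
```

Feeding half a message returns an empty list; the rest of the message in a later chunk completes the form, and any trailing partial form stays buffered.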

The parser is, for all intents and purposes, CL:READ-FROM-STRING (at the moment), though it binds *READ-EVAL* and *READ-BASE*.

The result is a list data structure, whose rough structure is:

(message protocol-data digest)

The protocol-data is extracted and handed over to the protocol handler loop, where different things happen depending on the state of the connection.

There are three recognised states for any given connection: initial, sent-validation and validated. The state is kept in the STATE slot of the connection class.

When a connection is opened, the opening party will send an IAM protocol message (that looks roughly like "(iam sender receiver 20080401T131415)") and set its own state to sent-validation.

Any incoming connection will be created in the initial state. The main difference between the initial and sent-validation states is how they react to an IAM protocol message. In the initial state, this triggers the sending of an IAM message before moving to the validated state; in the sent-validation state it does not.
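
The handshake can be summarised as a three-state machine. The Python sketch below is a model under stated assumptions, not the NOCtool connection class: it assumes a connection in sent-validation moves to validated on receiving an IAM, it records sent messages in a list instead of writing to a socket, and it leaves out digest and timestamp checks.

```python
class Connection:
    def __init__(self, outgoing):
        self.sent = []
        if outgoing:
            # The opening party announces itself straight away.
            self.send_iam()
            self.state = "sent-validation"
        else:
            # Incoming connections start out unvalidated.
            self.state = "initial"

    def send_iam(self):
        self.sent.append("iam")

    def receive_iam(self):
        if self.state == "initial":
            # Answer with our own IAM, then consider the peer validated.
            self.send_iam()
            self.state = "validated"
        elif self.state == "sent-validation":
            # We already sent ours; do not send another one.
            self.state = "validated"
```

With one outgoing and one incoming connection object, each side ends up having sent exactly one IAM message and both reach the validated state.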

The IAM message is used to tie a connection to a specific peer (as configured). Each peer-to-peer connection has one or two shared secrets (identical or different), and the secret is used to compute an HMAC-SHA1 digest of the protocol-data. The amount of time an IAM message can be outstanding is controlled by VALIDATE-TIMESTRING.
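
For readers unfamiliar with HMAC, the digest step looks roughly like this when written with Python's standard hmac module. This is an illustration only: how NOCtool serialises the protocol-data before hashing and how it encodes the digest are not stated here, so the UTF-8 text input and hexadecimal output are assumptions, and sign and verify are hypothetical names.

```python
import hashlib
import hmac


def sign(secret: bytes, protocol_data: str) -> str:
    """HMAC-SHA1 of the protocol-data text, as a hex string (assumed encoding)."""
    return hmac.new(secret, protocol_data.encode("utf-8"), hashlib.sha1).hexdigest()


def verify(secret: bytes, protocol_data: str, digest: str) -> bool:
    """Recompute the digest and compare it in constant time."""
    return hmac.compare_digest(sign(secret, protocol_data), digest)
```

A receiver that does not hold the same secret, or that sees altered protocol-data, computes a different digest and rejects the message.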

In the validated state, protocol messages are dispatched by looking up the protocol message identifier in a hash table (called *HANDLER-MAP*). The easiest way to define a new handler is with the DEFHANDLER macro, where the lambda-list should match the protocol message, insofar as the first symbol listed will be used for dispatching. Also note that DEFHANDLER adds a CONN argument to the lambda list; during execution it is bound to the connection that caused the protocol handler to be called.
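
The table-driven dispatch can be modelled with a dictionary and a registering decorator. This Python sketch is an analogy for DEFHANDLER, not its definition: the "ping" message is made up for the example, and a plain list plays the part of the connection.

```python
handler_map = {}   # stands in for *HANDLER-MAP*


def defhandler(name):
    """Register the decorated function as the handler for `name`."""
    def register(func):
        handler_map[name] = func
        return func
    return register


@defhandler("ping")          # hypothetical protocol message
def handle_ping(conn, payload):
    # `conn` is the connection the message arrived on.
    conn.append(("pong", payload))


def dispatch(conn, message):
    """Look up the handler by the first element and call it with the rest."""
    name, *args = message
    handler = handler_map.get(name)
    if handler is None:
        raise KeyError(f"no handler for {name!r}")
    return handler(conn, *args)
```

Calling dispatch(conn, ["ping", 42]) finds handle_ping through the table and passes it the connection followed by the remaining elements of the message.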