In 2009, a national ISP employed a simple Nagios instance to alert on core network issues. Oncall pages were sent to all staff at all hours, and a nominated engineer was “on call” every week.
As the ISP (and parent, managed-services company) extended our managed services to customers, engineers initially attempted to deploy similar Nagios instances per-customer. The per-customer model soon became unwieldy, since each customer or principal engineer established their own escalation rules, time frames, and checks.
Oncall engineers were expected to support a growing customer base, but were often receiving alarms about infrastructure they were unfamiliar with. Alerting thresholds and escalation rules were applied inconsistently.😖
I was asked to design a solution to meet the following requirements:
- Platform was to be a loosely-coupled integration of best-of-breed open-source tools for monitoring, alarming, metrics, and operational support. Each component was to be updated / improved without impact to the other components.
- Authentication to all tools was to rely on existing LDAP platforms accessed over established remote access.
- All alarms were to link to actionable documentation, and duplication of documentation was to be avoided.
- A single common configuration was to be established (host criticality levels, alarm thresholds, escalations, etc.), and all monitored hosts would have to fit into one of these criticality levels, avoiding “snowflakes”.
- No unnecessary alarms were to be sent (e.g., there is no point alerting an entire team about a degraded RAID array at 2am).
- All monitoring and alerting was to egress a single internal master instance.
- Solution was to be automated and templated such that extending monitoring to a new customer would be a BAU operations task.
- Solution was to be highly available, and to include “monitoring of the monitoring system”
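The common-configuration and "no unnecessary alarms" requirements above can be illustrated with a minimal sketch. The tier names, business hours, and escalation values are hypothetical and are not taken from the original platform; the point is only that every host maps to one shared tier and that the tier, not the individual host, decides whether an alarm pages someone after hours.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class CriticalityLevel:
    """One shared tier; every monitored host must map to exactly one."""
    name: str
    page_after_hours: bool       # may this tier wake someone at night?
    escalate_after_minutes: int  # unacknowledged time before escalation


# Hypothetical tiers and values, for illustration only.
LEVELS = {
    "gold": CriticalityLevel("gold", True, 15),
    "silver": CriticalityLevel("silver", True, 30),
    "bronze": CriticalityLevel("bronze", False, 120),
}

# Assumed business hours (08:00 to 18:00).
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)


def should_page(level_name: str, severity: str, now: time) -> bool:
    """Decide whether an alarm should page the on-call engineer right now."""
    # An unknown tier raises KeyError: no per-host "snowflake" tiers allowed.
    level = LEVELS[level_name]
    if BUSINESS_START <= now < BUSINESS_END:
        return True
    # After hours, only critical problems on tiers that allow paging get through.
    return level.page_after_hours and severity == "critical"
```

Under these assumptions, a degraded RAID array reported as a warning at 2am would be held until business hours, whichever tier the host belongs to.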
Executing monitoring checks
Icinga (core v1) was chosen as the successor to Nagios. Icinga was originally a fork of Nagios, meaning that plugins and configuration would be compatible. Icinga brought additional features, including the ability to “expire” the acknowledgement of a problem after a given period.
(We previously found that engineers might temporarily suppress an alarm with good intentions, but quickly forget about the suppressed alarm, only to be surprised by a production failure.)
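The idea behind an expiring acknowledgement can be sketched in a few lines. This is a plain Python model of the concept, not Icinga's implementation or API; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Acknowledgement:
    author: str
    comment: str
    expires_at: datetime  # after this moment the problem alarms again


def acknowledge(author: str, comment: str, now: datetime,
                lifetime: timedelta) -> Acknowledgement:
    """Create an acknowledgement that lapses after `lifetime`."""
    return Acknowledgement(author, comment, now + lifetime)


def is_suppressed(ack: Optional[Acknowledgement], now: datetime) -> bool:
    """A problem is only silenced while an unexpired acknowledgement exists."""
    return ack is not None and now < ack.expires_at
```

Because the suppression has an end date, a forgotten acknowledgement turns back into a visible alarm rather than staying silent until something fails.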
Given the requirement for tightly controlled configuration, we employed an open-source Nagios configuration platform (NagioSQL), which supported multi-tenancy and a database-driven backend.
We used NagioSQL to create per-platform hostgroups (e.g., “platform-linux”), to which we applied all the necessary monitoring checks. All that was required to bring a new host under established monitoring was to add the host to NagioSQL and associate it with the appropriate hostgroup. All the checks would then be inherited and applied.
Depending on the scale of the customer’s infrastructure, checks would either be executed directly from a master Icinga instance or, for larger customers, via a standalone Icinga instance deployed within the customer’s environment.
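The inheritance model can be sketched as a lookup from hostgroup to checks. This is an illustration in Python rather than NagioSQL or Icinga configuration; the "role-webserver" group and the particular check lists are assumptions made for the example.

```python
from typing import Dict, List

# Checks are attached to hostgroups, never to individual hosts.
# Group contents here are illustrative assumptions.
HOSTGROUP_CHECKS: Dict[str, List[str]] = {
    "platform-linux": ["check_load", "check_disk", "check_ssh"],
    "role-webserver": ["check_http", "check_ssh"],
}


def checks_for_host(hostgroups: List[str]) -> List[str]:
    """Return the checks a host inherits from its hostgroups, without duplicates."""
    inherited: List[str] = []
    for group in hostgroups:
        for check in HOSTGROUP_CHECKS[group]:
            if check not in inherited:
                inherited.append(check)
    return inherited
```

Onboarding a host then amounts to naming its hostgroups, for example `checks_for_host(["platform-linux"])`, and adding a check to a group applies it to every member host at once.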
Customer Icinga instances received configuration pushed out from the multi-tenanted NagioSQL platform. Customers were provisioned with read-only access to their own standalone instances, but all configuration was made by staff via NagioSQL (maintaining auditability and standards).
A series of checks on the master Icinga instance would poll the customer instances for active alarms and present these to the operations center as a single consolidated alarm. The operations center would respond to the consolidated alarm (“there are 3 active faults on customer XYZ”), which linked directly to the customer Icinga instance so that staff could “drill down” into the individual faults.
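A consolidating check of this kind might look like the sketch below. It follows the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), and 2 (CRITICAL); how the fault states are fetched from the customer instance is left out, and the function name, state strings, and URL are hypothetical.

```python
from typing import List, Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def consolidate(customer: str, fault_states: List[str],
                instance_url: str) -> Tuple[int, str]:
    """Summarise a customer's active faults as one exit code and status line.

    `fault_states` holds one entry ("warning" or "critical") per active fault,
    as already collected from the customer instance.
    """
    count = len(fault_states)
    if count == 0:
        return OK, f"OK - no active faults on customer {customer}"
    # The consolidated alarm is as severe as the worst underlying fault.
    code = CRITICAL if "critical" in fault_states else WARNING
    label = "CRITICAL" if code == CRITICAL else "WARNING"
    verb, noun = ("is", "fault") if count == 1 else ("are", "faults")
    message = (f"{label} - there {verb} {count} active {noun} "
               f"on customer {customer}, see {instance_url}")
    return code, message
```

The status line carries the link to the customer instance, so the operations center sees one alarm per customer and follows the link for the detail.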
Linking to documentation
MediaWiki was chosen as the documentation store, using a combination of templates, forms, and “semantic” plugins to produce structured data. Templated pre-plans were created for every alarm type (e.g., “BFD traps received by SNMP”). MediaWiki templating “inference” was used to produce predictable URLs linking documentation for each node to documentation for each alarm, such that an alarm could contain a link to https://wiki/customerxyz/router3/bfd-alarm.
The final documentation presented to the engineer at each URL would be a combination of generic instructions for the alarm, specific comments applicable to each customer, and further specific instructions applicable to each precise node.
As a result of the templating hierarchy, a change to the generic instructions for handling a specific alarm could be made in a single template document, and every URL generated by alarms would reflect the change.
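The layering can be modelled in a few lines of Python. This is not MediaWiki template syntax, only a sketch of the same idea: a predictable URL per customer, node, and alarm, and a page assembled from generic, customer-level, and node-level text. The dictionaries and their placeholder contents are hypothetical.

```python
from typing import Dict, Tuple


def alarm_doc_url(base: str, customer: str, node: str, alarm: str) -> str:
    """Build the predictable documentation URL embedded in an alarm."""
    return f"{base}/{customer}/{node}/{alarm}"


# Placeholder content; one generic entry per alarm type.
GENERIC: Dict[str, str] = {"bfd-alarm": "Generic steps for BFD alarms."}
CUSTOMER_NOTES: Dict[str, str] = {"customerxyz": "Notes specific to customerxyz."}
NODE_NOTES: Dict[Tuple[str, str], str] = {
    ("customerxyz", "router3"): "Notes specific to router3.",
}


def render_doc(customer: str, node: str, alarm: str) -> str:
    """Combine generic, customer-level, and node-level text into one page."""
    sections = [
        GENERIC[alarm],
        CUSTOMER_NOTES.get(customer, ""),
        NODE_NOTES.get((customer, node), ""),
    ]
    return "\n\n".join(section for section in sections if section)
```

Editing the single `GENERIC["bfd-alarm"]` entry changes the rendered page for every customer and node that uses that alarm, which mirrors the single-template update described above.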
Integration of additional tools
Having established the basis for the loosely-coupled platform along with connectivity (RVPN) and authentication, we integrated a growing set of best-of-breed tools. The additional tools provided ops / customer capabilities such as:
- Latency graphing
- Detailed performance monitoring
- Interface and metrics graphing
- IP Address Allocation Management (IPAM)
- Auto-discovery of network nodes, resulting in automatic generation of alarming / monitoring configurations.
Training engineers on a new tool was challenging at first: while powerful, the Icinga UI can be complex and confusing. The hierarchy of templated configuration could be unwieldy without a thorough understanding of the design. The open-source tools employed have their quirks, so deliberate attention to documentation and training was required to manage the platform (it could not be learned intuitively).
It took less than a year to roll the new design out to existing customers, and to deprecate their old, bespoke configurations. We established the basic components of the monitoring platform early on (monitoring and alarming), and developed additional integrations as they became available or were required. Once a customer monitoring instance was established, rolling out improvements or bug fixes became a trivial, automated process.
The updated monitoring platform underpinned all our internal and customer work.
We relied on it entirely to assure us of the state of our own internal platforms (~200 hosts, 2,000 services), as well as customer infrastructure (~2,000 hosts, 20,000 services).
The monitoring platform and the “reverse VPN” solution (described elsewhere) were critical elements in our success in winning a contract to build and support a new nation-wide MPLS network for a power utility.
The investment in templated documentation allowed us to scale our managed services to more customers, without requiring extensive per-customer training for each oncall engineer.
Image courtesy of Arie Wubben