Infrastructure monitoring

We have a server (overwatch) monitoring the rest of our infrastructure. It runs InfluxDB and Grafana. This page describes the configuration of the whole stack.

Overwatch runs InfluxDB to store recorded metrics, and Grafana to visualize them and send alerts. Every server (including overwatch itself) runs Telegraf to gather metrics and send them to InfluxDB.

All the setup of InfluxDB, Telegraf, Grafana, and the Apache reverse proxy is done by Ansible playbooks. See the monitoring-server.yml playbook in the kde-ansible repository, and the roles it references. Don’t change files like grafana.ini or influxdb.conf directly on the server; the changes will be overwritten next time someone applies the playbook.

The Ansible playbooks are complete enough to replicate the environment locally. There is a Vagrantfile in kde-ansible/vagrant/monitoring to automate this.

InfluxDB

The Telegraf instance on each server sends metrics directly to InfluxDB on port 8086. There is no webserver proxying in front of it (maybe there should be?), and no SSL on the InfluxDB port (maybe there should be?).
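The relevant part of each server's telegraf.conf looks roughly like this sketch. The database name "telegraf" matches the database described below; the exact URL is an assumption:

```toml
# Sketch of the InfluxDB output section in /etc/telegraf/telegraf.conf.
# The hostname below is an assumption about how the servers reach overwatch.
[[outputs.influxdb]]
  # Plain HTTP on port 8086: no proxy and no SSL in front of InfluxDB
  urls = ["http://overwatch.kde.org:8086"]
  database = "telegraf"
```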

There is a single database “telegraf”. It has a retention policy “autogen” where received data is stored. This RP has its duration set about as high as the disk capacity allows (currently about 7 months). There is a big margin to reduce it in the future, since we don’t really need to keep high-resolution metrics for anywhere near that long, but if we have the disk space, we might as well use it.
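Reducing the duration when disk space gets tight is a single InfluxQL statement; a sketch (the 210-day figure is illustrative, roughly matching the current ~7 months):

```sql
-- Shrink the autogen RP on the telegraf database (duration value illustrative)
ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 210d
```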

Note

On 2020-05-09 we installed Telegraf on all servers, which nearly doubled the amount of data we store. The new data coming in is larger than the old data being expired by the retention policy, so once in a while we get disk space warnings and need to reduce the retention duration some more.

However, on 2020-07-14 we also got rid of per-core CPU stats, which freed a lot of disk space, so we should be fine for a while.

There is another retention policy “one_month” (duration 30 days), which keeps data at 1-minute resolution, and a third called “old_data” (infinite retention) keeping data at 5-minute resolution. (The one_month RP ended up being useless, since the raw data in autogen is kept for longer than that. It should also be named after its resolution rather than its retention duration, since the duration can be adjusted, but renaming an RP is troublesome.)
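For reference, RPs like these can be created with statements along these lines (REPLICATION 1 is the only sensible value on a single-node setup; the statements are a sketch, not a dump of the live config):

```sql
-- 1-minute-resolution data, kept 30 days
CREATE RETENTION POLICY "one_month" ON "telegraf" DURATION 30d REPLICATION 1
-- 5-minute-resolution data, kept forever
CREATE RETENTION POLICY "old_data" ON "telegraf" DURATION INF REPLICATION 1
```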

Aggregation of raw data into the lower-resolution RPs is done by a large set of continuous queries, one for each measurement. For most measurements we keep min, max, and average, so that a query like max(used) FROM ram can be accurately converted to max(max_used) FROM old_data.ram_agg when querying old aggregated data. (TODO: publish the script to automate creation of these CQs).
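The still-unpublished script mentioned in the TODO might look something like this sketch: it builds one CREATE CONTINUOUS QUERY statement per measurement, keeping min/max/mean of each field so that e.g. max(used) maps onto max(max_used). The function name, parameters, and the "mean_" prefix for averages are assumptions; only the database, RP, and "_agg" naming come from this page.

```python
# Hypothetical sketch of a CQ generator for the telegraf database.
# Field-name prefixes follow the max_used convention described above;
# the "mean_" prefix for averages is an assumption.

def make_cq(measurement, fields, database="telegraf", rp="old_data",
            resolution="5m"):
    """Build a CREATE CONTINUOUS QUERY that downsamples `measurement`
    into rp.<measurement>_agg, keeping min/max/mean of each field."""
    selects = ", ".join(
        f'{agg}("{field}") AS "{agg}_{field}"'
        for field in fields
        for agg in ("min", "max", "mean")
    )
    return (
        f'CREATE CONTINUOUS QUERY "cq_{rp}_{measurement}" ON "{database}" '
        f'BEGIN SELECT {selects} '
        f'INTO "{rp}"."{measurement}_agg" FROM "{measurement}" '
        f'GROUP BY time({resolution}), * END'
    )

if __name__ == "__main__":
    # One statement per measurement; "ram"/"used" as in the example above
    print(make_cq("ram", ["used"]))
```

Running the generated statements against InfluxDB (e.g. via the influx CLI) would then have to be done once per measurement, which is exactly the tedium the script is meant to remove.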

Grafana

Grafana runs on port 3000, bound to localhost only. It’s reverse-proxied by Apache on the status.kde.org virtual host.

Authentication is integrated with MyKDE. Access is restricted to the groups “sysadmins” (which is given Editor access) and “server-status” (which is given Viewer access). The latter lets us give specific non-sysadmin users permission to see existing graphs.
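In grafana.ini terms this is an OAuth-style integration; a rough sketch of what the section might look like, where every URL and the role-mapping expression are assumptions (only the group names and the Editor/Viewer split come from this page):

```ini
; Rough sketch only: URLs and attribute names are assumptions,
; not copied from the real grafana.ini (which Ansible manages anyway).
[auth.generic_oauth]
enabled = true
name = MyKDE
; sysadmins -> Editor, everyone else who gets in (server-status) -> Viewer
role_attribute_path = contains(groups[*], 'sysadmins') && 'Editor' || 'Viewer'
```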

Note that not even sysadmins get the “Admin” role in Grafana. To log in as a Grafana administrator (which should rarely be necessary), use the “admin” account.

Apache

A fairly standard setup. (TODO: document what that means in another page!) SSL certificates come from Let’s Encrypt. Every request to status.kde.org (except the Let’s Encrypt ACME challenge) is proxied to Grafana.
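The virtual host boils down to something like this sketch; the certificate paths and the exact ACME exclusion are assumptions, the Grafana port and hostname match the rest of this page:

```apache
<VirtualHost *:443>
    ServerName status.kde.org

    SSLEngine on
    # Paths assumed to follow the usual Let's Encrypt layout
    SSLCertificateFile /etc/letsencrypt/live/status.kde.org/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/status.kde.org/privkey.pem

    # Serve ACME challenges from disk instead of proxying them
    ProxyPass /.well-known/acme-challenge/ !

    # Everything else goes to Grafana, which listens on localhost only
    ProxyPass / http://localhost:3000/
    ProxyPassReverse / http://localhost:3000/
</VirtualHost>
```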