OpenVPN at Scale. Part 2: Monitoring!

First, an embarrassing story.

In part 1 we talked about moving and consolidating OpenVPN and shifting authentication to Okta SSO. That was in early July.

This is the second in a two-part series. Be sure to read Part 1: Okta SSO VPN!

We also opted to use a Let’s Encrypt certificate and, for reasons I can’t remember, didn’t automate certificate renewal.

Fast forward to “early September” and guess what? While users could still make VPN connections, the web UI was inaccessible… because we forgot to renew the certificate.

Embarrassing story aside, it forced us to come up with an improved set of alerts and an overall dashboard.

A couple of the important things we wanted to capture were:

  • Concurrent user count. The OpenVPN license is based on concurrently logged-in users, and we want to know when we are consistently at 80% of that license limit.
  • SSL certificate expiry. Once bitten, twice shy. What more to say?
  • General health. For this, we thought to track connection duration and bandwidth in/out per user.
  • Observability/monitoring. All of this data should be emitted into our time-series platform, Wavefront.
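To make "consistently at 80%" concrete, here is a minimal sketch of the condition in Python. In practice the alert would be written in Wavefront's own query language; the function name, the `min_pct` parameter, and the sample numbers below are hypothetical and only illustrate the logic.

```python
def consistently_near_limit(samples, license_limit, threshold_pct=80, min_pct=90):
    """Return True when at least min_pct percent of the samples are at or
    above threshold_pct percent of the license limit.

    Integer arithmetic is used throughout to avoid float rounding at the edge.
    """
    if not samples:
        return False
    hits = sum(1 for n in samples if n * 100 >= threshold_pct * license_limit)
    return hits * 100 >= min_pct * len(samples)


# Hypothetical readings of n_clients against a hypothetical 20-seat license.
busy_week = [17, 18, 17, 19, 17]
quiet_week = [17, 5, 6, 18]
print(consistently_near_limit(busy_week, 20))
print(consistently_near_limit(quiet_week, 20))
```

Requiring most samples in a window to exceed the threshold, rather than any single one, keeps a brief spike from paging anyone.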

Getting the data

OpenVPN has a command-line utility, sacli, and two options give machine-readable/JSON output:

root@vpn:~# /usr/local/openvpn_as/scripts/sacli --help | grep VPNS
  VPNStatus              -> show current VPN status
  VPNSummary             -> show current VPN summary

VPNSummary prints out the number of currently logged in users:

root@vpn:~# /usr/local/openvpn_as/scripts/sacli VPNSummary
{
  "n_clients": 17
}

While VPNStatus prints out a JSON blob of all connected users, in a format similar to:

    "client_list_header": {
      "Bytes Received": 4,
      "Bytes Sent": 5,
      "Client ID": 9,
      "Common Name": 0,
      "Connected Since": 6,
      "Connected Since (time_t)": 7,
      "Peer ID": 10,
      "Real Address": 1,
      "Username": 8,
      "Virtual Address": 2,
      "Virtual IPv6 Address": 3
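The header maps each column name to its position in a client row, so rows can be turned into dictionaries with a simple index lookup. A minimal sketch, assuming the rows live in a sibling `client_list` array (that key name and the sample values are assumptions; the excerpt above only shows the header):

```python
def decode_clients(status):
    """Turn positional client rows into dicts keyed by column name."""
    header = status["client_list_header"]
    # "client_list" is an assumed key name; adjust it to the real output.
    rows = status.get("client_list", [])
    return [{name: row[idx] for name, idx in header.items()} for row in rows]


# Made-up sample shaped like the header above, trimmed to four columns.
sample = {
    "client_list_header": {
        "Common Name": 0,
        "Real Address": 1,
        "Bytes Received": 2,
        "Bytes Sent": 3,
    },
    "client_list": [["alice", "203.0.113.7:51234", "1024", "2048"]],
}
clients = decode_clients(sample)
print(clients[0]["Common Name"], clients[0]["Bytes Sent"])
```

Looking fields up by name rather than by a hard-coded index keeps the code working if the column order changes between versions.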

Turning it into time-series metrics

We use Telegraf for much of our data collection and ingestion into Wavefront. VPNSummary already outputs nicely formatted JSON that Telegraf can consume.

VPNStatus, on the other hand, required a bit of Python code. We used the Telegraf exec input plugin for both:

[[inputs.exec]]
  commands = ["/usr/local/openvpn_as/scripts/sacli VPNSummary"]
  data_format = "json"
  timeout = "5s"

  ## measurement name suffix (for separating different commands)
  name_suffix = ".summary"
  name_override = "vpn"

[[inputs.exec]]
  commands = ["python /opt/telegraf/"]

  data_format = "influx"
  timeout = "5s"

  ## measurement name override
  name_override = "vpn"
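Because the second command uses data_format = "influx", the helper script has to print InfluxDB line protocol. The script itself is not shown here, so the following is only a sketch of what such a formatter might look like. It assumes each connected client has already been decoded into a dict keyed by the VPNStatus column names; the tag name `user` and the field names are hypothetical.

```python
import time


def escape_tag(value):
    """Backslash-escape commas, equals signs and spaces in a tag value."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def to_line(client, now=None):
    """Format one client record as a line-protocol line (no timestamp)."""
    now = time.time() if now is None else now
    duration = int(now - int(client["Connected Since (time_t)"]))
    tags = "user=" + escape_tag(client["Username"])
    # The trailing "i" marks each field as an integer in line protocol.
    fields = "bytes_received={}i,bytes_sent={}i,duration_seconds={}i".format(
        int(client["Bytes Received"]), int(client["Bytes Sent"]), duration
    )
    return "vpn,{} {}".format(tags, fields)


# Made-up record; the values are strings here, hence the int() conversions.
client = {
    "Username": "alice smith",
    "Bytes Received": "1024",
    "Bytes Sent": "2048",
    "Connected Since (time_t)": "1567296000",
}
print(to_line(client, now=1567299600))
```

The timestamp is left off so that Telegraf stamps each point at collection time.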

What about SSL?

We also added a simple x509_cert check:

[[inputs.x509_cert]]
  ## List certificate sources
  sources = ["https://vpn:443"]
  tls_ca = "/opt/telegraf/lets-encrypt-x3-cross-signed.pem"
  interval = "5m"

  ## Timeout for SSL connection
  # timeout = "5s"
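For a quick manual check outside Telegraf, the same idea can be sketched with Python's standard library. This is an illustration, not part of our setup: the function names are hypothetical, and `fetch_not_after` needs network access, so the example below feeds in a fixed date string instead.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the peer certificate's notAfter string (needs network access)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time (negative if expired)."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


# Fixed example: a notAfter string, evaluated ten days before it lapses.
not_after = "Jan  5 09:34:43 2018 GMT"
ten_days_before = ssl.cert_time_to_seconds(not_after) - 10 * 86400
print(days_until_expiry(not_after, now=ten_days_before))
```

Note that with the default context, the TLS handshake in `fetch_not_after` fails outright on an already expired certificate, which is itself a useful signal.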

Dashboards! Alerts!

The fun part of having the data or metrics available is creating a dashboard and alerts.

The dashboard on the right is meant for a quick health overview of OpenVPN, notably concurrent users and SSL certificate expiration.

And importantly, we wanted to make sure we catch SSL certificate expirations.

Get the code

Since this is probably useful to others, we put our work up on GitHub:

  • scripts: contains a helper Python script to convert the output from VPNStatus
  • telegraf: contains the two Telegraf configuration files
  • wavefront: this has the Wavefront dashboard, in JSON
