Ever wondered how to check the status of your services, visualize CPU utilization and memory consumption under heavy workloads, or get alerted when request counts look abnormal? This is the space of observability, and in this article we’re going to not only deploy a production-ready observability solution to the internet, but also develop and deploy an application to put it to use.
We will set up Prometheus and Grafana with TLS support and basic authentication. These applications will be hosted on Hetzner behind Cloudflare, using NGINX as our reverse proxy. We will break down the architecture and each of its components, and look at alternatives that may fit better in your setup.
Here is a look at the final dashboard we will see for our demo application in Grafana.
I knew I had the following constraints: everything needed to live under my existing domain, ryan-schachte.com, using sub-domains like metrics.ryan-schachte.com and monitoring.ryan-schachte.com.
Let’s break down the architecture into smaller chunks. At the center is a demo application exposing a /metrics endpoint that Prometheus will scrape. From here, we can start visualizing almost anything imaginable.

The first thing we need to do is create some TLS certificate(s) for our site. You can either segregate the certificates to match each subdomain directly or use a wildcard matcher to reduce the number of certificates you need to manage. I started with the former and realized the latter is much easier for long-term management.
In this case, I’m using Cloudflare, but you can just as easily use self-signed certificates via OpenSSL or run Let’s Encrypt yourself if you don’t want to deal with Cloudflare as your third-party TLS manager.
In the above, we want to leverage Full encryption for TLS. This ensures that the communication from the client all the way to the origin (NGINX) is encrypted. If you manage and control the origin server, Full is a great option. If you want TLS but cannot modify the TLS settings on the origin, then something like Flexible might be better. Flexible lets you keep TLS for the client, with termination happening at the Cloudflare layer. The downside is that the communication from Cloudflare to the origin runs over HTTP, which exposes the application to MITM attacks.
Within your zone, let’s create a TLS certificate.
We want to specify a wildcard matcher to be a catch-all for all the subdomains we have. This keeps future TLS management simple.
From here, you’ll be able to grab the key (keep this private) and the origin certificate (this can be public).
Let’s save these files somewhere safe because they will become relevant as we continue deeper into this tutorial.
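The NGINX and docker-compose snippets later in this article reference these as universal_cert.pem and universal_key.pem, so one option is to drop the values from the Cloudflare dashboard into files with those names on the VPS (the editor and permissions below are just a suggested sketch):

```bash
# Create these two files on the VPS and paste in the values from the
# Cloudflare dashboard; nginx.conf and docker-compose.yaml reference them by name.
nano universal_cert.pem   # origin certificate (can be public)
nano universal_key.pem    # private key (keep this secret)

# Restrict read access to the private key
chmod 600 universal_key.pem
```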
Prometheus configuration will be done in two parts:

- prometheus.yml
- docker-compose.yaml
We’ll keep the configuration fairly vanilla, but explain each of the blocks:
```yaml
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

alerting:
  alertmanagers:
    - static_configs:
        - targets:

rule_files:

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
```
- scrape_interval controls how often Prometheus scrapes its targets (every 15 seconds here instead of the 1-minute default).
- evaluation_interval specifies how often we evaluate rules as we modify our configuration.

We will keep this file on the VPS and mount it via docker-compose so we can easily modify it as we add more nodes or adjust the configuration in the future.
Grafana configuration will be done in two parts. Because we do TLS termination at the reverse proxy layer, configuration is actually pretty easy for Grafana. We will maintain a fairly vanilla setup for the purposes of this tutorial:

- grafana.ini
- docker-compose.yaml
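The docker-compose file further down mounts a local grafana.ini into the container. Since TLS terminates at NGINX, a minimal sketch like the following is usually enough; the domain and root_url values are assumptions based on the sub-domain used in the NGINX config, so adjust them to yours:

```bash
# Write a minimal grafana.ini next to docker-compose.yaml
cat > grafana.ini <<'EOF'
[server]
domain = grafana.ryan-schachte.com
root_url = https://grafana.ryan-schachte.com/
EOF
```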
NGINX is a big component of this architecture because it handles TLS termination for us. If you’re unfamiliar with TLS termination, the main idea is that the encrypted tunnel between the client and the server ends here. That means any data forwarded from NGINX to the application containers is unencrypted.
As mentioned above, this is OK because the app server runs adjacent to NGINX on the same node, so opportunities for things like MITM (man-in-the-middle) attacks aren’t relevant.
If you proxy data from NGINX to nodes outside of the host, then you want to ensure you have firewalled and subnetted your services appropriately when terminating at the reverse proxy layer. This VPC/subnet configuration is beyond the scope of this tutorial. If this applies to you and you want a simple config, look into ufw and whitelist the IP of your NGINX reverse proxy on the app server nodes to prevent external access from unwanted parties.
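As a rough sketch of that ufw approach on an app server node (the 10.0.0.5 address and port 2112 are placeholders for your NGINX host's IP and your application's port):

```bash
# Allow the reverse proxy host to reach the app port,
# deny that port to everyone else, then enable the firewall.
sudo ufw allow from 10.0.0.5 to any port 2112 proto tcp
sudo ufw deny 2112/tcp
sudo ufw enable
```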
Let’s take a peek at the nginx.conf we’ll use to override the defaults to better fit our application.
```nginx
events {
    worker_connections 4096;  ## Default: 1024
}

http {
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    upstream metrics {
        server localhost:3000;
    }

    upstream prom {
        server localhost:9090;
    }

    server {
        server_name grafana.ryan-schachte.com;
        listen 443 ssl;
        ssl_certificate     universal_cert.pem;
        ssl_certificate_key universal_key.pem;
        ssl_protocols       TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers         HIGH:!aNULL:!MD5;

        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $http_host;

        location / {
            add_header Content-Security-Policy "script-src: 'unsafe-eval' 'unsafe-inline';";
            proxy_pass http://metrics;
        }
    }

    server {
        server_name prometheus.ryan-schachte.com;
        listen 443 ssl;
        ssl_certificate     universal_cert.pem;
        ssl_certificate_key universal_key.pem;
        ssl_protocols       TLSv1 TLSv1.1 TLSv1.2;
        ssl_ciphers         HIGH:!aNULL:!MD5;

        location / {
            auth_basic "Administrator's Area";
            auth_basic_user_file .htpasswd;
            proxy_pass http://prom;
        }
    }
}
```
As mentioned previously, NGINX will be our reverse proxy that also handles TLS termination. As a result, we will use the previously created TLS certificates from Cloudflare to handle that termination. This allows us to visit our routes with the https:// prefix and encrypt our traffic from the client to the origin.
The rest is fairly self-explanatory, but it’s worth pointing out a few things:

- The proxy_pass directives will forward traffic to the appropriate Docker containers.
- Prometheus is protected by basic authentication. This uses a .htpasswd file to manage the users who are able to access the server, with passwords generated by htpasswd using bcrypt (the sketch after this list shows the exact command).
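A quick sketch of generating that file; the admin username is just an example, and htpasswd ships with the apache2-utils package on Debian-based systems:

```bash
# Install the htpasswd utility (Debian/Ubuntu)
sudo apt-get install -y apache2-utils

# Create .htpasswd with a bcrypt-hashed password for an example "admin" user
# (-c creates the file, -B selects bcrypt; you will be prompted for the password)
htpasswd -c -B .htpasswd admin

# Move the resulting file to wherever auth_basic_user_file points (e.g. /etc/nginx/)
```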
In newer versions of Grafana, a bug was introduced affecting Grafana servers behind a reverse proxy. To ensure this works correctly, it’s imperative to add the following line:
proxy_set_header Host $http_host;
My VPS has Docker CE installed. For homelabbing, I personally find Docker is still great for my needs. I’ve deployed Kubernetes in the past and find the overhead unnecessarily complicated for lightweight app server deployments.

With Docker Compose, we can manage all the related services in a single file and deploy them together. It’s on my to-do list to automate these deployments from a GitHub Action or similar CI system, but we’ll keep it simple for this tutorial.
```yaml
volumes:
  prometheus_data: {}
  grafana_storage: {}

services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - net

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prom.yml:/etc/prometheus/prometheus.yml
      - ./universal_cert.pem:/etc/prometheus/cert.pem
      - ./universal_key.pem:/etc/prometheus/key.pem
      - ./web.yml:/etc/prometheus/web.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - net

  grafana:
    image: grafana/grafana-enterprise
    container_name: grafana
    volumes:
      - grafana_storage:/var/lib/grafana
      - ./grafana.ini:/etc/grafana/grafana.ini
      - ./universal_cert.pem:/etc/grafana/cert.pem
      - ./universal_key.pem:/etc/grafana/key.pem
    restart: unless-stopped
    ports:
      - "3000:3000"
    networks:
      - net

networks:
  net: {}
```
We’ll discuss some key points here, but the file is self-explanatory for those familiar with docker-compose:

- --web.enable-lifecycle will allow us to apply Prometheus configuration changes via an HTTP POST without requiring an entire app server restart (see the reload example below).
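With that flag enabled, applying a config change looks something like this (run on the VPS against the published port, which bypasses the NGINX basic auth):

```bash
# Re-read prometheus.yml without restarting the container
curl -X POST http://localhost:9090/-/reload
```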
Let’s take advantage of our new setup by collecting application metrics and having Prometheus scrape them. We will then visualize the data in Grafana.

Let’s create a new Go application. We will initialize it using go mod init github.com/schachte/prometheus-article. From here we can add the libraries needed to instrument metrics.
I’m not going to reinvent the wheel here, so following the basic https://prometheus.io/docs/guides/go-application/ guide will be great for getting the initial project set up.
Install the required dependencies:
```bash
go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promauto
go get github.com/prometheus/client_golang/prometheus/promhttp
```
main.go
```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Run this infinitely to demonstrate metrics collection
// at a larger scale
func recordMetrics() {
	go func() {
		for {
			opsProcessed.Inc()
			time.Sleep(2 * time.Second)
		}
	}()
}

// This is the counter metric
var (
	opsProcessed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "schachte_processed_ops_total",
		Help: "The total number of processed events",
	})
)

func main() {
	recordMetrics()

	// Host the metrics endpoint
	// This is what the main Prometheus server will scrape once we configure
	// a new scrape job in the Prometheus YAML
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}
```
Let’s explain this code a bit. The main function will kick off once we run the binary and we will host a metrics endpoint on port 2112 until the process is killed. We’ll handle the networking for this shortly.
You’ll notice we invoke an infinite loop that emits a counter metric every 2 seconds. This is cool because we can use this metric to visualize its rate of increase in Grafana.

We won’t go into detail on how Prometheus works and what metric types are supported, but we can assume a counter metric is an ever-increasing value. We can leverage this value to understand percentiles, rates of increase, etc. for particular time ranges and much more.
Now that we have a new server, let’s better understand the configuration updates and networking to scrape these metrics and pull them into Prometheus.
```yaml
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

alerting:
  alertmanagers:
    - static_configs:
        - targets:

rule_files:

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "demo"
    static_configs:
      - targets: ["metrics-demo:2112"]
```
The addition to the prometheus.yml above is the new metrics-demo job. Let’s point out a couple of things:

- If the demo application runs on a different host, replace metrics-demo with the IP address of that node.
- The application exposes its metrics on port 2112. We will assume that we port forward 2112 to 2112 outside of the container to keep the port mapping easier to remember.

This is the beauty of Prometheus because this is all we need to do. Shortly, we will see how to validate the correctness of this configuration via the Prometheus dashboard; the sketch below also shows how to lint the file before deploying it.
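One option for linting, assuming the config lives at ./prom.yml as in the docker-compose.yaml above, is to run promtool from the official image:

```bash
# Validate the scrape config with promtool (bundled in the prom/prometheus image)
docker run --rm --entrypoint promtool \
  -v "$PWD/prom.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus check config /etc/prometheus/prometheus.yml
```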
Now that we created a simple binary to deploy, let’s Dockerize it. This will simplify how we can pull and deploy the code onto our VPS.
```dockerfile
FROM golang:1.18-bullseye

ENV GO111MODULE=on
ENV GOFLAGS=-mod=vendor

WORKDIR "/app"

COPY . ./
RUN go mod vendor
RUN go build -o metrics .

CMD ["./metrics"]
```
We’ll ensure we have Go present by using the Go Bullseye base image. From here, it’s just a matter of copying the files into the image we want to build, installing the dependencies and setting the default entry point.
docker build -t metrics-demo:latest .
If you’re building this image on an M1 Mac and running it on an AMD64-based Linux system, you will need to adjust the build parameters accordingly.
docker buildx build --platform linux/amd64 -t metrics-demo:latest .
We can then push this image to our public or private registry using something like
```bash
docker login
# Tag the image with your registry/namespace before pushing
docker tag metrics-demo:latest <registry>/metrics-demo:latest
docker push <registry>/metrics-demo:latest
```
In my case, I will be using my own private Docker registry, but you are also free to use the public Docker Hub registry.
We can adjust our docker-compose.yaml file and verify we can hit the metrics endpoint:
```yaml
...
  metrics-demo:
    restart: always
    image: metrics-demo-amd:latest
    ports:
      - 2112:2112
    # Join the same compose network as Prometheus so the "metrics-demo" hostname resolves
    networks:
      - net
```
Then run curl localhost:2112/metrics to see the metrics output on stdout. You should see output like this:
```text
....
process_virtual_memory_max_bytes 1.8446744073709552e+19
promhttp_metric_handler_requests_in_flight 1
promhttp_metric_handler_requests_total{code="200"} 1
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
schachte_processed_ops_total 845
....
```
We can redeploy and verify the node is online via Prometheus:
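If you prefer to verify from the terminal rather than the Prometheus UI, you can also hit the targets API through NGINX (the admin user here is just whatever you added to .htpasswd):

```bash
# List scrape targets and their health; curl will prompt for the basic auth password
curl -s -u admin https://prometheus.ryan-schachte.com/api/v1/targets | grep -o '"health":"[^"]*"'
```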
From here, let’s query the data in Prometheus to see our metric increasing.
Let’s check out the average rate of increase over the last 1 minute. This should match our code (1 every 2 seconds).
rate will show us the average per-second rate of increase of the samples collected from the range vector(s) present in the time series data queried.
For 1 minute we would have 4 samples if we scrape every 15 seconds (15 * 4 = 60 seconds). rate will average these totals together and give us the per-second rate of increase. Multiplying the result by 60 gets us the per-minute average. Since the metrics emit at a constant rate, we see a straight line around 30.
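As a concrete sketch, the same query can be run against the Prometheus HTTP API from the command line; again, admin is just the basic auth user you created:

```bash
# Per-minute rate of increase of our demo counter over the last minute
curl -s -u admin -G 'https://prometheus.ryan-schachte.com/api/v1/query' \
  --data-urlencode 'query=rate(schachte_processed_ops_total[1m]) * 60'
```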
Let’s build the following: an uptime panel for our demo service and a panel showing the request count over the last 24 hours. The final dashboard will look like this:
We will add our Prometheus data source. In our case, we have already set up and deployed Prometheus behind our domain, so we can use the domain directly. Depending on your setup, or if you're following along locally, you could also target localhost or the IP address of your node.
Take note that we have Basic Auth Details filled out. This is because we have auth in front of Prometheus via NGINX. From here, we can begin creating a new dashboard.
Let's start with understanding system uptime and visualizing when our node goes down. Prometheus provides a query that tells us whether a node is up or down: up{}. You can get more specific and ask which service is up or down, and the query will return a 1 for online and a 0 for offline (up{instance="metrics-demo:2112", job="demo"}).
So how can we leverage this to build a fancy uptime monitoring panel?
Because Prometheus is our data source, we can plug the PromQL straight into Grafana and see the translation directly.
You can get fancier and modify the value mappings in the side-panel to get textual mappings from number to English like so:
Next, let's look at the total request count over the last 24 hours.
After choosing Prometheus as your data source, you can plug in the query increase(schachte_processed_ops_total[24h]) to see the total increase over a 24-hour period. You can tweak some of the naming and coloring in the side-panel.
Be sure to check out the Prometheus docs to see what other cool metrics you can instrument and visualize!
Prometheus and Grafana are awesome tools to have in your observability toolbelt. Not only are they used in the enterprise world, they work great for hobby projects too. I've posted all the relevant code below. Be sure to leave a comment or email me if you have any questions!