In appreciation of Prometheus' engineering

Even though I rejected Prometheus as a choice in my last blog post about Netdata, I actually appreciate Prometheus' engineering quality. From its documentation it is apparent that the authors are very experienced in the subject and have thought things through.

This post reviews some of the things that demonstrate that, namely their responses to the push vs pull debate, the way they limit Prometheus' scope, the way their alerting system is designed and documented, and the way they treat storage.

Push vs pull and the myth of unscalability

Prometheus' pull model comes across as a bit unusual at first. Indeed, a Google search reveals that it is one of its most controversial properties. There were myths about Prometheus' pull model being unscalable. However, the Prometheus authors do a good job of dispelling this myth, even for multiple definitions of "unscalable":

  • Some people remember that Nagios, also a pull-based system, was unscalable. The Prometheus authors explain that Nagios' scalability problems were caused by something else, namely its subprocess management model.
  • The Prometheus authors assert that no matter who initiates the connection, the payload is much bigger than the connection establishment overhead.
  • Some people think push is better because you can use UDP to avoid congestion. The Prometheus authors assert that the actual work being done, e.g. persisting metrics to disk, is still much bigger than the TCP/IP overhead. They also provide a back-of-the-envelope calculation on how far you can push the pull model: you can monitor about 10000 machines with a single Prometheus server.
  • Some people are confused about Prometheus' scope, and think Prometheus also handles event-based data (more on that later). The authors state that, indeed, for those kinds of data a pull model would be unscalable, but Prometheus doesn't handle that kind of data.
  • Some people are worried about operational scalability, namely that the Prometheus server needs to know about all the targets. The Prometheus authors provide arguments on why knowing the targets is actually a good thing:

    "If your monitoring system doesn't know what the world should look like and which monitored service instances should be there, how would it be able to tell when an instance just never reports in, is down due to an outage, or really is no longer meant to exist?"
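The back-of-the-envelope reasoning above can be reproduced with a few lines of arithmetic. All figures below are my own illustrative assumptions, not numbers from the Prometheus documentation; the point is only that scrape throughput is a multiplication problem, not a connection-count problem.

```python
# Illustrative back-of-the-envelope calculation for the pull model.
# Every number here is an assumption chosen for illustration.
machines = 10_000          # targets scraped by one Prometheus server
series_per_machine = 700   # assumed number of time series per target
scrape_interval_s = 15     # assumed scrape interval in seconds

samples_per_second = machines * series_per_machine / scrape_interval_s
print(f"{samples_per_second:,.0f} samples/s")  # well within one server's reach
```

Even with these generous assumptions, the resulting ingestion rate is in the hundreds of thousands of samples per second, which is the kind of load a single well-engineered server can sustain.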

Prometheus' authors give additional arguments on why they think pull is "slightly better than push". From their FAQ and blog:

  • You can run your monitoring on your laptop when developing changes.
  • You can more easily tell if a target is down.
  • You can manually go to a target and inspect its health with a web browser.
  • You can replicate your monitoring by just running another Prometheus server, no need to reconfigure your targets.
  • Push makes it easier to take down the monitoring server by sending too much data.
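The "inspect its health with a web browser" point is easy to see in code. Below is a minimal sketch of what a pull-model target looks like, using only the Python standard library; the metric names and values are made up for illustration, and this is not the official Prometheus client library.

```python
# A minimal sketch of a pull-model target: the process exposes its current
# state on an HTTP endpoint, and anyone (a Prometheus server, or you with a
# browser) can scrape it. Metric names/values are illustrative assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(metrics: dict) -> str:
    """Render metrics in the Prometheus text exposition format (name value)."""
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics({"http_requests_total": 1027,
                                   "process_open_fds": 42}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

def main():
    # Point a browser (or a Prometheus scrape job) at http://localhost:8000/metrics
    HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Because the target only serves its current state on request, running a second Prometheus server against it requires no change to the target at all.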

Scope

Monitoring is a big topic. It is easy for a monitoring system to devolve into a jack-of-all-trades with many features. Prometheus sets clear boundaries on its scope: it is only for aggregate time series data. The Prometheus authors define other kinds of monitoring data for the sake of defining their own scope:

  • Event-based data, where each individual event is reported, e.g. HTTP requests, exceptions.
  • Log data.
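The distinction between aggregate time series data and event-based data is worth making concrete. The sketch below is my own illustration, not Prometheus code: an aggregate counter keeps O(1) state no matter how much traffic flows through, while an event pipeline must store and ship every individual event.

```python
# Illustrative sketch of aggregate vs. event-based monitoring data.
# Names and values here are made up for illustration.
class Counter:
    """Aggregate time series data: one number, regardless of traffic volume."""
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1

requests_total = Counter()   # aggregate: what Prometheus scrapes
event_log = []               # event-based: grows with every single request

for status in [200, 200, 404, 200, 500]:
    requests_total.inc()                        # O(1) state, O(1) per scrape
    event_log.append({"type": "http_request",   # O(n) state, O(n) to ship
                      "status": status})

# A scrape transfers one sample; an event pipeline must move all five events.
print(requests_total.value, len(event_log))
```

Pulling the counter's current value every 15 seconds costs the same whether the process handled five requests or five million, which is exactly why the pull model scales for this kind of data and would not for per-event data.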

Storage

Prometheus' storage engine is simple yet well thought through. Its properties are clearly documented.

It is also another example of limiting scope. The authors acknowledge that scalable storage is a very hard problem. Instead of trying to solve that by themselves, e.g. by providing features to replicate, shard and backup storage, they kept Prometheus' own storage deliberately simple: each server stores its data locally on a single node, and anything beyond that, such as long-term retention or clustered storage, is delegated to external systems through the remote read and write interfaces.

Alerting

The documentation about alerting really shows off the Prometheus authors' experience in this area. They separate their alerting system into two parts:

  • The Prometheus server analyzes time series and generates alerts, which are sent to the Alertmanager.
  • The Alertmanager component performs grouping, silencing and inhibition, and forwards alerts to email, Slack, PagerDuty, etc.
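Grouping in particular is easy to illustrate. The following is a toy sketch of the idea, not the real Alertmanager implementation: alerts that share the same values for a configured set of group-by labels are batched into a single notification. The label names are illustrative assumptions.

```python
# Toy sketch of Alertmanager-style grouping (not the real implementation).
# Alerts with identical values for the group-by labels share one notification.
from collections import defaultdict

def group_alerts(alerts, group_by=("alertname", "cluster")):
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "InstanceDown", "cluster": "eu", "instance": "a"}},
    {"labels": {"alertname": "InstanceDown", "cluster": "eu", "instance": "b"}},
    {"labels": {"alertname": "HighLatency",  "cluster": "us", "instance": "c"}},
]

groups = group_alerts(alerts)
# Two notifications go out instead of three: the EU InstanceDown alerts batch.
print({key: len(batch) for key, batch in groups.items()})
```

During a large outage this is the difference between one page saying "50 instances down in cluster eu" and 50 separate pages.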

The documentation clearly explains what grouping, silencing and inhibition mean and why they are useful to the user. Additionally, the authors have documented many alerting best practices, with specific advice for different kinds of workloads:

"Aim to have as few alerts as possible, by alerting on symptoms that are associated with end-user pain rather than trying to catch every possible way that pain could be caused. Alerts should link to relevant consoles and make it easy to figure out which component is at fault. Allow for slack in alerting to accommodate small blips."

So Prometheus is awesome

Now, if only it shipped with some sort of usable dashboard by default…