Skip to content

20170901_srecon17_1.5

20170901_srecon17_1.5

Cloudflare planet scale edge network monitoring with Prometheus

0 - Outline

  • What
  • Why
  • Architecture
  • Reducing alert fatigue

Take Away:
* Prometheus is the new GOLD standard
* Good monitoring doesn’t happen for free
* Monitoring is interaction with Human !

1 - What is Prometheus ?

2 - Why ?

  • Simple to operate ad deploy
  • Dynamic config
  • a Query languqge
  • Metrics usage is powerfull

  • Integration

  • Kube

3 - Architecture

Context

Usage
- Monitoring
- NOT long terme metrics storage

Overview
- 188 Prom servers
- 4 Top levels
- 250Gb of data / servers

Edge architecture
- Routing via Anycast
- POPs configured identicly
- POPs are independent

CoreDC
- Apps fore Business

PromQL

Archi

Pop
- Node_exporter running on nodes
- 1 Prom / POP
- Prom pol queries on every nodes of a POP

Core
- Pol POPs Prom

HA
- x Prom in CoreUS
- x Prom in CoreEU
- …

Retention
- 15 Days
- Scrapped every 60s
- Federation every 30s
- No down sampling

Exporters used
- Node Exporter
- Blackbox exporter
- Mtail
- Cadvisor

Deploying Exporters
- Deploy in the same “domain of Failure”

4 - Alerting

AlterManager
- In CORE dc
- Regions reporting to AlertManager

Writin alters rules
- Test query on past data
- Use descriptive names
- Alert reference
- Must have action

Dashboards
- Create DRILL-DOWN dashboards

5 - Monitoring the Monitoring

Each Prom monotir the other Prom: Mesh

6 - Tools

https://github.com/cloudflare/unsee