20170901_srecon17_1.5
- Cloudflare planet scale edge network monitoring with Prometheus
- 0 - Outline
- 1 - What is Prometheus ?
- 2 - Why ?
- 3 - Architecture
- 4 - Alerting
- 5 - Monitoring the Monitoring
- 6 - Tools
Cloudflare planet scale edge network monitoring with Prometheus
0 - Outline
- What
- Why
- Architecture
- Reducing alert fatigue
Take Away:
* Prometheus is the new GOLD standard
* Good monitoring doesn’t happen for free
* Monitoring is about interaction with humans!
1 - What is Prometheus ?
2 - Why ?
- Simple to operate and deploy
- Dynamic config
- A query language
- Metrics usage is powerful
- Integration
- Kube
- …
3 - Architecture
Context
Usage
- Monitoring
- NOT long-term metrics storage
Overview
- 188 Prom servers
- 4 Top levels
- 250 GB of data per server
Edge architecture
- Routing via Anycast
- POPs configured identically
- POPs are independent
CoreDC
- Apps for business
PromQL
Architecture
POP
- node_exporter running on nodes
- 1 Prom / POP
- Prom polls every node of its POP
Core
- Polls the POPs' Prom servers
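A minimal sketch of how a core Prom could pull from a POP Prom, assuming this is done through Prometheus federation (the `/federate` endpoint with `match[]` selectors). The hostname and the `job="node"` selector are made up for illustration, not taken from the talk.

```python
from urllib.parse import urlencode


def federate_url(pop_prom: str, matchers: list[str]) -> str:
    """Build the /federate URL a core Prom would scrape on one POP Prom."""
    # Each series selector is passed as a repeated match[] query parameter.
    query = urlencode([("match[]", m) for m in matchers])
    return f"http://{pop_prom}/federate?{query}"


# Hypothetical POP hostname and selector, for illustration only.
url = federate_url("prom.pop1.example:9090", ['{job="node"}'])
print(url)
```

In a real setup this would live in the core server's scrape config rather than in code; the sketch only shows what ends up on the wire.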
HA
- x Prom in CoreUS
- x Prom in CoreEU
- …
Retention
- 15 Days
- Scraped every 60s
- Federation every 30s
- No downsampling
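Quick arithmetic on what these numbers imply per time series, assuming every scrape succeeds and nothing is downsampled (a back-of-the-envelope sketch, not a figure from the talk):

```python
RETENTION_DAYS = 15      # retention from the notes
SCRAPE_INTERVAL_S = 60   # scrape interval from the notes

# One sample per scrape, kept for the whole retention window.
samples_per_series = RETENTION_DAYS * 24 * 3600 // SCRAPE_INTERVAL_S
print(samples_per_series)  # 21600
```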
Exporters used
- Node Exporter
- Blackbox exporter
- mtail
- cAdvisor
Deploying Exporters
- Deploy in the same “failure domain”
4 - Alerting
Alertmanager
- In the core DC
- Regions reporting to Alertmanager
Writing alert rules
- Test query on past data
- Use descriptive names
- Alert reference
- Must have action
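The checklist above could be enforced with a small lint step. This is only a sketch: the `alert` and `annotations` keys follow the Prometheus rule format, but the `runbook_url` and `action` annotation names and the name-length threshold are assumptions made for illustration.

```python
def lint_alert_rule(rule: dict) -> list[str]:
    """Return a list of problems found in one alert rule (empty if none)."""
    problems = []
    # Hypothetical heuristic: very short names are rarely descriptive.
    if len(rule.get("alert", "")) < 8:
        problems.append("name is not descriptive")
    annotations = rule.get("annotations", {})
    # Assumed annotation holding the alert reference (e.g. a runbook link).
    if not annotations.get("runbook_url"):
        problems.append("missing alert reference")
    # Assumed annotation telling the on-call person what to do.
    if not annotations.get("action"):
        problems.append("missing action")
    return problems


good = {
    "alert": "NodeExporterDown",
    "expr": 'up{job="node"} == 0',
    "annotations": {
        "runbook_url": "https://wiki.example/alerts/NodeExporterDown",
        "action": "Check the node_exporter service on the host",
    },
}
bad = {"alert": "Down", "expr": "up == 0"}

print(lint_alert_rule(good))
print(lint_alert_rule(bad))
```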
Dashboards
- Create DRILL-DOWN dashboards
5 - Monitoring the Monitoring
Each Prom monitors the other Proms: a mesh
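A minimal sketch of the mesh idea, assuming it means every Prom scrapes every other Prom. The server names are hypothetical.

```python
def mesh_targets(proms: list[str]) -> dict[str, list[str]]:
    """For each Prom server, list the other Prom servers it should scrape."""
    return {p: [q for q in proms if q != p] for p in proms}


# Hypothetical server names, for illustration only.
targets = mesh_targets(["prom-us:9090", "prom-eu:9090", "prom-asia:9090"])
for server, peers in targets.items():
    print(server, "scrapes", peers)
```

With N servers this gives N * (N - 1) scrape relationships, so any one server going down is seen by all the others.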