

20170901_srecon17_1.5

Takeaways:
* Prometheus is the new GOLD standard
* Good monitoring doesn’t happen for free
* Monitoring is about interaction with humans!

Usage
- Monitoring
- NOT long-term metrics storage

Overview
- 188 Prom servers
- 4 Top levels
- 250 GB of data per server

Edge architecture
- Routing via Anycast
- POPs configured identically
- POPs are independent

CoreDC
- Apps for the business

PromQL

POP
- node_exporter running on nodes
- 1 Prom / POP
- Prom polls (scrapes) every node of its POP

Core
- Polls the POPs' Prom servers
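
A minimal sketch of what "core polls the POP Prom" could look like, assuming the core servers use Prometheus's `/federate` endpoint with `match[]` selectors. The host name, the selector and the default port 9090 are illustrative assumptions, not values from the talk.

```python
from typing import Iterable
from urllib.parse import urlencode


def federate_url(pop_host: str, selectors: Iterable[str], port: int = 9090) -> str:
    """Build the /federate URL a core server would pull from a POP server."""
    # Each series selector is sent as a repeated match[] query parameter.
    query = urlencode([("match[]", s) for s in selectors])
    return f"http://{pop_host}:{port}/federate?{query}"


# Hypothetical POP host and selector, for illustration only.
url = federate_url("prom.pop1.example.com", ['{job="node"}'])
print(url)
```

In a real deployment the same selectors would normally live in the core server's scrape configuration rather than in a script.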

HA
- x Prom in CoreUS
- x Prom in CoreEU
- …

Retention
- 15 Days
- Scraped every 60s
- Federation every 30s
- No down sampling
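
Since nothing is downsampled, the retention figures above translate directly into collection cycles kept per series. A rough sketch of that arithmetic, assuming no gaps in scraping (so these are upper bounds); the function name is made up for illustration.

```python
RETENTION_DAYS = 15
SCRAPE_INTERVAL_S = 60
FEDERATION_INTERVAL_S = 30
SECONDS_PER_DAY = 24 * 60 * 60


def cycles_in_retention(retention_days: int, interval_s: int) -> int:
    """Number of collection cycles that fit in the retention window."""
    return retention_days * SECONDS_PER_DAY // interval_s


scrapes = cycles_in_retention(RETENTION_DAYS, SCRAPE_INTERVAL_S)
federations = cycles_in_retention(RETENTION_DAYS, FEDERATION_INTERVAL_S)
print(scrapes, federations)  # 21600 43200
```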

Exporters used
- Node Exporter
- Blackbox exporter
- mtail
- cAdvisor

Deploying Exporters
- Deploy in the same “failure domain”

Alertmanager
- In the core DC
- Regions report to Alertmanager

Writing alert rules
- Test query on past data
- Use descriptive names
- Alert reference
- Every alert must have an action
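
The checklist above could be enforced mechanically. A hedged sketch: `alert`, `expr` and `annotations` are real Prometheus rule fields, but the annotation keys `reference` and `action`, the "descriptive name" heuristic and the example rule are assumptions made up for illustration.

```python
import re

# Hypothetical annotation keys standing in for "alert reference" and "action".
REQUIRED_ANNOTATIONS = ("reference", "action")


def lint_alert_rule(rule: dict) -> list:
    """Return a list of problems found in one alert rule (empty if none)."""
    problems = []
    name = rule.get("alert", "")
    # Crude heuristic: a descriptive name has at least two CamelCase or
    # snake_case words, e.g. "PopPrometheusDown" but not "Alert1".
    words = re.findall(r"[A-Z][a-z0-9]+|[a-z0-9]+", name)
    if len(words) < 2:
        problems.append(f"name {name!r} is not descriptive")
    annotations = rule.get("annotations", {})
    for key in REQUIRED_ANNOTATIONS:
        if not annotations.get(key):
            problems.append(f"missing annotation {key!r}")
    return problems


good_rule = {
    "alert": "PopPrometheusDown",
    "expr": 'up{job="prometheus"} == 0',
    "annotations": {
        "reference": "link to the alert documentation",
        "action": "check the Prometheus process on the POP",
    },
}
bad_rule = {"alert": "Alert1", "expr": "up == 0"}

print(lint_alert_rule(good_rule))
print(lint_alert_rule(bad_rule))
```

Testing the query on past data still has to happen against a live server; this only covers the naming and annotation points.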

Dashboards
- Create DRILL-DOWN dashboards

Each Prom monitors the other Proms: mesh
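
A small sketch of the mesh idea: every server gets all the other servers as scrape targets. Server names are hypothetical.

```python
def mesh_targets(servers: list) -> dict:
    """Map each server to the list of peers it should monitor."""
    return {s: [p for p in servers if p != s] for s in servers}


# Hypothetical server names, for illustration only.
mesh = mesh_targets(["prom-us-1", "prom-us-2", "prom-eu-1"])
for server, peers in mesh.items():
    print(server, "monitors", peers)
```

With n servers this gives n * (n - 1) monitoring relationships, so a full mesh is only practical for a small set such as the core servers.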