20170830_srecon17_2.1
20170830_srecon17_2.1
- Load traffic shedding
- 1 - Timeline
- 2 - Threats
- 3 - Principles
- 4 - Techniques
- 5 - Solutions
- 6 - Impacts on Operations
- Enaabling application-level and business decisions
Load traffic shedding
1 - Timeline
- 2009 - 3 Main Gmail outages
- 2010 - 1st Load sheding implementation
- 2013 - Generaliszation
2 - Threats
- Single Task overload
-
Causing cascading failure
-
Cluster overload
-
Whole Service overload
3 - Principles
Sustain peak performance
Accept you will throw erros
- Task should have a “normal” capacity, what ever the traffic it receives
*
Isolation
Criticality
- QOS of requests (Not all requests are born equals)
- Batch, Async, Sheddable, Critical
Cost based
QPS is not enought to measure cost
-> transition to ‘Cost-per-second’
Example
Same query != Cost
- 1 user with 1 friend -> 1 Q = 0.1
- 1 user with 10 0000 friends -> Q is much eavier
Retries
- Always implement retries logici
Warning: poor retry logic
4 - Techniques
Client-side
* Efficient BUT
* What if not every client implement the stuff, you still get overloaded
Server-side
* The most efficient to mitigate load
Notes:
- Event loop latency monitor (Twisted
in python)
- Resources utilizations
- Rate limiting …
-> you can then “compose” task load tracker with all these metrics
5 - Solutions
Haproxy
- Max-in-flight-requests
- Health-checking
- Circuit-breaker
- Others L4 proxies
Need some tunning,
Software:
* Akka (Connection wrappers)
https://github.com/juju/ratelimit
- Token based implementation
- Client / Server side
- Go based
https://lyft.github.io/envoy
- “Let me wrapp it for you” category
- HTTP/2 + GRPC interface
- Proxy things
- Good for “legacy” transition
https://linked.io
- Same principle that Envoy
- Can be deployed as an “agent”
6 - Impacts on Operations
Batch handling
Enaabling application-level and business decisions
Provision below peak traffic & drop non-priority traffic
-> Allows to control “problems” and not be in an emergency situation