Skip to content

20170830_srecon17_2.1

20170830_srecon17_2.1

Load traffic shedding

1 - Timeline

  • 2009 - 3 Main Gmail outages
  • 2010 - 1st Load sheding implementation
  • 2013 - Generaliszation

2 - Threats

  • Single Task overload
  • Causing cascading failure

  • Cluster overload

  • Whole Service overload

3 - Principles

Sustain peak performance

Accept you will throw erros

  • Task should have a “normal” capacity, what ever the traffic it receives
    *
Isolation
Criticality
  • QOS of requests (Not all requests are born equals)
  • Batch, Async, Sheddable, Critical
Cost based

QPS is not enought to measure cost
-> transition to ‘Cost-per-second’

Example

Same query != Cost

  • 1 user with 1 friend -> 1 Q = 0.1
  • 1 user with 10 0000 friends -> Q is much eavier
Retries
  • Always implement retries logici

Warning: poor retry logic

4 - Techniques

Client-side
* Efficient BUT
* What if not every client implement the stuff, you still get overloaded

Server-side
* The most efficient to mitigate load

Notes:
- Event loop latency monitor (Twisted in python)
- Resources utilizations
- Rate limiting …

-> you can then “compose” task load tracker with all these metrics

5 - Solutions

Haproxy
- Max-in-flight-requests
- Health-checking
- Circuit-breaker

  • Others L4 proxies

Need some tunning,

Software:
* Akka (Connection wrappers)

https://github.com/juju/ratelimit
- Token based implementation
- Client / Server side
- Go based

https://lyft.github.io/envoy
- “Let me wrapp it for you” category
- HTTP/2 + GRPC interface
- Proxy things
- Good for “legacy” transition

https://linked.io
- Same principle that Envoy
- Can be deployed as an “agent”

6 - Impacts on Operations

Batch handling

Enaabling application-level and business decisions

Provision below peak traffic & drop non-priority traffic

-> Allows to control “problems” and not be in an emergency situation