

20180502_talk_4

Overview
* 84 Clusters
* 366 AWS accounts

Before
* Accounts per team
* All instances are the same
* PowerUser access to Production
* You build it, you run EVERYTHING

->

K8s
* 1 cluster per product (Multiple Teams)
* Instances are not managed by the teams
* Hands Off approach
* A lot of stuff out of the box

AWS Provision resources
ETCD Stack outside of k8s resources
CoreOS based image
Multi-AZ worker nodes
HA ControlPlane with ELB
Cluster config in Git
e2e tests with Jenkins
Changes rolled out with Cluster Lifecycle Manager (open source since Friday 20180427)

+ ClusterMetadata (Cluster Registry)

cluster/
|
- cluster.yaml
- etcd-cluster.yaml
- Manifests/
  - ... services running in k8s
-

Manager lookup
- API server
- ClusterRegistry
- GitRepo

We use a Git repo with 3 branches as “release channels”
- Dev: 1 cluster
- Alpha: 3
- Beta: 80+
- Stable
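
A rough sketch of how channels like these could gate a rollout. The notes only list the channel names and cluster counts; the promotion order (each change moves from one channel to the next) and the helper `next_channel` are assumptions made for illustration, not the tool's actual behavior.

```python
from typing import Optional

# Channel names as listed in the notes, lowercased.
# Assumption: a change is promoted through them in this order.
CHANNELS = ["dev", "alpha", "beta", "stable"]


def next_channel(current: str) -> Optional[str]:
    """Return the channel a change would be promoted to next, or None."""
    i = CHANNELS.index(current)  # raises ValueError for unknown names
    return CHANNELS[i + 1] if i + 1 < len(CHANNELS) else None


if __name__ == "__main__":
    print(next_channel("dev"))
    print(next_channel("stable"))
```

With the counts above, a change would hit 1 cluster first, then 3, and only then the 80+ clusters on Beta.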

github.com/mikkeloscar/kubernetes-e2e

docker run ... \
  mikkeloscar/kubernetes-e2e:latest \
  -focus "\[Conformance\]" \
  -skip "\[...\]"

Concept:
* Use Autoscaling capabilities
* Add 1 new node (With new version)
* Drain an old node; the ASG deletes it and creates a new one
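
The three steps above can be sketched as a loop. This is a pure simulation under stated assumptions, not the Cluster Lifecycle Manager's code: `Node` and `rolling_update` are hypothetical names, draining is modelled as removing the node from a list, and the last old node is assumed not to be replaced so the pool returns to its original size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    version: str


def rolling_update(nodes, new_version):
    """Simulate replacing old-version nodes one at a time.

    Returns (final_pool, log).
    """
    pool = list(nodes)
    old = [n for n in pool if n.version != new_version]
    log = []
    if not old:
        return pool, log

    # Step 1: add one node with the new version before removing anything.
    pool.append(Node("new-0", new_version))
    log.append("add new-0")

    for i, node in enumerate(old, start=1):
        # Step 2: drain an old node (here: just drop it from the pool).
        pool.remove(node)
        log.append(f"drain {node.name}")
        # Step 3: the ASG replaces the deleted node with a new-version one.
        # Assumption: the last old node is not replaced, so the pool
        # ends at its original size.
        if i < len(old):
            pool.append(Node(f"new-{i}", new_version))
            log.append(f"add new-{i}")
    return pool, log


if __name__ == "__main__":
    start = [Node("a", "v1"), Node("b", "v1"), Node("c", "v1")]
    final, log = rolling_update(start, "v2")
    print(log)
    print([n.version for n in final])
```

Because the extra node is added first, the pool never drops below its original size while old nodes are being drained.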

Issue:
* Volumes across AZs
* No difference between masters and workers
* No clear definition of “OK, ready, do next”

Solution:
* KubeletNodeReady meta
* ASG InService meta
* ELB InService meta
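
One way to read the three signals above is as a single gate that answers “OK, ready, do next”. The sketch below is an illustration under assumptions: `NodeStatus` and `ok_to_continue` are hypothetical names, the booleans stand in for whatever the real checks return, and treating the ELB check as optional (for nodes not registered with an ELB) is a guess rather than something the notes state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeStatus:
    kubelet_ready: bool    # kubelet reports the node as Ready
    asg_in_service: bool   # ASG reports the instance as InService
    # None means the node is not registered with an ELB (assumption).
    elb_in_service: Optional[bool] = None


def ok_to_continue(status: NodeStatus) -> bool:
    """True only when every applicable signal is positive."""
    checks = [status.kubelet_ready, status.asg_in_service]
    if status.elb_in_service is not None:
        checks.append(status.elb_in_service)
    return all(checks)


if __name__ == "__main__":
    print(ok_to_continue(NodeStatus(True, True)))
    print(ok_to_continue(NodeStatus(True, True, elb_in_service=False)))
```

Requiring all signals at once avoids moving on to the next node while the new one is still only partly up.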

Flannel stores its config in ETCD
Took down the internal Docker registry by updating too many nodes at once that did not have every image locally ....