20180502_talk_4
- CI upgrading K8s
- 1 - Infra overview
- 2 - Philosophy
- 3 - Cluster Setup
- 4 - E2E tests
- 5 - Live node upgrade
CI upgrading K8s
1 - Infra overview
Overview
* 84 Clusters
* 366 AWS accounts
Before
* Accounts per team
* All instances are the same
* PowerUser access to Production
* You built it, you run EVERYTHING
->
K8s
* 1 cluster per product (Multiple Teams)
* Instances are not managed by the teams
* Hands-off approach
* A lot of stuff out of the box
2 - Philosophy
- No pet clusters: No tweaking for 80 clusters
- Always update to the latest k8s release
- Continuous, NOT disruptive
3 - Cluster Setup
3.1 - Overview
- Resources provisioned via the AWS API
- etcd stack runs outside of k8s
- CoreOS-based image
- Multi-AZ worker nodes
- HA control plane behind an ELB
- Cluster config in Git
- e2e tests with Jenkins
- Changes rolled out with the
  Cluster Lifecycle Manager (CLM)
  (open source since Friday, 2018-04-27)
  + cluster metadata (Cluster Registry)
cluster/
  - cluster.yaml
  - etcd-cluster.yaml
  - manifests/
    - ... (services running in k8s)
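To make the layout concrete, a minimal sketch of what a cluster.yaml could contain; the field names below are illustrative assumptions, not the actual CLM schema:

  # Illustrative only -- not the real CLM cluster.yaml schema
  id: aws:123456789012:eu-central-1:my-product   # hypothetical cluster id
  channel: alpha                 # release channel, see 3.3
  region: eu-central-1
  node_pools:
    - name: master
      instance_type: m4.large
      min_size: 2                # HA control plane behind an ELB
      max_size: 2
    - name: worker
      instance_type: m4.xlarge
      min_size: 3                # spread across multiple AZs
      max_size: 21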
3.2 - CLM
The manager looks at:
- API server
- Cluster Registry
- Git repo
Flow:
- User: creates a cluster via the Cluster Registry
- CLM: creates the cluster resources via the AWS API
- CLM: pushes to the Git repo
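A hypothetical sketch of the kind of entry the Cluster Registry holds per cluster; field names are illustrative assumptions, not the real registry schema:

  # Illustrative Cluster Registry entry -- field names are assumed
  id: aws:123456789012:eu-central-1:my-product   # hypothetical
  alias: my-product
  environment: production
  channel: alpha            # maps to a Git "release channel" branch, see 3.3
  lifecycle_status: ready
  api_server_url: https://kube-api.my-product.example.org   # placeholder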
3.3 - Workflow
We use a Git repo whose branches act as "release channels":
- Dev: 1 cluster
- Alpha: 3 clusters
- Beta: 80+ clusters
- Stable
4 - E2E tests
- Upstream Conformance Tests
- StatefulSet tests
- Zalando tests (custom to our integration)
github.com/mikkeloscar/kubernetes-e2e
docker run ... mikkeloscar/kubernetes-e2e:latest \
  -focus "\[Conformance\]" \
  -skip "\[...\]"
5 - Live node upgrade
5.1 - Naive approach
Concept:
* Use autoscaling capabilities
* Add 1 new node (with the new version)
* Drain an old node, let the ASG delete it and create a new one
Issues:
* Volumes across AZs
* No difference between masters / workers
* No clear definition of "OK, ready, do the next one"
Solution: only move on to the next node once all of these report ready:
* Kubelet NodeReady
* ASG InService
* ELB InService
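For the Kubelet signal, the check looks at the node's Ready condition; this is the relevant fragment of a Node object as shown by kubectl get node <name> -o yaml:

  status:
    conditions:
    - type: Ready                  # set by the kubelet
      status: "True"
      reason: KubeletReady
      message: kubelet is posting ready status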
5.2 - PodDisruptionBudget
A k8s resource that limits how many pods of an application can be voluntarily disrupted (e.g. by a node drain) at the same time
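A minimal example manifest, using the policy/v1beta1 API version that was current at the time; name and labels are illustrative:

  apiVersion: policy/v1beta1
  kind: PodDisruptionBudget
  metadata:
    name: my-app-pdb              # illustrative name
  spec:
    minAvailable: 2               # keep at least 2 pods running during node drains
    selector:
      matchLabels:
        application: my-app       # illustrative label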
5.3 - Postgres Operator
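For context: the Postgres Operator (github.com/zalando/postgres-operator) runs PostgreSQL clusters with automated failover, declared as custom resources, so databases survive node drains without manual work. A minimal sketch of such a resource, assuming the acid.zalan.do/v1 API; names and sizes are illustrative:

  apiVersion: acid.zalan.do/v1
  kind: postgresql
  metadata:
    name: acid-minimal-cluster    # illustrative; prefixed with the team id
  spec:
    teamId: acid
    numberOfInstances: 2          # primary + replica, so a drained node can fail over
    volume:
      size: 10Gi
    postgresql:
      version: "10"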
5.4 - Other issues / notes
- Flannel stores its config in etcd
- Took down the internal Docker registry by updating too many nodes at once that did not have every image cached locally ...