
20180502_talk_4

CI upgrading K8s

1 - Infra overview

Overview
* 84 Clusters
* 366 AWS accounts

Before
* Accounts per team
* All instances are the same
* PowerUser access to Production
* You built it, you run EVERYTHING

->

K8s
* 1 cluster per product (Multiple Teams)
* Instances are not managed by the team
* Hands-off approach
* A lot of stuff out of the box

2 - Philosophy

  • No pet clusters: no individual tweaking of 80 clusters
  • Always update to the latest k8s release
  • Continuous, NOT disruptive

3 - Cluster Setup

3.1 - Overview
  • Resources provisioned via AWS
  • etcd stack runs outside of the k8s resources
  • CoreOS-based image
  • Multi-AZ worker nodes
  • HA control plane behind an ELB
  • Cluster config in Git
  • e2e tests with Jenkins
  • Changes rolled out with the Cluster Lifecycle Manager (open source since Friday 2018-04-27)

+ Cluster metadata (Cluster Registry)

cluster/
- cluster.yaml
- etcd-cluster.yaml
- Manifests/
  - ... services running in k8s

3.2 - CLM

Manager lookup (the CLM watches three sources):
- the API server
- the Cluster Registry
- the Git repo

  1. User: creates a cluster via the Cluster Registry (see the sketch below)
  2. CLM: creates the cluster resources via the AWS API
  3. CLM: pushes to the Git repo
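
As a rough illustration of step 1, registering a new cluster could look like the call below. The hostname, endpoint path, and payload fields are all assumptions made for this sketch, not the actual Cluster Registry API.

# Hypothetical sketch: register a new cluster in the Cluster Registry.
# URL and JSON fields are made up for illustration.
curl -X POST "https://cluster-registry.example.org/kubernetes-clusters" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"alias": "my-product-cluster", "region": "eu-central-1", "channel": "alpha"}'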

3.3 - Workflow

We use a Git repo with branches as “release channels” (promotion sketch below):
- Dev: 1 cluster
- Alpha: 3 clusters
- Beta: 80+ clusters
- Stable
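
One way the promotion can work, assuming the channels above are plain Git branches named dev, alpha, and beta (the merge-based flow itself is an assumption of this sketch):

# Sketch: promote a cluster-config change from the dev channel to alpha.
git checkout alpha
git merge --ff-only dev
git push origin alpha
# Once the alpha clusters pass the e2e tests, repeat for the next channel.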

4 - E2E tests

  • Upstream conformance tests
  • StatefulSet tests
  • Zalando tests (custom to our integration)

github.com/mikkeloscar/kubernetes-e2e

docker run ... \
  mikkeloscar/kubernetes-e2e:latest \
  -focus "\[Conformance\]" \
  -skip "\[...\]"

5 - Live node upgrade

5.1 - Naive approach

Concept (see the sketch below):
* Use the ASG's autoscaling capabilities
* Add 1 new node (with the new version)
* Drain an old node, then let the ASG delete it and bring up a replacement
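
A minimal sketch of that drain-and-replace step; the node name and instance ID are placeholders:

# Cordon and drain the old node so its pods are rescheduled elsewhere.
kubectl drain ip-10-0-1-23.eu-central-1.compute.internal \
  --ignore-daemonsets --delete-local-data
# Terminate the instance without lowering the desired capacity, so the ASG
# launches a replacement from the new launch configuration.
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --no-should-decrement-desired-capacity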

Issues:
* Volumes across AZs (an EBS volume is bound to a single AZ)
* No distinction between masters / workers
* No clear definition of “OK, ready, do next”

Solution (wait for all three before moving to the next node; checks sketched below):
* Kubelet reports NodeReady
* ASG instance is InService
* ELB instance is InService
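
A rough sketch of checking those three signals from a shell; the node, instance, and load-balancer names are placeholders, and in practice this is automated rather than run by hand:

# 1. Kubelet: the node reports Ready.
kubectl get node ip-10-0-1-42.eu-central-1.compute.internal \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# 2. ASG: the instance lifecycle state is InService.
aws autoscaling describe-auto-scaling-instances \
  --instance-ids i-0abcdef1234567890 \
  --query 'AutoScalingInstances[0].LifecycleState'
# 3. ELB: the instance health state is InService.
aws elb describe-instance-health \
  --load-balancer-name my-apiserver-elb \
  --query 'InstanceStates[0].State'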

5.2 - PodDisruptionBudget

A k8s resource that limits how many pods of an application may be voluntarily disrupted (e.g. evicted by a node drain) at the same time; drains respect it, which keeps rolling node upgrades from taking down a whole application.
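
For illustration, a minimal PDB for a hypothetical deployment labelled app=my-api; the name, label, and minAvailable value are made up:

# Sketch: keep at least 2 pods of the hypothetical app running during drains.
# (Newer clusters use apiVersion policy/v1 instead of policy/v1beta1.)
kubectl apply -f - <<EOF
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-api
EOF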

5.3 - Postgres Operator
5.4 - Other issues / notes
  • Flannel stores its config in etcd (example below)
  • Took down the internal Docker registry by updating too many nodes at once that did not have every image cached locally ...
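
For reference, flannel traditionally reads its network config from etcd under the /coreos.com/network prefix (that path is flannel's default and may differ per setup), so a quick way to inspect it is:

# Sketch: inspect flannel's network config stored in etcd (default prefix).
etcdctl get /coreos.com/network/config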