Helm Feature Proposal: Fail-Faster Testing


黃健毅 <johnny.jy.ooi@...>
 

Good morning/afternoon/evening (delete as applicable)

I've been directed to this mailing list by Martin Hickey and Matt Fisher regarding a proposal I briefly mentioned at https://github.com/helm/helm/issues/8861

I will go into more detail in this email.

_____

Background

When deploying a Helm chart, you normally need to wait for Helm to deploy the resources, then use something like "kubectl rollout status deployments/{deploymentname}" to wait for the rollout to finish before running "helm test" on the chart. If you ran the Helm test before the rollout had completed, and your test accesses the application via the cluster service (ClusterIP/NodePort/LoadBalancer), the test traffic would be split between the old and new versions of the chart.
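For reference, the wait-then-test sequence described above could be scripted roughly as follows. This is only a sketch, not how anyone is required to do it: the release, chart and deployment names are placeholders, the "run" parameter is a hypothetical hook so the sequence can be exercised without a cluster, and calling "helm rollback" without a revision assumes Helm 3, where that means "previous release".

```python
import subprocess
from typing import Callable, Sequence


def run_command(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited successfully."""
    return subprocess.run(list(cmd)).returncode == 0


def upgrade_wait_test(release: str, chart: str, deployment: str,
                      run: Callable[[Sequence[str]], bool] = run_command) -> bool:
    """Upgrade, wait for the whole rollout, test, and roll back on failure."""
    if not run(["helm", "upgrade", release, chart]):
        return False
    ok = (
        # Blocks until every pod of the deployment has been replaced.
        run(["kubectl", "rollout", "status", f"deployment/{deployment}"])
        # Only now is it safe to test through the cluster service.
        and run(["helm", "test", release])
    )
    if not ok:
        # Assumption: Helm 3, where omitting the revision rolls back
        # to the previous release.
        run(["helm", "rollback", release])
    return ok
```

Note that the test is the very last step, so a broken update is only noticed after every pod has already been replaced.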

And herein lies the problem. If you have a chart with one or both of the following characteristics:

1) has a large number of replicas
2) has a slow readiness process (e.g. due to cluster-like syncing behaviour such as ElasticSearch)

then the rollout will take a long time. During that time you can't run "helm test" because of the "split brain" situation and the potentially inconsistent results.

If you wait until the rollout finishes, then run the test, and find the test has failed (e.g. because the update is completely broken), you then need to roll back the deployment. Rolling the pods back to the previous version takes the same amount of time as it took to update them to the new version. And when you have spent 30 minutes updating the pods, the last thing you want is to have to do it all again.

Meanwhile, from the end user/customer perspective, the service will be 100% down between the broken update being fully applied and the rollback succeeding (if it does succeed).

____

Example

Let's assume we have a four-pod service based on nginx, with our website baked into the custom image used by the pods, represented here as "O"s

OOOO

We apply the update, and the deployment has a surge of two pods, so two additional pods come up -- we'll represent these as "n"s

OOOOnn

The two new pods come good -- represented as "N"s, and Kubernetes terminates two existing pods -- represented as "X"s. At this point, if the update was bad, we now have 50% of traffic going to the two old pods (and being successful), and 50% going to the new pods (and failing)

OOXXNN

Kubernetes continues on. The terminated pods disappear, and it spins up two new pods

OONNnn

The new pods come good and it terminates the final two old pods

XXNNNN

NNNN

We now have four new pods, but in this state, 100% of the traffic going through this service is failing. We run "helm test", which fails, and we roll back to the good state.
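The walkthrough above can be reproduced with a small simulation. This is a simplified model rather than what Kubernetes actually does internally: it assumes traffic is spread evenly across ready pods, that terminating pods receive no traffic, that every new pod is broken, and that the rollout proceeds in batches equal to the surge. The function name is made up for illustration.

```python
from typing import Iterator, Tuple


def rolling_update(replicas: int, surge: int) -> Iterator[Tuple[str, float]]:
    """Yield (pods, failing_share) for each step of rolling out a bad update.

    O = old ready pod, n = new pod starting,
    N = new ready pod, X = old pod terminating.
    """
    old, new = replicas, 0
    yield "O" * old, 0.0
    while old > 0:
        batch = min(surge, old)
        # Surge: extra new pods start alongside the existing ones.
        yield "O" * old + "N" * new + "n" * batch, new / (old + new)
        # The new pods become ready; as many old pods are terminated.
        old -= batch
        new += batch
        yield "O" * old + "X" * batch + "N" * new, new / (old + new)
    # Terminated pods are gone; only new (broken) pods remain.
    yield "N" * new, 1.0


if __name__ == "__main__":
    for pods, failing in rolling_update(replicas=4, surge=2):
        print(f"{pods:<8} {failing:.0%} of traffic failing")
```

With four replicas and a surge of two, the states match the diagrams above, and the failing share climbs from 0% to 50% to 100% before "helm test" ever gets a chance to run.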

____

Proposal

The proposal is to allow (opt-in) the tests to be run earlier, preferably during the pod rollout, so that if a test fails, the rollout is cancelled while it is still in progress, reducing the downtime impact and minimising the rollback time. Using the example above, the flow would look like this:

4 pods as before:

OOOO

2 new pods are created

OOOOnn

Those two pods come good, and Kubernetes terminates two existing pods

OOXXNN

Helm detects the new pods are now good, and runs the tests on one of them (or both, if the "--parallel" switch is used) -- the pod being tested is marked with T

OOTN

The test fails on the pod, so Helm cancels the rollout and triggers a rollback of the changes.

At this precise moment in time, we have two old pods and two new pods, so traffic is split 50/50 between them. So instead of your service being broken for 100% of traffic, your service is broken for 50%, and the rollback only requires the creation of two pods (with the old configuration), allowing a faster rollback and reduced downtime.

If we had 100 replicas instead of four, we could find a broken update 2% into the update (assuming a surge of 2 pods), rather than 100% into the update, and save the time of having to roll back 100 pods.
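To put numbers on this, here is a rough sketch of the proposed flow under the same simplified model as the example (even traffic across ready pods, terminating pods receive none). It assumes, purely for illustration, that a single newly ready pod is tested once, right after the first batch; the proposal itself does not pin that down. "fail_fast_rollout" and "test_passes" are hypothetical names, not existing Helm functionality.

```python
from typing import Callable, List, Tuple


def fail_fast_rollout(replicas: int, surge: int,
                      test_passes: Callable[[], bool]
                      ) -> Tuple[List[str], float, int]:
    """Return (states, failing_share, pods_to_restore) for a fail-fast rollout.

    O = old ready pod, n = new pod starting, N = new ready pod,
    X = old pod terminating, T = new pod under test.
    Expects replicas >= 1 and surge >= 1.
    """
    batch = min(surge, replicas)
    old = replicas - batch
    states = [
        "O" * replicas,
        "O" * replicas + "n" * batch,
        "O" * old + "X" * batch + "N" * batch,
        # Assumption: one newly ready pod is tested after the first batch.
        "O" * old + "T" + "N" * (batch - 1),
    ]
    if test_passes():
        # Nothing to undo; the rollout would simply carry on.
        return states, 0.0, 0
    # Test failed: cancel the rollout. Only the first batch is affected.
    return states, batch / replicas, batch


if __name__ == "__main__":
    for replicas in (4, 100):
        _, failing, restore = fail_fast_rollout(replicas, 2, lambda: False)
        print(f"{replicas} replicas: {failing:.0%} failing, "
              f"{restore} pods to restore")
```

For four replicas this gives the 50% figure from the walkthrough; for 100 replicas it gives 2%, with two pods to restore instead of 100.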

____


Happy to respond to any questions or suggestions on this proposal.

Regards

Johnny Ooi