Helm Feature Proposal: Fail-Faster Testing

黃健毅 <johnny.jy.ooi@...>

Good morning/afternoon/evening (delete as applicable)

I've been directed to this mailing list by Martin Hickey and Matt Fisher regarding a proposal I briefly mentioned at https://github.com/helm/helm/issues/8861

I will go into more detail in this email.



When deploying a Helm chart, you normally need to wait for Helm to deploy the resources, then use something like "kubectl rollout status deployments/{deploymentname}" to wait for the rollout to finish before running "helm test" on the chart. If you ran the Helm test before the rollout had completed, and your test accesses the application via the cluster service (ClusterIP/NodePort/LoadBalancer), the test requests would be split between the old and new versions of the chart.

And herein lies the problem. If you have a chart with characteristics such as:

1) a large number of replicas
2) a slow readiness process (e.g. due to cluster-like syncing behaviour such as Elasticsearch)

then the rollout will take a long time, and during that time you can't run "helm test" because of the "split brain" situation and the potentially inconsistent results.

If you wait until the rollout finishes, then run the test, and find the test has failed (e.g. because the update is completely broken), you then need to roll back the deployment. The rollback takes the same amount of time to roll the pods back to the previous version as the update took to roll them forward. And when you have spent 30 minutes updating the pods, the last thing you want is to have to do it all again.

Meanwhile, from the end user/customer perspective, the service will be 100% down between the broken update being applied and the rollback succeeding (if it does succeed).
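For reference, the current sequence described above can be sketched as a small Python wrapper around the CLI. This is only an illustration under assumptions: the function names, the release/chart/deployment arguments and the 30-minute timeout are hypothetical, and it assumes Helm 3, where "helm rollback" without a revision number returns to the previous release.

```python
import subprocess


def current_workflow_commands(release: str, chart: str, deployment: str,
                              timeout: str = "30m") -> dict:
    """Build the CLI calls for the upgrade / wait / test / rollback sequence."""
    return {
        "upgrade": ["helm", "upgrade", release, chart],
        # Blocks until the Deployment rollout finishes (or the timeout expires).
        "wait": ["kubectl", "rollout", "status",
                 f"deployment/{deployment}", f"--timeout={timeout}"],
        "test": ["helm", "test", release],
        # Assumption: Helm 3, no revision given, so roll back to the previous release.
        "rollback": ["helm", "rollback", release],
    }


def run_current_workflow(release: str, chart: str, deployment: str) -> bool:
    """Run the sequence; return True if the tests passed, False if we rolled back."""
    cmds = current_workflow_commands(release, chart, deployment)
    subprocess.run(cmds["upgrade"], check=True)
    subprocess.run(cmds["wait"], check=True)  # the long wait happens here
    if subprocess.run(cmds["test"]).returncode != 0:
        # Tests only run after the full rollout, so every pod is already on the
        # new version and the rollback has to replace all of them again.
        subprocess.run(cmds["rollback"], check=True)
        return False
    return True
```

The point of the sketch is the ordering: the test step cannot start until the wait step has returned, so a failure is only discovered once the whole rollout has been paid for.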



Let's assume we have a four-pod service based on nginx, with our website baked into the custom image used by the pods, represented here as "O"s


We apply the update, and the deployment has a surge of two pods, so two additional pods come up -- we'll represent these as "n"s


The two new pods come good -- represented as "N"s, and Kubernetes terminates two existing pods -- represented as "X"s. At this point, if the update was bad, we now have 50% of traffic going to the two old pods (and being successful), and 50% going to the new pods (and failing)


Kubernetes continues on. The terminated pods disappear, and it spins up two new pods


The new pods come good and it terminates the final two old pods



We now have four new pods, but at this point, 100% of the traffic going through this service is failing. We run "helm test", which fails, and we roll back to the good state.
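The walk-through above can be replayed with a small simulation. This is a simplified model rather than how Kubernetes actually schedules a rolling update: it assumes each batch of surge pods becomes ready together and that the same number of old pods is terminated immediately afterwards. The function names are hypothetical; the letters are the legend used above (O = old pod, n = new pod starting, N = new pod ready, X = old pod terminating).

```python
def rollout_states(replicas: int, surge: int) -> list[str]:
    """Return one string per step of the simplified rolling update."""
    old, new = replicas, 0
    states = ["O" * old]
    while old > 0:
        batch = min(surge, old)
        # Surge pods have been created but are not ready yet.
        states.append("O" * old + "N" * new + "n" * batch)
        # Surge pods become ready; the same number of old pods is terminated.
        old -= batch
        new += batch
        states.append("X" * batch + "O" * old + "N" * new)
    # The last terminated pods disappear, leaving only new pods.
    states.append("N" * new)
    return states


def share_on_new(state: str) -> float:
    """Fraction of ready pods (O or N) that run the new version."""
    ready_old, ready_new = state.count("O"), state.count("N")
    return ready_new / (ready_old + ready_new)


for state in rollout_states(replicas=4, surge=2):
    print(f"{state:<8} {share_on_new(state):.0%} of traffic on new pods")
# OOOO     0% of traffic on new pods
# OOOOnn   0% of traffic on new pods
# XXOONN   50% of traffic on new pods
# OONNnn   50% of traffic on new pods
# XXNNNN   100% of traffic on new pods
# NNNN     100% of traffic on new pods
```

If the new version is broken, the last column is the share of requests that fail at each step, assuming traffic is spread evenly across ready pods.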



This proposal is to allow the tests (opt-in) to run earlier, preferably during the pod rollout, so that if a test fails, the rollout is cancelled while it is still in progress, reducing the downtime impact and minimising the rollback time. Using the example above, the flow would look like this:

4 pods as before:


2 new pods are created


Those two pods come good, and Kubernetes terminates two existing pods


Helm detects that the new pods are now good, and runs the tests against one of them (or both, if the "--parallel" switch is used) -- the pod being tested is marked with "T"


The test fails on the pod, so Helm cancels the rollout and triggers a rollback of the changes.

At this precise moment in time, we have two old pods and two new pods, so traffic is split 50/50 between them. So instead of your service being broken for 100% of traffic, it is broken for 50%, and the rollback only requires the creation of two pods (with the old configuration), allowing a faster rollback and reduced downtime.

If we had 100 replicas instead of four, we could find a broken update 2% into the update (assuming a surge of 2 pods), rather than 100% into the update, and save the time of having to roll back 100 pods.
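The numbers in both examples can be checked with a short sketch of the proposed flow. It rests on assumptions that are mine, not part of Helm: the function name and result keys are hypothetical, the test is modelled as a callback that runs once as soon as the first batch of new pods is ready, and traffic is assumed to be spread evenly across ready pods.

```python
from typing import Callable


def fail_fast_rollout(replicas: int, surge: int,
                      test_passes: Callable[[], bool]) -> dict:
    """Model a rollout that is tested as soon as the first batch is ready."""
    # First batch: `surge` new pods are ready, the same number of old pods is gone.
    new_ready = min(surge, replicas)
    old_ready = replicas - new_ready
    if test_passes():
        # In this model a passing test lets the rollout run to completion.
        return {"completed": True, "old_ready": 0, "new_ready": replicas,
                "share_on_new": 1.0, "pods_to_restore": 0}
    # Test failed: stop here; only the first batch has to be replaced.
    return {"completed": False, "old_ready": old_ready, "new_ready": new_ready,
            "share_on_new": new_ready / replicas,
            "pods_to_restore": new_ready}


for replicas in (4, 100):
    result = fail_fast_rollout(replicas, surge=2, test_passes=lambda: False)
    print(replicas, f"{result['share_on_new']:.0%}", result["pods_to_restore"])
# 4 50% 2
# 100 2% 2
```

Under this model the number of pods to restore after a failed test depends only on the surge size, not on the replica count, which is where the saving for large deployments comes from.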


Happy to respond to any questions or suggestions on this proposal.


Johnny Ooi