
Davanum Srinivas
Marek, Sahdev, It has been a few months since this email about lack of enough hands to help with etcd. Has the situation improved at all?
Paris, RichiH,
Any feedback you are hearing with your Dev Rep hats on?
ChrisA, Looks like this came up in both TOC and GB meetings, but we have feedback that the folks working hard on etcd are not really seeing changes in their day-to-day work. Anything we can do from the CNCF side to help?
Do we all want to meet on a call? I can offer up an upcoming TOC call to talk about this. Please don't wait for the call to discuss this; feel free to send your thoughts/ideas/status here on this thread.
thanks, Dims
On Mon, Mar 7, 2022 at 1:11 PM 'Marek Siarkowicz' via steering < steering@...> wrote:
We (@serathius, @ptabor) are reaching out to the K8s steering committee to bring to their attention recent changes in, and the current state of, the etcd community.
In the last few months, primary maintainers Gyuho Lee (@gyuho, Amazon, announcement) and Sam Batschelet (@hexfusion, Red Hat) have stopped actively participating in the project. This leaves the project with only one active and two occasionally-reviewing maintainers: Marek Siarkowicz (@serathius, Google) and Piotr Tabor (@ptabor, Google), both relatively new to the project (1 month and 1 year of tenure), and Sahdev P Zala (spzala@, IBM). Other maintainers are either dormant or have had very minimal activity over the last six months. The project is effectively unmaintained.
This lack of maintainers is impacting the community:
* Cannot make important project decisions (like conflict resolution) based on governance, as it requires a supermajority of maintainers to agree. This has an especially bad impact on the design process, where major proposals don’t get enough feedback and scrutiny.
* Due to lack of maintainer activity, we cannot introduce a proper approval process, resulting in important features getting reviews from only one maintainer. For example, #13168 was reviewed by only @ptabor (relatively new maintainer) and @lilic (reviewer, no longer active in the project).
* Unable to reliably triage issues and release bug fixes. Fixes for critical bugs can take months to be released, causing users to lose trust and not adopt new releases. For example, v3.5 was released with multiple critical bugs (#13196, #13192) and it took the community over a quarter to release fixes, making it unusable in production. As of v1.23.3, Kubernetes still recommends the mostly broken etcd version v3.5.0 (#106589).
* Slowed or blocked contributions. In theory all changes should be reviewed by 2 maintainers before submitting. A second viewpoint is especially important for etcd, to ensure security and correctness of changes, as they can be difficult to verify. We have been forced to break this rule and rely on lazy consensus, making the whole process error-prone. In case of a mistake we are only able to verify changes via prod releases (which are 2 years apart). There is no healthy feedback loop due to maintainers changing too frequently.
Etcd is a critical dependency of Kubernetes. If the situation in etcd doesn’t improve it will create a significant risk for the future of the K8s project. This may impede improvements in K8s reliability or other areas that require changes on the etcd side. It may also lead to a situation where a severe etcd bug, like data corruption, gets detected after it’s already present in tens or hundreds of thousands of Kubernetes clusters around the globe. This could irreparably break users' trust in Kubernetes.
We're hoping that by bringing this to attention we can start discussing and planning proper steps to mitigate the issue.
Thanks,
Marek
--
You received this message because you are subscribed to the Google Groups "steering" group.
To unsubscribe from this group and stop receiving emails from it, send an email to steering+unsubscribe@....
To view this discussion on the web visit https://groups.google.com/a/kubernetes.io/d/msgid/steering/CAJs3Yt1%3DvTgMAMvY6Lk%3D5L3X7fhg9FV%2BHKMCb4Et-AX-TNWf%3DA%40mail.gmail.com.

Alex Chircop
Marek Siarkowicz <siarkowicz@...>
On Mon, Jul 18, 2022 at 7:52 PM Davanum Srinivas < davanum@...> wrote: Marek, Sahdev, It has been a few months since this email about lack of enough hands to help with etcd. Has the situation improved at all?
Benjamin Wang from VMware became a new maintainer; however, our capacity didn't grow much as Piotr Tabor is on long holidays (till October). Not sure if he still plans to continue to be active afterwards, as he no longer works on etcd at Google. There was some progress on issues mentioned in the original email; however, the underlying issues were not addressed:
* We patched the etcd governance; it now officially supports lazy consensus after 2 weeks. However, we are still struggling with restoring unwritten knowledge that was lost with previous maintainers. For example, interaction with CNCF. We just discovered that the only remaining active etcd maintainers were unaware of and didn't have access to the CNCF helpdesk. https://github.com/cncf/foundation/pull/387 <- still waiting
* In the last couple of months we discovered and fixed a data inconsistency issue that was hiding within untriaged issues (postmortem). However, it took us over a year to fix, the number of new untriaged issues isn't going down (#14138), we get new reports about critical issues (#14211, #14143, #14098), and we are still unable to qualify the latest release. In my opinion there is still a significant risk of undiscovered issues present in the v3.5 release.
* With Benjamin joining we just managed to fill all the release manager positions (#13912). We have enough capacity to review and merge bug fixes. However, we still don't review non-bugfix changes. As this creates a bad experience for new contributors who are unaware of this policy, I'm planning to make it official that etcd doesn't accept new features until we are happy with reliability and qualification.
cc +paris.pittman@...
I think that it's more than worthwhile to have a senior+ community engineer onboard to the etcd maintainer crew and, frankly, I think all graduated projects should have this kind of support for someone part time (50%) or full time (see: /issues/43). Are there any orgs that could step up and provide this support now? Would the etcd maintainer folks welcome this?
The communities are too large and the maintainer burden is too high to do the necessary work to build and then maintain the community. Things that a senior+ community engineer could help the etcd crew with:
- video tutorials for reviewing code/advanced contributing/maintainer training
- run your community meetings
- AMA sessions for new contributors and those interested in maintaining
- outreach for new maintainers/future maintainers
- help with continuity and institutional knowledge gathering for onboarding and offboarding maintainers
- detailed contributing and developer guides