News from Kubernetes leadership summit
CNCF community,

The Kubernetes project is at full tilt. Below is a summary of the recent Leadership Summit, a gathering of mostly the technical folk driving this project. Apologies if some hyperlinks are missing - please refer to Brian's post at https://groups.google.com/forum/#!topic/kubernetes-dev/PpgLgkffr3o

At the CNCF we want many such projects - all learning from each other. Help make that happen: as you can see, the project is breaking the bounds of even modern tools and structures. There are many opportunities to help - please speak up here, or contact the relevant project leads.

alexis

---------- Forwarded message ----------
From: 'Brian Grant' via Kubernetes developer/contributor discussion <kubernetes-dev@...>
Date: Mon, Jun 12, 2017 at 8:18 PM
Subject: Leadership summit summary and outcomes
To: "kubernetes-dev@..." <kubernetes-dev@...>

A group of O(100) of us met on Friday, June 2, at Samsung in San Jose. We're working on getting notes from the meeting checked into GitHub. In the meantime, I thought I'd give a summary. Others who attended are welcome to follow up with their takeaways.

Tim (@thockin) presented an overview of the state of the project. After covering the great progress we've made, we talked about having reached an inflection point in the project. There was broad consensus among those present that the project needs to increase focus on:
- Finishing features/APIs, especially table-stakes ones such as Deployment, Ingress, RBAC, and encrypted Secrets (as opposed to adding net-new concepts)
- Architectural layering, modularity, and project boundaries
- Stability, predictability, fixing bugs, and paying down technical debt
- Easier "on ramps": user docs, examples, best practices, installers, tools, status, debugging
- Contributor experience, tooling, and testing
- Governance
- Conformance

We discussed the need to refresh the roadmap assembled in November, which was presented by Aparna and Ihor, along with some interesting data, such as penetration of container orchestrators (~7%) and which SIGs have the most open issues (Node and API Machinery).

Brandon (@philips) and I presented more of the motivation for the Governance proposal, and solicited nominations for the Steering Committee. Please, please, please do comment on the governance proposal, even if just to LGTM, and seriously consider running for the Steering Committee. We asked SIGs to start working on their charters. I also spoke about the role of the CNCF with respect to the project.

I presented my architecture roadmap proposal, and received positive feedback. It put the extension mechanisms underway, such as API aggregation, into context. One outcome was the mandate to form SIG Architecture. An Extensibility Working Group was also discussed, but perhaps the Architecture SIG could serve the role of driving the needed extension mechanisms forward.

The discussion about code organization mostly centered on the effort to scale the project to multiple GitHub repos and orgs. GitHub provides exactly two levels of hierarchy; we need to use both effectively. By multiple metrics, kubernetes/kubernetes is the most active repo on GitHub, yet all of GitHub's mechanisms (e.g., permissions, notifications, hooks) are designed to support small, focused repos. Every other project of comparable scale is comprised of many repos (e.g., Node.js has ~100 and Cloud Foundry has ~300). The kubernetes/kubernetes commit rate peaked in July 2015, when the project was 10x smaller, and most commits on the project are already outside kubernetes/kubernetes.
Additionally, there is a desire to at least start new functionality outside the main repo/release. Since Kubernetes is an API-centric system, and since we're using the API machinery for component configuration as well, the API machinery needs to be made available to other repos in order for any significant development to be feasible outside kubernetes/kubernetes. We're using Service Catalog (https://github.com/kubernetes-incubator/service-catalog) as a driving use case for this. We've also started serious work on moving out kubectl, which is at least as important symbolically as it is technically, and have stopped accepting new cloudprovider implementations.

The discussion about areas falling through the cracks focused on what to do about them. There was consensus that some SIG needs to own the build machinery. Proposals included SIG Release, SIG Testing, SIG Contributor Experience, and SIG Build (i.e., a new SIG). It was suggested that whatever SIG owns the build should also own utility libraries. In addition to strategies that have been discussed before (e.g., accountability metrics, leaderboard, help wanted / janitors, rotations), we discussed the idea of creating job descriptions for more project/SIG roles, as has been done for the release team, as a way to make it clearer to participating companies and individuals what areas need to be staffed.

I'm looking forward to the notes from the SIG breakout, which was at the same time as the "falling through the cracks" session. It sounds like there were good discussions about SIG leadership, organization, communication, consolidation, participation, and other topics. Similarly for the community effectiveness breakout, which discussed a number of topics, including how to convert passive attendees into active participants.

Look for the full summit notes over the next couple of weeks, as well as follow-up on action items during the community hangout. Thanks to Cameron for organizing the event, to everyone else who helped with the summit, to Samsung for hosting it, and to everyone who participated.

--Brian
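[Editor's note] Brian's point about making the API machinery consumable outside kubernetes/kubernetes is easiest to see in code. A minimal sketch, assuming today's k8s.io/apimachinery layout: an out-of-tree project defines its own API type with the shared metadata machinery and registers it in a runtime.Scheme. The "example.org" group, the Widget type, and the hand-written DeepCopyObject are invented for illustration; they are not Service Catalog's real code.

```go
// Rough sketch: what an out-of-tree project can do once k8s.io/apimachinery
// is importable on its own. All type and group names here are illustrative.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// Widget is a made-up API object; TypeMeta and ObjectMeta come from
// apimachinery, so it serializes and versions like any core Kubernetes type.
type Widget struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              WidgetSpec `json:"spec"`
}

type WidgetSpec struct {
	Replicas int32 `json:"replicas"`
}

// DeepCopyObject satisfies runtime.Object. It is normally code-generated;
// a shallow copy is enough for this sketch.
func (w *Widget) DeepCopyObject() runtime.Object {
	out := *w
	return &out
}

func main() {
	// Register the type under an illustrative group/version.
	gv := schema.GroupVersion{Group: "example.org", Version: "v1alpha1"}
	scheme := runtime.NewScheme()
	scheme.AddKnownTypes(gv, &Widget{})
	fmt.Println(scheme.Recognizes(gv.WithKind("Widget"))) // true
}
```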
IMPORTANT - CNCF TOC Goals and Operating Principles - v0.2
Broadening beyond the TOC to add the CNCF GB & Marketing.
CNCF community, PLEASE review this doc, whose purpose is to summarise the thinking of the TOC concerning project selection, governance, and other frequently requested topics: https://docs.google.com/document/d/1Yl3IPpZnEWJRaXSBsTQ22ymQF57N5x_nHVHvmJdAj9Y/edit
This is important - please do engage. Currently this document is a draft. Since the TOC operates by vote, these principles may in future become written precedent.
alexis
On Mon, May 15, 2017 at 4:43 PM, Alexis Richardson <alexis@...> wrote:
Hi
Out of a desire to start writing down more how CNCF works, and what our principles are, Brian, Ken and I pulled some ideas into a doc:
https://docs.google.com/document/d/1Yl3IPpZnEWJRaXSBsTQ22ymQF57N5x_nHVHvmJdAj9Y/edit Comments are solicited.
Please don't be too harsh - this is just the first iteration.
alexis
CNCF Storage WG 6/9/2017 Meeting Minutes

Chris Aniszczyk
Thanks, everyone, for showing up for the first meeting - it was great to have ~40 folks!
Here are the minutes: https://goo.gl/wRqerO
The next meeting will be in two weeks, June 23rd at 8am PT.
-- Chris Aniszczyk (@cra) | +1-512-961-6719
Re: Serverless Workgroup Kickoff
Kenneth Owens (kenowens) <kenowens@...>
No worries!
-------- Original message --------
From: Alexis Richardson <alexis@...>
Date: 6/9/17 12:52 AM (GMT-06:00)
To: "Kenneth Owens (kenowens)" <kenowens@...>, cncf-toc@...
Cc: cncf-wg-serverless <cncf-wg-serverless@...>
Subject: Re: [cncf-toc] Serverless Workgroup Kickoff
Thank you all for getting this off the ground.

Re: Serverless Workgroup Kickoff
Alexis Richardson <alexis@...>
Thank you all for getting this off the ground.

Serverless Workgroup Kickoff
Kenneth Owens (kenowens) <kenowens@...>
Just wanted to give everyone an update on the serverless workgroup. We had our initial meeting, agreed on the objective of documenting what Serverless means to Cloud Native, outlined a white/position paper, assigned ownership of the sections/topics, and defined a draft deliverable date of July 6th.
If you're interested in tracking the progress, joining, or getting involved, please check our GitHub page often, join the Google group, or contact me directly.
https://github.com/cncf/wg-serverless

Kenneth Owens
CTO
kenowens@...
Tel: +1 408 424 0872
Cisco Systems, Inc.
16401 Swingley Ridge Road, Suite 400
Chesterfield, MO 63017
United States
cisco.com
Bassam Tabbara <Bassam.Tabbara@...>
> I understand that part ;)
Sorry I was thrown off by your "what does it do differently than Amazon" :-)
> The part that I don't fully grok is what you mean by "storage". Forgive my naive questions.
I mean block storage (raw block devices underneath pods, comparable to AWS EBS), filesystems (shared POSIX-compliant file systems, comparable to AWS EFS), and object storage (comparable to AWS S3).
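[Editor's note] The three storage shapes Bassam distinguishes have quite different access patterns. A rough sketch of them as Go interfaces, purely illustrative (these signatures are invented for this note, not Rook's API):

```go
// Three storage shapes, expressed as minimal Go interfaces.
package storage

import "io"

// BlockDevice: raw block storage underneath a pod (the EBS-like case).
// Access is by byte offset into a fixed-size device.
type BlockDevice interface {
	ReadAt(p []byte, off int64) (int, error)
	WriteAt(p []byte, off int64) (int, error)
	Size() int64
}

// FileSystem: shared POSIX-style file access (the EFS-like case).
// Access is by path, with directories and byte-range file I/O.
type FileSystem interface {
	Open(path string) (io.ReadWriteCloser, error)
	Mkdir(path string) error
	Remove(path string) error
}

// ObjectStore: flat keyed blobs with whole-object puts and gets
// (the S3-like case).
type ObjectStore interface {
	Put(bucket, key string, body io.Reader) error
	Get(bucket, key string) (io.ReadCloser, error)
	Delete(bucket, key string) error
}
```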
Bassam
I understand that part ;)
The part that I don't fully grok is what you mean by "storage". Forgive my naive questions.
Alexis
Camille, yes! :-)
The premise of Rook is to provide a cloud-native alternative to Amazon S3, EBS and EFS. These storage services can run directly inside the cloud-native environment vs. vendor-specific implementations outside it. By running them inside the cluster we can integrate them more deeply (scheduling, security, management, etc.) and enable multi-cloud and federation.
This is similar to the relationship between Istio/Envoy and Amazon ELB, Prometheus and Amazon CloudWatch, etc.
Alexis, I know you were having audio issues, I’d be happy to repeat the talk this morning if you’d like.
On Jun 6, 2017, at 3:01 PM, Camille Fournier < skamille@...> wrote:
Presumably providing an abstraction layer that can front Amazon, Google, and internal cloud offerings. For those of us who need such a thing.
Re: Continued InfraKit Discussion
David Chung <david.chung@...>
Hi Chris,
Thanks for starting the thread. In a nutshell, InfraKit focuses on being lightweight and embeddable while enabling self-management and capacity scaling features for the entire, higher-level system. There are also differences in terms of state management and deployment when compared to these larger, established tools. Brief summaries of how InfraKit compares to Terraform, BOSH and Digital Rebar are included here. Feedback, comments and questions are most welcome.
Thanks, David
InfraKit vs Terraform
Terraform and InfraKit share the idea of declarative infrastructure. Terraform can provision infrastructure resources and reconcile differences between infrastructure and user specification. InfraKit builds on the same concepts and adds continuously running microservices, or controllers, so that reconciliation is continuous rather than on-demand. InfraKit also provides features such as node scaling groups, so that cluster capacity scaling can be triggered directly from higher-level orchestration systems. In this use case, InfraKit and Terraform can be complementary, with Terraform acting as the "execution backend" for InfraKit. As an example, there is a Terraform plugin in the project which can be leveraged to provision resource types not yet natively supported by InfraKit.
Terraform and InfraKit differ in state management and deployment. InfraKit can leverage the system it is embedded into, such as Docker Swarm or etcd, for leader election and persistence of user specs. Unlike Terraform, which has its own schema and different backend plugins for storing infrastructure state, InfraKit derives the infrastructure state via introspection and queries supported by the lower-level platform (tags and metadata). The infrastructure itself is the system of record, rather than a representation of it that can be corrupted or go out of sync. InfraKit can be part of the control plane of a higher-level orchestration system without additional resource requirements. In this use case, InfraKit also readily integrates with the higher-level system as service containers, while Terraform is primarily a CLI tool for ops and is not meant to be run as a service for higher-level systems. InfraKit is designed as an active system that continuously monitors your infrastructure and reconciles its reality with a specification provided either by a user or a higher-level orchestration system, taking an immutable-infrastructure approach, while Terraform is designed as a tool to automate existing ops workflows using a declarative description of infrastructure, without the active monitoring and reconciliation.
InfraKit vs BOSH
BOSH is much more vertically integrated than InfraKit in that it is itself a full-fledged distributed application with its own datastore (Postgres) and server components. In terms of workflow, BOSH also covers areas outside of infrastructure provisioning and management: it has opinionated workflows for application definition and software releases, which reach into the upper layers of the CNCF model. The full system serves as part of the control plane of a much larger platform and thus requires its own provisioning and management. InfraKit takes a lighter-weight approach. As a set of pluggable microservices, it is meant to be embedded into a larger system to minimize additional operational burden. With InfraKit embedded in the same control plane, these systems can self-install and self-manage. InfraKit also does not require a database such as Postgres to store state. Instead, InfraKit inspects the infrastructure and reconstructs what is in the environment for reconciliation with the user specification. This means the only thing that can diverge from the user spec is the true infrastructure state; if divergence is detected, InfraKit will attempt to correct it.
InfraKit can also play the role of a cloud provider for a node-group autoscaler in a container orchestrator like Kubernetes or Docker Swarm, bringing the elasticity common in the public cloud to on-prem bare-metal or virtualized environments.
InfraKit vs Digital Rebar
Digital Rebar compares similarly to RackHD or MaaS. Currently there is community integration with RackHD, and there is a MaaS plugin in the InfraKit repo. At a high level, Digital Rebar and InfraKit are complementary in that InfraKit can leverage Digital Rebar for bare-metal provisioning via its PXE/DHCP/TFTP infrastructure, while InfraKit provides orchestration capabilities such as scaling clusters up and down and rolling updates. Digital Rebar is itself a full-fledged distributed application of many components, including a Postgres database where it stores infrastructure state. In terms of deployment, it has its own control plane that needs to be provisioned and maintained. InfraKit can be embedded inside the control plane of a higher-level orchestration system such as Kubernetes or Docker Swarm. InfraKit leverages these systems for persistence of the user's infrastructure specification and for leader election (to provide high availability), and infrastructure state is reconstructed by introspecting the infrastructure. This means that InfraKit is more a "kit" than another platform: higher-level systems can incorporate InfraKit to provide self-managing and self-provisioning capabilities.
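[Editor's note] The "continuously running controller" contrast with one-shot tools is the crux of the Terraform comparison above. A toy version of that loop, with every name invented for illustration (this is not InfraKit's actual plugin API, which splits responsibilities across instance/flavor/group plugins):

```go
// A toy convergence loop: observe actual state, diff against the declared
// spec, and converge, forever. Unlike a one-shot CLI run, drift is corrected
// whenever it appears.
package main

import (
	"fmt"
	"time"
)

type Spec struct{ Instances int }

type Infra interface {
	Count() int // introspect actual state (the infrastructure is the record)
	Provision() // add one instance
	Destroy()   // remove one instance
}

// reconcile runs one convergence step: actual -> desired.
func reconcile(infra Infra, spec Spec) {
	switch actual := infra.Count(); {
	case actual < spec.Instances:
		infra.Provision()
	case actual > spec.Instances:
		infra.Destroy()
	}
}

type fakeInfra struct{ n int }

func (f *fakeInfra) Count() int { return f.n }
func (f *fakeInfra) Provision() { f.n++; fmt.Println("provisioned, now", f.n) }
func (f *fakeInfra) Destroy()   { f.n--; fmt.Println("destroyed, now", f.n) }

func main() {
	infra := &fakeInfra{n: 1}
	spec := Spec{Instances: 3}
	for i := 0; i < 5; i++ {
		reconcile(infra, spec) // converges 1 -> 3, then holds steady
		time.Sleep(10 * time.Millisecond)
	}
}
```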
On Tue, Jun 6, 2017 at 8:41 AM, Chris Aniszczyk <caniszczyk@...> wrote: From today's CNCF TOC call, there was some discussion on how InfraKit compares to Terraform, BOSH and Digital Rebar. Thanks again to David for taking the time to present.
Let's use this thread to have that discussion.
Camille Fournier <skamille@...>
Presumably providing an abstraction layer that can front Amazon, Google, and internal cloud offerings. For those of us who need such a thing.
On Jun 6, 2017 5:55 PM, "Alexis Richardson via cncf-toc" <cncf-toc@...> wrote:
I don't understand what you are doing which is better than what Amazon can do alone.
Do you mean EBS? For S3, which has 11 9's of durability (over a year), you'd need tens of Rook storage nodes spread across three availability zones.
For block storage, i.e. the use case in the original thread, you can make do with a smaller number of nodes. You can run with two, but I would recommend deploying more.
On Jun 6, 2017, at 2:43 PM, Alexis Richardson < alexis@...> wrote:
So you need a pair of rook nodes to be more reliable than S3?
Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.
However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.
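[Editor's note] A back-of-envelope on why durability scales with replication across independent failure domains. The numbers below are assumed for illustration, not figures from the thread:

```latex
% Assume f-way replication across independent failure domains, and let p be
% the probability that one replica is lost within a single repair window.
% Data is lost only if all f replicas fail in the same window:
\[
  P(\text{loss per window}) \approx p^{f},
  \qquad\text{e.g. } p = 10^{-3},\ f = 3
  \;\Rightarrow\; P(\text{loss}) \approx 10^{-9}.
\]
% Reaching S3-like "eleven nines" over a year therefore takes both a high
% replication factor and enough nodes and zones that failures stay
% independent and repairs stay fast.
```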
On Jun 6, 2017, at 2:22 PM, Alexis Richardson < alexis@...> wrote:
Ok so in the single node case you are less reliable than S3?
No. There are no writes to S3 at all; that would slow down writes significantly.
If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.
On Jun 6, 2017, at 2:13 PM, Alexis Richardson < alexis@...> wrote:
Does the first storage node write to S3 before or after pushing updates to the other nodes
Sure. Assuming a pod running in K8S has already mounted a Rook volume, the main flow is:
- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write, the primary storage node completes the write operation.
Note there is no distinction between local and remote writes in this flow; they all result in network traffic due to replication. It's possible that one of the storage nodes is colocated with the client initiating the write, but that's inconsequential.
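[Editor's note] A hedged Go sketch of just the fan-out-and-ack step in that flow. It is greatly simplified (no placement, journaling, or failure handling), and none of these names are Rook's:

```go
// Minimal primary/replica write path: the primary fans a write out to all
// replicas in parallel and acknowledges only after every replica acks.
package main

import (
	"fmt"
	"sync"
)

type Replica interface {
	Write(off int64, data []byte) error
}

// memReplica is a stand-in replica backed by a map.
type memReplica struct {
	mu    sync.Mutex
	store map[int64][]byte
}

func (r *memReplica) Write(off int64, data []byte) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.store[off] = append([]byte(nil), data...)
	return nil
}

// primaryWrite mirrors "once the other replicas acknowledge the write, the
// primary storage node completes the write operation".
func primaryWrite(peers []Replica, off int64, data []byte) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(peers))
	for _, p := range peers {
		wg.Add(1)
		go func(p Replica) {
			defer wg.Done()
			if err := p.Write(off, data); err != nil {
				errs <- err
			}
		}(p)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		return err // any failed ack fails the write
	}
	return nil
}

func main() {
	peers := []Replica{
		&memReplica{store: map[int64][]byte{}},
		&memReplica{store: map[int64][]byte{}},
		&memReplica{store: map[int64][]byte{}},
	}
	fmt.Println(primaryWrite(peers, 0, []byte("hello"))) // <nil>
}
```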
On Jun 6, 2017, at 1:58 PM, Alexis Richardson < alexis@...> wrote:
Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:
- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably.
- A pool can be configured to store replicas of data (typically 3x) or erasure-coded chunks (different algorithms and factors are supported).
- When a storage node is lost, the others help re-replicate the data (i.e., data maintenance).
Now the failure cases:
- if too many nodes in the same pool are lost within a short window of time, you'll suffer data loss; for example, all three nodes holding a 3x replica are lost at the same time.
- if there are not enough resources to replicate/regenerate the data before more losses occur.
To guard against such failures, most systems (including Rook) do the following:
- spread storage nodes across failure domains (different hosts, racks, zones, etc.)
- prioritize resources for "data maintenance" over resources used for "normal" data operations.
In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones, and that you've sized the machines and network to support data maintenance. Prioritization schemes also help; for example, SR-IOV is a popular way to do so without massive changes to the network.
Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.
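[Editor's note] The "spread across failure domains" guard is easy to sketch. An illustrative Go placement helper, with invented names; real systems (Ceph's CRUSH, the Kubernetes scheduler) use much richer rules:

```go
// placeReplicas picks nodes so that no two replicas share a failure domain.
package main

import "fmt"

type Node struct {
	Name string
	Zone string // failure domain label, e.g. an AWS availability zone
}

// placeReplicas returns up to n nodes, at most one per failure domain.
func placeReplicas(nodes []Node, n int) []Node {
	seen := map[string]bool{}
	var out []Node
	for _, node := range nodes {
		if len(out) == n {
			break
		}
		if !seen[node.Zone] {
			seen[node.Zone] = true
			out = append(out, node)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{"a1", "us-east-1a"}, {"a2", "us-east-1a"},
		{"b1", "us-east-1b"}, {"c1", "us-east-1c"},
	}
	for _, n := range placeReplicas(nodes, 3) {
		fmt.Println(n.Name, n.Zone) // a1, b1, c1: one per zone
	}
}
```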
> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / attach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in minutes vs. up to an hour with EBS.
>>
>> Also EBS volumes only support a single writer (ReadWriteOnce in K8S), which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySQL. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany to support a hotter failover where the
>> app/service on top can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs to running on top of
>> ephemeral/instance storage vs. EBS. For example, a lot of the instance
>> storage is unused in most deployments, and it offers high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1] https://gitter.im/rook/rook?at=58baff6f872fc8ce62b6ee26
>> [2] https://github.com/kubernetes/kubernetes/pull/46843
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this. I missed the details of that. Can
>> you say more please?
>>
>> alexis
>>
>>
|
|
I don't understand what you are doing which is better than what Amazon can do alone.
toggle quoted messageShow quoted text
Do you mean EBS? For S3 which has 11 9’s (over a year) of durability you’d need 10s of Rook storage nodes spread across three availability zones.
For block storage, i.e. the use case in the original thread, you can make do with a smaller number of nodes. You can run with two, but I would recommend you deploy more.
On Jun 6, 2017, at 2:43 PM, Alexis Richardson < alexis@...> wrote:
So you need a pair of rook nodes to be more reliable than S3?
Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.
However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.
On Jun 6, 2017, at 2:22 PM, Alexis Richardson < alexis@...> wrote:
Ok so in the single node case you are less reliable than S3?
No. There are no writes to S3 at all, and that would slow down write significantly.
If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.
On Jun 6, 2017, at 2:13 PM, Alexis Richardson < alexis@...> wrote:
Does the first storage node write to S3 before or after pushing updates to the other nodes
Sure. Assuming a pod running in K8S has already mounted a Rook volume the main flow is:
- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write the primary storage node completes the write operation.
Note there is no distinction between local and remote writes in this flow, they all result in network traffic due to replication. Its possible that one of the storage nodes is colocated with the client initiating the write, but thats inconsequential.
On Jun 6, 2017, at 1:58 PM, Alexis Richardson < alexis@...> wrote:
Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:
- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably
- A pool can be configured to store replicas of data (typically 3x) or erasure coded chunks (different algorithms and factors are supported).
- when a storage node is lost the others help re-replicate the data (i.e. data maintenance).
Now the failure cases:
- if too many nodes in the same pool are lost within a short window of time you’ll suffer data loss, for example, all three nodes in a 3x replica are lost at the same time.
- if there are not enough resources to replicate/regenerate the data before more losses occur.
To guard against such failures, most systems (including Rook) do the following:
- storage nodes are spread across failure domains (different hosts, racks, zones etc.)
- prioritize resources for “data maintenance” over resources used for “normal" data operations.
In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones. Also ensuring that you’ve sized the machines and network to support data maintenance. Prioritization
schemes also help, for example, SRV-IO is a popular way to do so without massive changes to the network.
Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.
> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on-top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / detach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in less than minutes vs. up to an hour with EBS.
>>
>> Also EBS volumes only support a single writer (ReadWriteOnce in K8S) which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySql. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany to support a hotter failover where the
>> app/service ontop can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on-top of
>> ephemeral/instance storage vs. EBS. For example, a lot of the instance
>> storage is unused in most deployments and has a high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1]
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitter.im%2Frook%2Frook%3Fat%3D58baff6f872fc8ce62b6ee26&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=8dSWMXRMqtmPH5goZx4O%2BpVTesQEuS4cb21qgJmmTw0%3D&reserved=0
>> [2]
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fpull%2F46843&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=UGWvqRpP8P0sanBnGygcfwIYiU7tvKobJ7s8JtiWlFw%3D&reserved=0
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this. I missed the details of that. Can
>> you say more please?
>>
>> alexis
>>
>>
|
|
Bassam Tabbara <Bassam.Tabbara@...>
Do you mean EBS? For S3 which has 11 9’s (over a year) of durability you’d need 10s of Rook storage nodes spread across three availability zones.
For block storage, i.e. the use case in the original thread, you can make do with a smaller number of nodes. You can run with two, but I would recommend you deploy more.
toggle quoted messageShow quoted text
On Jun 6, 2017, at 2:43 PM, Alexis Richardson < alexis@...> wrote:
So you need a pair of rook nodes to be more reliable than S3?
Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.
However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.
On Jun 6, 2017, at 2:22 PM, Alexis Richardson < alexis@...> wrote:
Ok so in the single node case you are less reliable than S3?
No. There are no writes to S3 at all, and that would slow down write significantly.
If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.
On Jun 6, 2017, at 2:13 PM, Alexis Richardson < alexis@...> wrote:
Does the first storage node write to S3 before or after pushing updates to the other nodes
Sure. Assuming a pod running in K8S has already mounted a Rook volume the main flow is:
- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write the primary storage node completes the write operation.
Note there is no distinction between local and remote writes in this flow, they all result in network traffic due to replication. Its possible that one of the storage nodes is colocated with the client initiating the write, but thats inconsequential.
On Jun 6, 2017, at 1:58 PM, Alexis Richardson < alexis@...> wrote:
Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:
- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably
- A pool can be configured to store replicas of data (typically 3x) or erasure coded chunks (different algorithms and factors are supported).
- when a storage node is lost the others help re-replicate the data (i.e. data maintenance).
Now the failure cases:
- if too many nodes in the same pool are lost within a short window of time you’ll suffer data loss, for example, all three nodes in a 3x replica are lost at the same time.
- if there are not enough resources to replicate/regenerate the data before more losses occur.
To guard against such failures, most systems (including Rook) do the following:
- storage nodes are spread across failure domains (different hosts, racks, zones etc.)
- prioritize resources for “data maintenance” over resources used for “normal" data operations.
In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones. Also ensuring that you’ve sized the machines and network to support data maintenance. Prioritization
schemes also help, for example, SRV-IO is a popular way to do so without massive changes to the network.
Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.
> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on-top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / detach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in less than minutes vs. up to an hour with EBS.
>>
>> Also EBS volumes only support a single writer (ReadWriteOnce in K8S) which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySql. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany to support a hotter failover where the
>> app/service ontop can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on-top of
>> ephemeral/instance storage vs. EBS. For example, a lot of the instance
>> storage is unused in most deployments and has a high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1]
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitter.im%2Frook%2Frook%3Fat%3D58baff6f872fc8ce62b6ee26&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=8dSWMXRMqtmPH5goZx4O%2BpVTesQEuS4cb21qgJmmTw0%3D&reserved=0
>> [2]
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fpull%2F46843&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=UGWvqRpP8P0sanBnGygcfwIYiU7tvKobJ7s8JtiWlFw%3D&reserved=0
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this. I missed the details of that. Can
>> you say more please?
>>
>> alexis
>>
>>
|
|
So you need a pair of rook nodes to be more reliable than S3?
toggle quoted messageShow quoted text
Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.
However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.
On Jun 6, 2017, at 2:22 PM, Alexis Richardson < alexis@...> wrote:
Ok so in the single node case you are less reliable than S3?
No. There are no writes to S3 at all, and that would slow down write significantly.
If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.
On Jun 6, 2017, at 2:13 PM, Alexis Richardson < alexis@...> wrote:
Does the first storage node write to S3 before or after pushing updates to the other nodes
Sure. Assuming a pod running in K8S has already mounted a Rook volume the main flow is:
- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write the primary storage node completes the write operation.
Note there is no distinction between local and remote writes in this flow, they all result in network traffic due to replication. Its possible that one of the storage nodes is colocated with the client initiating the write, but thats inconsequential.
On Jun 6, 2017, at 1:58 PM, Alexis Richardson < alexis@...> wrote:
Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:
- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably
- A pool can be configured to store replicas of data (typically 3x) or erasure coded chunks (different algorithms and factors are supported).
- when a storage node is lost the others help re-replicate the data (i.e. data maintenance).
Now the failure cases:
- if too many nodes in the same pool are lost within a short window of time you’ll suffer data loss, for example, all three nodes in a 3x replica are lost at the same time.
- if there are not enough resources to replicate/regenerate the data before more losses occur.
To guard against such failures, most systems (including Rook) do the following:
- storage nodes are spread across failure domains (different hosts, racks, zones etc.)
- prioritize resources for “data maintenance” over resources used for “normal" data operations.
In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones. Also ensuring that you’ve sized the machines and network to support data maintenance. Prioritization
schemes also help, for example, SRV-IO is a popular way to do so without massive changes to the network.
Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.
> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on-top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / detach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in less than minutes vs. up to an hour with EBS.
>>
>> Also EBS volumes only support a single writer (ReadWriteOnce in K8S) which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySql. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany to support a hotter failover where the
>> app/service ontop can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on-top of
>> ephemeral/instance storage vs. EBS. For example, a lot of the instance
>> storage is unused in most deployments and has a high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1]
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitter.im%2Frook%2Frook%3Fat%3D58baff6f872fc8ce62b6ee26&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=8dSWMXRMqtmPH5goZx4O%2BpVTesQEuS4cb21qgJmmTw0%3D&reserved=0
>> [2]
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fpull%2F46843&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=UGWvqRpP8P0sanBnGygcfwIYiU7tvKobJ7s8JtiWlFw%3D&reserved=0
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this. I missed the details of that. Can
>> you say more please?
>>
>> alexis
>>
>>
|
|
Bassam Tabbara <Bassam.Tabbara@...>
Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.
However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.
toggle quoted messageShow quoted text
On Jun 6, 2017, at 2:22 PM, Alexis Richardson < alexis@...> wrote:
Ok so in the single node case you are less reliable than S3?
No. There are no writes to S3 at all, and that would slow down write significantly.
If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.
On Jun 6, 2017, at 2:13 PM, Alexis Richardson < alexis@...> wrote:
Does the first storage node write to S3 before or after pushing updates to the other nodes
Sure. Assuming a pod running in K8S has already mounted a Rook volume the main flow is:
- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write the primary storage node completes the write operation.
Note there is no distinction between local and remote writes in this flow, they all result in network traffic due to replication. Its possible that one of the storage nodes is colocated with the client initiating the write, but thats inconsequential.
On Jun 6, 2017, at 1:58 PM, Alexis Richardson < alexis@...> wrote:
Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:
- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably
- A pool can be configured to store replicas of data (typically 3x) or erasure coded chunks (different algorithms and factors are supported).
- when a storage node is lost the others help re-replicate the data (i.e. data maintenance).
Now the failure cases:
- if too many nodes in the same pool are lost within a short window of time you’ll suffer data loss, for example, all three nodes in a 3x replica are lost at the same time.
- if there are not enough resources to replicate/regenerate the data before more losses occur.
To guard against such failures, most systems (including Rook) do the following:
- storage nodes are spread across failure domains (different hosts, racks, zones etc.)
- prioritize resources for “data maintenance” over resources used for “normal" data operations.
In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones. Also ensuring that you’ve sized the machines and network to support data maintenance. Prioritization
schemes also help, for example, SRV-IO is a popular way to do so without massive changes to the network.
Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.
> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on-top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / detach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in less than minutes vs. up to an hour with EBS.
>>
>> Also EBS volumes only support a single writer (ReadWriteOnce in K8S) which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySql. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany to support a hotter failover where the
>> app/service ontop can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on-top of
>> ephemeral/instance storage vs. EBS. For example, a lot of the instance
>> storage is unused in most deployments and has a high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1]
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitter.im%2Frook%2Frook%3Fat%3D58baff6f872fc8ce62b6ee26&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=8dSWMXRMqtmPH5goZx4O%2BpVTesQEuS4cb21qgJmmTw0%3D&reserved=0
>> [2]
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fpull%2F46843&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=UGWvqRpP8P0sanBnGygcfwIYiU7tvKobJ7s8JtiWlFw%3D&reserved=0
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this. I missed the details of that. Can
>> you say more please?
>>
>> alexis
>>
>>
|
|
Ok so in the single node case you are less reliable than S3?
toggle quoted messageShow quoted text
No. There are no writes to S3 at all, and that would slow down write significantly.
If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.
On Jun 6, 2017, at 2:13 PM, Alexis Richardson < alexis@...> wrote:
Does the first storage node write to S3 before or after pushing updates to the other nodes
Sure. Assuming a pod running in K8S has already mounted a Rook volume the main flow is:
- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write the primary storage node completes the write operation.
Note there is no distinction between local and remote writes in this flow, they all result in network traffic due to replication. Its possible that one of the storage nodes is colocated with the client initiating the write, but thats inconsequential.
On Jun 6, 2017, at 1:58 PM, Alexis Richardson < alexis@...> wrote:
Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:
- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably
- A pool can be configured to store replicas of data (typically 3x) or erasure coded chunks (different algorithms and factors are supported).
- when a storage node is lost the others help re-replicate the data (i.e. data maintenance).
Now the failure cases:
- if too many nodes in the same pool are lost within a short window of time you’ll suffer data loss, for example, all three nodes in a 3x replica are lost at the same time.
- if there are not enough resources to replicate/regenerate the data before more losses occur.
To guard against such failures, most systems (including Rook) do the following:
- storage nodes are spread across failure domains (different hosts, racks, zones etc.)
- prioritize resources for “data maintenance” over resources used for “normal" data operations.
In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones. Also ensuring that you’ve sized the machines and network to support data maintenance. Prioritization
schemes also help, for example, SRV-IO is a popular way to do so without massive changes to the network.
Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.
> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on-top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / detach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in less than minutes vs. up to an hour with EBS.
>>
>> Also EBS volumes only support a single writer (ReadWriteOnce in K8S) which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySql. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany to support a hotter failover where the
>> app/service ontop can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on-top of
>> ephemeral/instance storage vs. EBS. For example, a lot of the instance
>> storage is unused in most deployments and has a high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1]
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitter.im%2Frook%2Frook%3Fat%3D58baff6f872fc8ce62b6ee26&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=8dSWMXRMqtmPH5goZx4O%2BpVTesQEuS4cb21qgJmmTw0%3D&reserved=0
>> [2]
https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fpull%2F46843&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=UGWvqRpP8P0sanBnGygcfwIYiU7tvKobJ7s8JtiWlFw%3D&reserved=0
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this. I missed the details of that. Can
>> you say more please?
>>
>> alexis
>>
>>
|
|
Bassam Tabbara <Bassam.Tabbara@...>
No. There are no writes to S3 at all, and that would slow down write significantly.
If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.
toggle quoted messageShow quoted text
On Jun 6, 2017, at 2:13 PM, Alexis Richardson < alexis@...> wrote:
Does the first storage node write to S3 before or after pushing updates to the other nodes
Bassam Tabbara <Bassam.Tabbara@...>
Sure. Assuming a pod running in K8S has already mounted a Rook volume, the main flow is:
- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write, the primary storage node completes the write operation.
Note there is no distinction between local and remote writes in this flow; they all result in network traffic due to replication. It's possible that one of the storage nodes is colocated with the client initiating the write, but that's inconsequential.
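[Editor's note: not Rook's actual code, but a toy model of the primary-copy write path just described may help: the primary fans the write out to its replicas in parallel and acknowledges the client only once every replica has acknowledged.]

# A toy model of the write path described above: the primary storage node
# fans a write out to its replicas in parallel and acknowledges the client
# only after every replica has acked. Illustrative, not Rook code.
from concurrent.futures import ThreadPoolExecutor

REPLICAS = ["node-b", "node-c"]  # hypothetical peer storage nodes

def replicate(node: str, data: bytes) -> bool:
    # Stand-in for a network round trip to a replica; in a real system this
    # is an RPC that returns once the replica has durably stored the data.
    print(f"replica {node}: stored {len(data)} bytes")
    return True

def primary_write(data: bytes) -> None:
    # Fan out to all replicas in parallel, as in the flow above.
    with ThreadPoolExecutor(max_workers=len(REPLICAS)) as pool:
        acks = list(pool.map(lambda n: replicate(n, data), REPLICAS))
    # Complete the client's write only once all replicas have acked.
    assert all(acks)
    print("primary: write acknowledged to client")

primary_write(b"example block")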
> On Jun 6, 2017, at 1:58 PM, Alexis Richardson <alexis@...> wrote:
>
> Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (e.g. k8s).