
News from Kubernetes leadership summit

alexis richardson
 

CNCF community,

The Kubernetes project is at full tilt.

Please see below a summary of the recent Leadership Summit, a
gathering of mostly technical folk driving this project. Apologies if
some hyperlinks are missing; please refer to Brian's post at
https://groups.google.com/forum/#!topic/kubernetes-dev/PpgLgkffr3o

At the CNCF we want many such projects, all learning from each other.
Help make that happen: as you can see, the project is breaking the
bounds of even modern tools and structures. There are many
opportunities to help, so please speak up here, or contact the
relevant project leads.

alexis



---------- Forwarded message ----------
From: 'Brian Grant' via Kubernetes developer/contributor discussion
<kubernetes-dev@googlegroups.com>
Date: Mon, Jun 12, 2017 at 8:18 PM
Subject: Leadership summit summary and outcomes
To: "kubernetes-dev@googlegroups.com" <kubernetes-dev@googlegroups.com>


A group of O(100) of us met on Friday, June 2, at Samsung in San Jose.
We're working on getting notes from the meeting checked into GitHub.
In the meantime, I thought I'd give a summary. Others who attended are
welcome to follow up with their takeaways.

Tim (@thockin) presented an overview of the state of the project.
After covering the great progress we've made, we talked about having
reached an inflection point in the project. There was broad consensus
among those present that the project needs to increase focus on:

- Finishing features/APIs, especially table-stakes ones such as
  Deployment, Ingress, RBAC, and encrypted Secrets (as opposed to
  adding net new concepts)
- Architectural layering, modularity, project boundaries
- Stability, predictability, fixing bugs, paying down technical debt
- Easier "on ramps": user docs, examples, best practices, installers,
  tools, status, debugging
- Contributor experience, tooling, and testing
- Governance
- Conformance

We discussed the need to refresh the roadmap assembled in November,
which was presented by Aparna and Ihor, along with some interesting
data, such as penetration of container orchestrators (~7%) and which
SIGs have the most open issues (Node and API machinery).

Brandon (@philips) and I presented more of the motivation for the
Governance proposal, and solicited nominations for the Steering
Committee. Please, please, please do comment on the governance
proposal, even if just to LGTM, and seriously consider running for the
Steering Committee. We asked SIGs to start working on their charters.
I also spoke about the role of CNCF with respect to the project.

I presented my architecture roadmap proposal, and received positive
feedback. It put the extension mechanisms underway, such as API
aggregation, into context. One outcome was the mandate to form SIG
Architecture. An Extensibility Working Group was also discussed, but
perhaps the Architecture SIG could serve the role of driving the
needed extension mechanisms forward.

The discussion about code organization mostly centered around the
effort to scale the project to multiple GitHub repos and orgs. GitHub
provides exactly two levels of hierarchy; we need to use both
effectively. By multiple metrics, kubernetes/kubernetes is the most
active repo on GitHub, yet all of GitHub's mechanisms (e.g., permissions,
notifications, hooks) are designed to support small, focused repos.
Every other project of comparable scale comprises many repos
(e.g., Node.js has ~100 and Cloud Foundry has ~300). The
kubernetes/kubernetes commit rate peaked in July 2015, when the
project was 10x smaller, and most commits on the project are already
outside kubernetes/kubernetes.

Additionally, there is a desire to at least start new functionality
outside the main repo/release. Since Kubernetes is an API-centric
system and since we're using the API machinery for component
configuration as well, the API machinery needs to be made available to
other repos in order for any significant development to be feasible
outside kubernetes/kubernetes. We're using Service Catalog
(https://github.com/kubernetes-incubator/service-catalog) as a driving
use case for this. We've also started serious work on moving out
kubectl, which is at least as important symbolically as it is
technically, and have stopped accepting new cloudprovider
implementations.

The discussion about areas falling through the cracks focused on what
to do about them. There was consensus that some SIG needs to own the
build machinery. Proposals included SIG release, SIG testing, SIG
contributor experience, and SIG build (i.e., a new SIG). It was
suggested that whatever SIG owns the build should also own utility
libraries. In addition to strategies that have been discussed before
(e.g., accountability metrics, leaderboard, help wanted / janitors,
rotations), we discussed the idea of creating job descriptions for
more project/SIG roles, as has been done for the release team, as a
way to make it more clear to participating companies and individuals
what areas need to be staffed.

I'm looking forward to the notes from the SIG breakout, which was at
the same time as the "falling through the cracks" session. It sounds
like there were good discussions about SIG leadership, organization,
communication, consolidation, participation, and other topics.

Similar for the community effectiveness breakout, which discussed a
number of topics, including how to convert passive attendees to active
participants.

Look for the full summit notes over the next couple of weeks, as well
as follow-up on action items during the community hangout.

Thanks to Cameron for organizing the event, to everyone else who
helped with the summit, to Samsung for hosting it, and to everyone who
participated.

--Brian

--
You received this message because you are subscribed to the Google
Groups "Kubernetes developer/contributor discussion" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to kubernetes-dev+unsubscribe@googlegroups.com.
To post to this group, send email to kubernetes-dev@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/kubernetes-dev/CAKCBhs4MYHjS%3DhJTDSHCQWCtUcubOun9MKnreY5rcqerwy_GkQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


IMPORTANT - CNCF TOC Goals and Operating Principles - v0.2

alexis richardson
 

Broadening beyond TOC to add CNCF GB & Marketing.


CNCF community,

PLEASE review this doc, whose purpose is to summarise the thinking of
the TOC concerning project selection, governance, and other frequently
requested topics.

https://docs.google.com/document/d/1Yl3IPpZnEWJRaXSBsTQ22ymQF57N5x_nHVHvmJdAj9Y/edit

This is important - please do engage. Currently this document is a
draft. Since the TOC operates by vote, these principles may in future
become written precedent.

alexis



On Mon, May 15, 2017 at 4:43 PM, Alexis Richardson <alexis@weave.works> wrote:
Hi

Out of a desire to start writing down more how CNCF works, and what
our principles are, Brian, Ken and I pulled some ideas into a doc:

https://docs.google.com/document/d/1Yl3IPpZnEWJRaXSBsTQ22ymQF57N5x_nHVHvmJdAj9Y/edit

Comments are solicited.

Please don't be too harsh - this is just the first iteration.

alexis


CNCF Storage WG 6/9/2017 Meeting Minutes

Chris Aniszczyk
 

Thanks, everyone, for showing up for the first meeting; it was great to have ~40 folks!

Here are the minutes: https://goo.gl/wRqerO

Here is the full recording: https://youtu.be/qAw3y6rdRbs

The CSI slides presented are here: https://goo.gl/cjkKZE

The next meeting will be in two weeks, June 23rd at 8am PT. 

--
Chris Aniszczyk (@cra) | +1-512-961-6719


Re: Serverless Workgroup Kickoff

Kenneth Owens (kenowens) <kenowens@...>
 

No worries! 





-------- Original message --------
From: Alexis Richardson <alexis@...>
Date: 6/9/17 12:52 AM (GMT-06:00)
To: "Kenneth Owens (kenowens)" <kenowens@...>, cncf-toc@...
Cc: cncf-wg-serverless <cncf-wg-serverless@...>
Subject: Re: [cncf-toc] Serverless Workgroup Kickoff

Thank you all for getting this off the ground.


On Thu, 8 Jun 2017, 18:16 Kenneth Owens (kenowens) via cncf-toc, <cncf-toc@...> wrote:

Just wanted to give everyone an update on the serverless workgroup. We had our initial meeting, agreed on the objective of documenting what Serverless means to Cloud Native, outlined a white/position paper, assigned ownership of the sections/topics, and defined a draft deliverable date of July 6th.

 

If you’re interested in tracking the progress, joining, or getting involved, please check our GitHub page often, join the Google group, or contact me directly.


https://github.com/cncf/wg-serverless

 

 


 

Kenneth Owens

CTO

kenowens@...

Tel: +1 408 424 0872

Cisco Systems, Inc.

16401 Swingley Ridge Road Suite 400
CHESTERFIELD
63017
United States
cisco.com

 

Think before you print.

This email may contain confidential and privileged material for the sole use of the intended recipient. Any review, use, distribution or disclosure by others is strictly prohibited. If you are not the intended recipient (or authorized to receive for the recipient), please contact the sender by reply email and delete all copies of this message.


 

_______________________________________________
cncf-toc mailing list
cncf-toc@...
https://lists.cncf.io/mailman/listinfo/cncf-toc


Re: Serverless Workgroup Kickoff

alexis richardson
 

Thank you all for getting this off the ground.


On Thu, 8 Jun 2017, 18:16 Kenneth Owens (kenowens) via cncf-toc, <cncf-toc@...> wrote:

Just wanted to give everyone an update on the serverless workgroup. We had our initial meeting, agreed on the objective of documenting what Serverless means to Cloud Native, outlined a white/position paper, assigned ownership of the sections/topics, and defined a draft deliverable date of July 6th.

 

If you’re interested in tracking the progress, joining, or getting involved, please check our GitHub page often, join the Google group, or contact me directly.


https://github.com/cncf/wg-serverless

 

 


 

Kenneth Owens

CTO

kenowens@...

Tel: +1 408 424 0872

Cisco Systems, Inc.

16401 Swingley Ridge Road Suite 400
CHESTERFIELD
63017
United States
cisco.com

 


 



Serverless Workgroup Kickoff

Kenneth Owens (kenowens) <kenowens@...>
 

Just wanted to give everyone an update on the serverless workgroup. We had our initial meeting, agreed on the objective of documenting what Serverless means to Cloud Native, outlined a white/position paper, assigned ownership of the sections/topics, and defined a draft deliverable date of July 6th.

 

If you’re interested in tracking the progress, joining, or getting involved, please check our GitHub page often, join the Google group, or contact me directly.


https://github.com/cncf/wg-serverless

 

 


 

Kenneth Owens

CTO

kenowens@...

Tel: +1 408 424 0872

Cisco Systems, Inc.

16401 Swingley Ridge Road Suite 400
CHESTERFIELD
63017
United States
cisco.com

 


 


Re: rook.io

Bassam Tabbara <Bassam.Tabbara@...>
 

> I understand that part ;)

Sorry, I was thrown off by your “what does it do differently than Amazon” :-)

> The part that I don't fully grok is what you mean by "storage". Forgive my naive questions.

I mean block storage (raw block devices underneath pods, comparable to AWS EBS), file systems (shared POSIX-compliant file systems, comparable to AWS EFS), and object storage (comparable to AWS S3).




Re: rook.io

alexis richardson
 

Bassam

I understand that part ;)

The part that I don't fully grok is what you mean by "storage".  Forgive my naive questions. 

Alexis


On Tue, 6 Jun 2017, 23:18 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Camille, yes! :-)

The premise of Rook is to provide a cloud-native alternative to Amazon S3, EBS, and EFS. These storage services can run directly inside the cloud-native environment vs. vendor-specific implementations outside it. Running them inside the cluster enables us to integrate them more deeply (scheduling, security, management, etc.) and enables multi-cloud and federation.

This is similar to the relationship between Istio/Envoy and Amazon ELB, Prometheus and Amazon CloudWatch, etc.

Alexis, I know you were having audio issues, I’d be happy to repeat the talk this morning if you’d like.

On Jun 6, 2017, at 3:01 PM, Camille Fournier <skamille@...> wrote:

Presumably providing an abstraction layer that can front Amazon, Google, and internal cloud offerings. For those of us who need such a thing.

On Jun 6, 2017 5:55 PM, "Alexis Richardson via cncf-toc" <cncf-toc@...> wrote:

I don't understand what you are doing which is better than what Amazon can do alone.


On Tue, 6 Jun 2017, 22:50 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Do you mean EBS? For S3, which has 11 9’s of durability (over a year), you’d need tens of Rook storage nodes spread across three availability zones.

For block storage, i.e. the use case in the original thread, you can make do with a smaller number of nodes. You can run with two, but I would recommend you deploy more.


On Jun 6, 2017, at 2:43 PM, Alexis Richardson <alexis@...> wrote:

So you need a pair of rook nodes to be more reliable than S3?


On Tue, 6 Jun 2017, 22:27 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.

However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.

On Jun 6, 2017, at 2:22 PM, Alexis Richardson <alexis@...> wrote:

Ok so in the single node case you are less reliable than S3?


On Tue, 6 Jun 2017, 22:15 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
No. There are no writes to S3 at all; that would slow down writes significantly.

If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.

On Jun 6, 2017, at 2:13 PM, Alexis Richardson <alexis@...> wrote:

Does the first storage node write to S3 before or after pushing updates to the other nodes


On Tue, 6 Jun 2017, 22:10 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Sure. Assuming a pod running in K8S has already mounted a Rook volume, the main flow is:

- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write, the primary storage node completes the write operation.
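The fan-out and ack barrier described above can be sketched in Go. This is a toy sketch, not Rook's actual code; sendToReplica and the node names are invented stand-ins for the network round trip:

```go
package main

import (
	"fmt"
	"sync"
)

// sendToReplica stands in for a network write plus acknowledgement
// round trip to one storage node (illustrative, not Rook's API).
func sendToReplica(node string, data []byte) error {
	fmt.Printf("replicated %d bytes to %s\n", len(data), node)
	return nil
}

// replicate mirrors the flow above: the primary fans the write out to
// its replicas in parallel and completes the client's write only after
// every replica has acknowledged.
func replicate(data []byte, replicas []string) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(replicas))
	for _, r := range replicas {
		wg.Add(1)
		go func(node string) {
			defer wg.Done()
			if err := sendToReplica(node, data); err != nil {
				errs <- err
			}
		}(r)
	}
	wg.Wait() // ack barrier: wait for all replicas
	close(errs)
	for err := range errs {
		return err // any replica failure fails the write
	}
	return nil
}

func main() {
	if err := replicate([]byte("block-0"), []string{"node-b", "node-c"}); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println("write acknowledged to client")
}
```

Whether the client happens to be colocated with a storage node changes nothing in this sketch, matching the point above that all writes result in network traffic due to replication.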

Note there is no distinction between local and remote writes in this flow; they all result in network traffic due to replication. It's possible that one of the storage nodes is colocated with the client initiating the write, but that's inconsequential.


On Jun 6, 2017, at 1:58 PM, Alexis Richardson <alexis@...> wrote:

Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).


On Tue, 6 Jun 2017, 21:39 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:

- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably.
- A pool can be configured to store replicas of data (typically 3x) or erasure-coded chunks (different algorithms and factors are supported).
- When a storage node is lost, the others help re-replicate the data (i.e., data maintenance).

Now the failure cases:

- If too many nodes in the same pool are lost within a short window of time, you'll suffer data loss; for example, all three nodes holding a 3x replica are lost at the same time.
- If there are not enough resources to replicate/regenerate the data before more losses occur.
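As a back-of-the-envelope sketch of the first case, assuming independent node failures and purely illustrative numbers (a 2% annual failure rate per node and a 1-hour repair window, neither of which is from this thread):

```go
package main

import (
	"fmt"
	"math"
)

// pLoss estimates the chance that all replicas of a chunk fail within a
// single repair window, assuming independent failures at a constant
// annual failure rate (afr). Numbers are illustrative only.
func pLoss(afr, windowYears float64, replicas int) float64 {
	pNode := 1 - math.Exp(-afr*windowYears) // one node dying in the window
	return math.Pow(pNode, float64(replicas))
}

func main() {
	afr := 0.02                // assumed 2% annual failure rate per node
	window := 1.0 / (365 * 24) // assumed 1-hour re-replication window
	fmt.Printf("p(all 3 replicas lost in one window) ~ %.1e\n", pLoss(afr, window, 3))
}
```

Because the loss probability scales roughly with the cube of the repair window (for 3x replication), shrinking that window, i.e. prioritizing data maintenance, buys far more durability than the occasional extra node.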

To guard against such failures, most systems (including Rook) do the following:

- spread storage nodes across failure domains (different hosts, racks, zones, etc.)
- prioritize resources for "data maintenance" over resources used for "normal" data operations.

In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones, and ensuring that you've sized the machines and network to support data maintenance. Prioritization schemes also help; for example, SR-IOV is a popular way to do so without massive changes to the network.

Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.

> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on-top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / attach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in minutes vs. up to an hour with EBS.
>>
>> Also, EBS volumes only support a single writer (ReadWriteOnce in K8S), which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySQL. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany to support a hotter failover where the
>> app/service on top can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on top of
>> ephemeral/instance storage vs. EBS. For example, a lot of the instance
>> storage is unused in most deployments and offers high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1] https://gitter.im/rook/rook?at=58baff6f872fc8ce62b6ee26
>> [2] https://github.com/kubernetes/kubernetes/pull/46843
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this.  I missed the details of that.  Can
>> you say more please?
>>
>> alexis
>>
>>










Re: Continued InfraKit Discussion

David Chung <david.chung@...>
 

Hi Chris,  

Thanks for starting the thread. In a nutshell, InfraKit focuses on being lightweight and embeddable while enabling self-management and capacity-scaling features for the entire higher-level system. There are also differences in state management and deployment compared to these larger, established tools. Brief summaries of how InfraKit compares to Terraform, BOSH, and Digital Rebar follow. Feedback, comments, and questions are most welcome.

Thanks,
David


InfraKit vs Terraform

Terraform and InfraKit share a declarative approach to infrastructure. Terraform can provision infrastructure resources and reconcile differences between the infrastructure and the user's specification. InfraKit builds on the same concepts and adds continuously running microservices, or controllers, so that reconciliation is continuous rather than on-demand. InfraKit also provides features such as node scaling groups, so that cluster capacity scaling can be triggered directly from higher-level orchestration systems. In this use case, InfraKit and Terraform can be complementary, with Terraform acting as the 'execution backend' for InfraKit. As an example, there is a Terraform plugin in the project which can be leveraged to provision resource types not yet natively supported by InfraKit.
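The difference between on-demand and continuous reconciliation can be illustrated with a minimal loop. The Spec/Infra types and method names below are invented for this sketch, not InfraKit's actual plugin API:

```go
package main

import "fmt"

// Spec is the user's declared group size; Infra is the observed world.
// Both are invented for this sketch, not InfraKit's plugin API.
type Spec struct{ DesiredCount int }

type Infra struct{ instances []string }

func (in *Infra) Observe() int { return len(in.instances) }

func (in *Infra) Provision() {
	in.instances = append(in.instances, fmt.Sprintf("node-%d", len(in.instances)))
}

func (in *Infra) Destroy() { in.instances = in.instances[:len(in.instances)-1] }

// reconcile is one pass of the loop: observe actual state, compare it
// to the spec, and converge. A controller runs this continuously
// (e.g. on a timer), whereas a CLI tool runs it only when invoked.
func reconcile(spec Spec, infra *Infra) {
	for infra.Observe() < spec.DesiredCount {
		infra.Provision()
	}
	for infra.Observe() > spec.DesiredCount {
		infra.Destroy()
	}
}

func main() {
	infra := &Infra{}
	reconcile(Spec{DesiredCount: 3}, infra)
	fmt.Println("instances:", infra.Observe()) // prints "instances: 3"
}
```

Running this single pass on a timer is essentially what turns a provisioning tool into a controller: node failures then show up as Observe() falling below DesiredCount and are repaired on the next pass.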

 

Terraform and InfraKit differ in state management and deployment. InfraKit can leverage the system it is embedded into, such as Docker Swarm or etcd, for leader election and persistence of user specs. Unlike Terraform, which has its own schema and different backend plugins for storing infrastructure state, InfraKit derives the infrastructure state via introspection and queries supported by the lower-level platform (tags and metadata). The infrastructure itself is the source of truth, rather than a representation of it that can be corrupted or go out of sync. InfraKit can be part of the control plane of a higher-level orchestration system without additional resource requirements. In this use case, InfraKit also readily integrates with the higher-level system as service containers, while Terraform is primarily a CLI tool for ops and is not meant to be run as a service for higher-level systems.

 

InfraKit is designed as an active system continuously monitoring your infrastructure and reconciling its reality with a specification provided either by a user or a higher level orchestration system, with an immutable infrastructure approach, while Terraform is designed as a tool to automate existing ops workflows using a declarative description of infrastructure, without the active monitoring and reconciliation approach.

 

InfraKit vs BOSH

BOSH is much more vertically integrated than InfraKit in that it is itself a full-fledged distributed application with its own datastore (Postgres) and server components. In terms of workflow, BOSH also covers areas outside infrastructure provisioning and management: it has opinionated workflows for application definition and software releases, which reach into the upper layers of the CNCF model. The full system serves as part of the control plane of a much larger platform and thus requires its own provisioning and management.

 

InfraKit takes a lighter-weight approach. As a set of pluggable microservices, it is meant to be embedded into a larger system to minimize additional operational burden. With InfraKit embedded in the same control plane, these systems can self-install and self-manage. InfraKit also does not require a database such as Postgres to store state; instead, it inspects the infrastructure and reconstructs what is in the environment for reconciliation with the user specification. This means the only thing that can diverge from the user spec is the true infrastructure state, and if divergence is detected, InfraKit will attempt to correct it. InfraKit can also play the role of a cloud provider for a node-group autoscaler in a container orchestrator like K8s or Docker Swarm, thereby bringing the elasticity common in the public cloud to on-prem bare-metal or virtualized environments.

 

InfraKit vs Digital Rebar

 

Digital Rebar is comparable to RackHD or MaaS. Currently there is community integration with RackHD, and there is a MaaS plugin in the InfraKit repo. At a high level, Digital Rebar and InfraKit are complementary: InfraKit can leverage Digital Rebar for bare-metal provisioning via its PXE/DHCP/TFTP infrastructure, while InfraKit provides orchestration capabilities such as scaling clusters up/down and rolling updates.

 

Digital Rebar is itself a full-fledged distributed application of many components, including a Postgres database where it stores infrastructure state. In terms of deployment, it has its own control plane that needs to be provisioned and maintained. InfraKit, by contrast, can be embedded inside the control plane of a higher-level orchestration system such as Kubernetes or Docker Swarm. InfraKit leverages these systems for persistence of the user's infrastructure specification and for leader election (to provide high availability), and infrastructure state is reconstructed by introspecting the infrastructure. This means that InfraKit is more a 'kit' than another platform, and higher-level systems can incorporate InfraKit to provide self-managing and self-provisioning capabilities.
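The "infrastructure as the source of truth" idea above can be sketched by reconstructing group membership from instance tags at query time, rather than from a stored state file or database. The tag key and types here are invented for illustration, not InfraKit's actual tag scheme:

```go
package main

import "fmt"

// Instance is a provisioned machine as reported by a cloud or
// bare-metal API. The "infrakit.group" tag key is illustrative only.
type Instance struct {
	ID   string
	Tags map[string]string
}

// groupState reconstructs a group's membership purely by introspecting
// instance tags; there is no separate state store that can drift from
// what actually exists.
func groupState(all []Instance, group string) []string {
	var members []string
	for _, inst := range all {
		if inst.Tags["infrakit.group"] == group {
			members = append(members, inst.ID)
		}
	}
	return members
}

func main() {
	cloud := []Instance{
		{ID: "i-1", Tags: map[string]string{"infrakit.group": "workers"}},
		{ID: "i-2", Tags: map[string]string{"infrakit.group": "managers"}},
		{ID: "i-3", Tags: map[string]string{"infrakit.group": "workers"}},
	}
	fmt.Println(groupState(cloud, "workers")) // prints "[i-1 i-3]"
}
```

Tagging at provision time and querying at reconcile time is what lets the 'kit' stay stateless: if an instance disappears, the next query simply no longer returns it.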

 



On Tue, Jun 6, 2017 at 8:41 AM, Chris Aniszczyk <caniszczyk@...> wrote:
From today's CNCF TOC call, there was some discussion on how InfraKit compares to Terraform, BOSH and Digital Rebar. Thanks again to David for taking the time to present.

Let's use this thread to have that discussion.

--
Chris Aniszczyk (@cra) | +1-512-961-6719


Re: rook.io

Bassam Tabbara <Bassam.Tabbara@...>
 

Camille, yes! :-)

The premise of Rook is to provide a cloud-native alternative to Amazon S3, EBS, and EFS. These storage services can run directly inside the cloud-native environment vs. vendor-specific implementations outside it. Running them inside the cluster enables us to integrate them more deeply (scheduling, security, management, etc.) and enables multi-cloud and federation.

This is similar to the relationship between Istio/Envoy and Amazon ELB, Prometheus and Amazon CloudWatch, etc.

Alexis, I know you were having audio issues, I’d be happy to repeat the talk this morning if you’d like.

On Jun 6, 2017, at 3:01 PM, Camille Fournier <skamille@...> wrote:

Presumably providing an abstraction layer that can front Amazon, Google, and internal cloud offerings. For those of us who need such a thing.

On Jun 6, 2017 5:55 PM, "Alexis Richardson via cncf-toc" <cncf-toc@...> wrote:

I don't understand what you are doing which is better than what Amazon can do alone.


On Tue, 6 Jun 2017, 22:50 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Do you mean EBS? For S3, which has 11 9’s of durability (over a year), you’d need tens of Rook storage nodes spread across three availability zones.

For block storage, i.e. the use case in the original thread, you can make do with a smaller number of nodes. You can run with two, but I would recommend you deploy more.


On Jun 6, 2017, at 2:43 PM, Alexis Richardson <alexis@...> wrote:

So you need a pair of rook nodes to be more reliable than S3?


On Tue, 6 Jun 2017, 22:27 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.

However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.

On Jun 6, 2017, at 2:22 PM, Alexis Richardson <alexis@...> wrote:

Ok so in the single node case you are less reliable than S3?


On Tue, 6 Jun 2017, 22:15 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
No. There are no writes to S3 at all; that would slow down writes significantly.

If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.

On Jun 6, 2017, at 2:13 PM, Alexis Richardson <alexis@...> wrote:

Does the first storage node write to S3 before or after pushing updates to the other nodes


On Tue, 6 Jun 2017, 22:10 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Sure. Assuming a pod running in K8S has already mounted a Rook volume, the main flow is:

- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write, the primary storage node completes the write operation.

Note there is no distinction between local and remote writes in this flow; they all result in network traffic due to replication. It's possible that one of the storage nodes is colocated with the client initiating the write, but that's inconsequential.


On Jun 6, 2017, at 1:58 PM, Alexis Richardson <alexis@...> wrote:

Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).


On Tue, 6 Jun 2017, 21:39 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:

- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably.
- A pool can be configured to store replicas of data (typically 3x) or erasure-coded chunks (different algorithms and factors are supported).
- When a storage node is lost, the others help re-replicate the data (i.e., data maintenance).

Now the failure cases:

- If too many nodes in the same pool are lost within a short window of time, you'll suffer data loss; for example, all three nodes holding a 3x replica are lost at the same time.
- If there are not enough resources to replicate/regenerate the data before more losses occur.

To guard against such failures, most systems (including Rook) do the following:

- spread storage nodes across failure domains (different hosts, racks, zones, etc.)
- prioritize resources for "data maintenance" over resources used for "normal" data operations.

In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones, and ensuring that you've sized the machines and network to support data maintenance. Prioritization schemes also help; for example, SR-IOV is a popular way to do so without massive changes to the network.

Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.

> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on-top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / attach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in minutes vs. up to an hour with EBS.
>>
>> Also, EBS volumes only support a single writer (ReadWriteOnce in K8S), which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySQL. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany, enabling a hotter failover where the
>> app/service on top can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on top of
>> ephemeral/instance storage vs. EBS. For example, much of the instance
>> storage in most deployments sits unused, and it offers high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1] https://gitter.im/rook/rook?at=58baff6f872fc8ce62b6ee26
>> [2] https://github.com/kubernetes/kubernetes/pull/46843
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this.  I missed the details of that.  Can
>> you say more please?
>>
>> alexis
>>
>>






_______________________________________________
cncf-toc mailing list
cncf-toc@...
https://lists.cncf.io/mailman/listinfo/cncf-toc




Re: rook.io

Camille Fournier
 

Presumably providing an abstraction layer that can front Amazon, Google, and internal cloud offerings. For those of us who need such a thing.

On Jun 6, 2017 5:55 PM, "Alexis Richardson via cncf-toc" <cncf-toc@...> wrote:

I don't understand what you are doing which is better than what Amazon can do alone.


On Tue, 6 Jun 2017, 22:50 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Do you mean EBS? For S3, which has 11 9’s of durability (over a year), you’d need tens of Rook storage nodes spread across three availability zones.

For block storage, i.e. the use case in the original thread, you can make do with a smaller number of nodes. You can run with two, but I would recommend you deploy more.


On Jun 6, 2017, at 2:43 PM, Alexis Richardson <alexis@...> wrote:

So you need a pair of rook nodes to be more reliable than S3?


On Tue, 6 Jun 2017, 22:27 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.

However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.
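To get a rough feel for how replication multiplies durability, here is a back-of-the-envelope calculation. The node failure rate and repair window are illustrative assumptions, not measured Rook numbers, and the independence assumption is optimistic (real failures correlate):

```python
import math

def annual_loss_probability(node_afr, repair_window_hours, replicas=3):
    """Probability of losing all replicas of one group within a single
    repair window, summed over the windows in a year. Assumes node
    failures are independent and uniformly spread over the year."""
    windows_per_year = 8760 / repair_window_hours
    q = node_afr / windows_per_year        # per-window failure probability
    return windows_per_year * q ** replicas

# Hypothetical 2% node AFR, 1-hour re-replication window, 3x pool.
p = annual_loss_probability(node_afr=0.02, repair_window_hours=1.0)
nines = -math.log10(p)
print(f"annual loss probability ~{p:.1e} (~{nines:.0f} nines)")
```

The point of the sketch: the exponent on the per-window failure probability is what buys the extra nines, and shrinking the repair window (fast data maintenance) shrinks the window in which correlated losses can hurt you.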

On Jun 6, 2017, at 2:22 PM, Alexis Richardson <alexis@...> wrote:

Ok so in the single node case you are less reliable than S3?


On Tue, 6 Jun 2017, 22:15 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
No. There are no writes to S3 at all; that would slow down writes significantly.

If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.

On Jun 6, 2017, at 2:13 PM, Alexis Richardson <alexis@...> wrote:

Does the first storage node write to S3 before or after pushing updates to the other nodes


On Tue, 6 Jun 2017, 22:10 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Sure. Assuming a pod running in K8S has already mounted a Rook volume, the main flow is:

- Writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network.
- One of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume.
- Once the other replicas acknowledge the write, the primary storage node completes the write operation.

Note there is no distinction between local and remote writes in this flow; they all result in network traffic due to replication. It’s possible that one of the storage nodes is colocated with the client initiating the write, but that’s inconsequential.
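The three steps above can be sketched as a toy model of primary-based replication. This is not Rook's actual data path (Ceph handles that underneath); it only illustrates the replicate-then-acknowledge ordering:

```python
from concurrent.futures import ThreadPoolExecutor

class Replica:
    """A storage node holding blocks for one pool (toy model)."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, offset, data):
        self.blocks[offset] = data      # persist, then acknowledge
        return True

def primary_write(primary, peers, offset, data):
    """The primary persists the write, replicates to its peers in
    parallel, and completes only after every replica acknowledges."""
    primary.write(offset, data)
    with ThreadPoolExecutor() as pool:
        acks = list(pool.map(lambda r: r.write(offset, data), peers))
    return all(acks)

nodes = [Replica(f"osd-{i}") for i in range(3)]
ok = primary_write(nodes[0], nodes[1:], offset=0, data=b"hello")
print(ok)  # True once all replicas acknowledged
```

Because the client's acknowledgment waits on the slowest replica, write latency is bounded by the network path to the farthest peer, which is why colocating the client with one replica buys little.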


On Jun 6, 2017, at 1:58 PM, Alexis Richardson <alexis@...> wrote:

Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).


On Tue, 6 Jun 2017, 21:39 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:

- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably.
- A pool can be configured to store replicas of data (typically 3x) or erasure-coded chunks (different algorithms and factors are supported).
- When a storage node is lost, the others re-replicate its data (i.e. data maintenance).

Now the failure cases:

- If too many nodes in the same pool are lost within a short window of time, you’ll suffer data loss; for example, when all three nodes holding a 3x replica are lost at the same time.
- If there are not enough resources to replicate/regenerate the data before further losses occur.

To guard against such failures, most systems (including Rook) do the following:

- Storage nodes are spread across failure domains (different hosts, racks, zones, etc.).
- Resources for “data maintenance” are prioritized over resources used for “normal” data operations.

In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones, and that you’ve sized the machines and network to support data maintenance. Prioritization schemes also help; for example, SR-IOV is a popular way to prioritize storage traffic without massive changes to the network.

Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.

> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / attach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in minutes vs. up to an hour with EBS.
>>
>> Also, EBS volumes only support a single writer (ReadWriteOnce in K8S), which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySQL. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany, enabling a hotter failover where the
>> app/service on top can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on top of
>> ephemeral/instance storage vs. EBS. For example, much of the instance
>> storage in most deployments sits unused, and it offers high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1] https://gitter.im/rook/rook?at=58baff6f872fc8ce62b6ee26
>> [2] https://github.com/kubernetes/kubernetes/pull/46843
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this.  I missed the details of that.  Can
>> you say more please?
>>
>> alexis
>>
>>









Re: rook.io

alexis richardson
 

I don't understand what you are doing which is better than what Amazon can do alone.



Re: rook.io

Bassam Tabbara <Bassam.Tabbara@...>
 

Do you mean EBS? For S3, which has 11 9’s of durability (over a year), you’d need tens of Rook storage nodes spread across three availability zones.

For block storage, i.e. the use case in the original thread, you can make do with a smaller number of nodes. You can run with two, but I would recommend you deploy more.


Re: rook.io

alexis richardson
 

So you need a pair of rook nodes to be more reliable than S3?



Re: rook.io

Bassam Tabbara <Bassam.Tabbara@...>
 

Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.

However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.

On Jun 6, 2017, at 2:22 PM, Alexis Richardson <alexis@...> wrote:

Ok so in the single node case you are less reliable than S3?


On Tue, 6 Jun 2017, 22:15 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
No. There are no writes to S3 at all, and that would slow down write significantly.

If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.

On Jun 6, 2017, at 2:13 PM, Alexis Richardson <alexis@...> wrote:

Does the first storage node write to S3 before or after pushing updates to the other nodes


On Tue, 6 Jun 2017, 22:10 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Sure. Assuming a pod running in K8S has already mounted a Rook volume the main flow is:

- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write the primary storage node completes the write operation.

Note there is no distinction between local and remote writes in this flow, they all result in network traffic due to replication. Its possible that one of the storage nodes is colocated with the client initiating the write, but thats inconsequential. 


On Jun 6, 2017, at 1:58 PM, Alexis Richardson <alexis@...> wrote:

Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).


On Tue, 6 Jun 2017, 21:39 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:

- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably
- A pool can be configured to store replicas of data (typically 3x) or erasure coded chunks (different algorithms and factors are supported).
- when a storage node is lost the others help re-replicate the data (i.e. data maintenance).

Now the failure cases:

- if too many nodes in the same pool are lost within a short window of time you’ll suffer data loss, for example, all three nodes in a 3x replica are lost at the same time.
- if there are not enough resources to replicate/regenerate the data before more losses occur.

To guard against such failures, most systems (including Rook) do the following:

- storage nodes are spread across failure domains (different hosts, racks, zones etc.)
- prioritize resources for “data maintenance” over resources used for “normal" data operations.

In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones. Also ensuring that you’ve sized the machines and network to support data maintenance. Prioritization schemes also help, for example, SRV-IO is a popular way to do so without massive changes to the network.

Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.

> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / attach cycle is much faster than with EBS.
>> One of our users compared EBS to Rook [1] and showed that Rook volume
>> failover happened in minutes vs. up to an hour with EBS.
>>
>> Also EBS volumes only support a single writer (ReadWriteOnce in K8S), which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySQL. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany, enabling a hotter failover where the
>> app/service on top can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on top of
>> ephemeral/instance storage vs. EBS. For example, much of the instance
>> storage in most deployments sits unused despite its high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1] https://gitter.im/rook/rook?at=58baff6f872fc8ce62b6ee26
>> [2] https://github.com/kubernetes/kubernetes/pull/46843
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this.  I missed the details of that.  Can
>> you say more please?
>>
>> alexis
>>
>>





Re: rook.io

alexis richardson
 

OK, so in the single-node case you are less reliable than S3?


On Tue, 6 Jun 2017, 22:15 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
No. There are no writes to S3 at all, and that would slow down writes significantly.

If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.

On Jun 6, 2017, at 2:13 PM, Alexis Richardson <alexis@...> wrote:

Does the first storage node write to S3 before or after pushing updates to the other nodes


On Tue, 6 Jun 2017, 22:10 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Sure. Assuming a pod running in K8S has already mounted a Rook volume, the main flow is:

- Writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network.
- One of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume.
- Once the other replicas acknowledge the write, the primary storage node completes the write operation.
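That primary/replica fan-out can be sketched in a few lines: the primary applies the write, forwards it to the other replicas in the pool, and reports completion only once every replica has acknowledged. A toy, sequential model (real systems replicate in parallel, and the class/function names here are invented):

```python
# Toy version of the write path above: primary applies the write, fans it
# out to the replicas, and completes only when all replicas have acked.
# Sequential for clarity; real replication happens in parallel.

class StorageNode:
    def __init__(self, name: str):
        self.name = name
        self.data = {}

    def store(self, key: str, value) -> bool:
        self.data[key] = value
        return True  # acknowledgement

def replicated_write(primary: StorageNode, replicas: list, key: str, value) -> bool:
    primary.store(key, value)
    acks = [r.store(key, value) for r in replicas]  # fan-out to the pool
    return all(acks)  # the write completes only once every replica acks

nodes = [StorageNode(f"osd-{i}") for i in range(3)]
ok = replicated_write(nodes[0], nodes[1:], "blk-0", b"data")
print(ok)                                          # True
print(all("blk-0" in n.data for n in nodes))       # True: every node has it
```

Waiting for all acks before completing is what makes the write durable against the loss of any single node, and it is also why every write costs network traffic regardless of where the client runs.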





Re: rook.io

Bassam Tabbara <Bassam.Tabbara@...>
 

No. There are no writes to S3 at all, and that would slow down writes significantly.

If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.




Re: rook.io

alexis richardson
 

Does the first storage node write to S3 before or after pushing updates to the other nodes




Re: rook.io

Bassam Tabbara <Bassam.Tabbara@...>
 

Sure. Assuming a pod running in K8S has already mounted a Rook volume, the main flow is:

- Writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network.
- One of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume.
- Once the other replicas acknowledge the write, the primary storage node completes the write operation.

Note there is no distinction between local and remote writes in this flow; they all result in network traffic due to replication. It's possible that one of the storage nodes is colocated with the client initiating the write, but that's inconsequential.



Re: rook.io

alexis richardson
 

Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).


