
Re: rook.io

Camille Fournier
 

Presumably it's providing an abstraction layer that can front Amazon, Google, and internal cloud offerings, for those of us who need such a thing.

On Jun 6, 2017 5:55 PM, "Alexis Richardson via cncf-toc" <cncf-toc@...> wrote:

I don't understand what you are doing which is better than what Amazon can do alone.


On Tue, 6 Jun 2017, 22:50 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Do you mean EBS? For S3, which has 11 nines of durability (over a year), you’d need tens of Rook storage nodes spread across three availability zones.

For block storage, i.e. the use case in the original thread, you can make do with a smaller number of nodes. You can run with two, but I would recommend you deploy more.


On Jun 6, 2017, at 2:43 PM, Alexis Richardson <alexis@...> wrote:

So you need a pair of rook nodes to be more reliable than S3?


On Tue, 6 Jun 2017, 22:27 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Definitely. You would need a sizable cluster before you can reach high availability and durability numbers like S3’s.

However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.

On Jun 6, 2017, at 2:22 PM, Alexis Richardson <alexis@...> wrote:

Ok so in the single node case you are less reliable than S3?


On Tue, 6 Jun 2017, 22:15 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
No. There are no writes to S3 at all; that would slow down writes significantly.

If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.

On Jun 6, 2017, at 2:13 PM, Alexis Richardson <alexis@...> wrote:

Does the first storage node write to S3 before or after pushing updates to the other nodes?


On Tue, 6 Jun 2017, 22:10 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Sure. Assuming a pod running in K8S has already mounted a Rook volume, the main flow is:

- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write, the primary storage node completes the write operation

Note there is no distinction between local and remote writes in this flow; they all result in network traffic due to replication. It’s possible that one of the storage nodes is colocated with the client initiating the write, but that’s inconsequential.
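The write path above can be sketched as a toy model. This is a hypothetical illustration, not Rook's or Ceph's actual code; `Replica`, `store`, and `replicate_write` are invented names:

```python
# Toy sketch of the flow above: the primary fans the write out to the
# other replicas in parallel and completes the write only once every
# replica has acknowledged it. All names here are hypothetical.
import concurrent.futures

class Replica:
    def __init__(self):
        self.log = []

    def store(self, data: bytes) -> bool:
        self.log.append(data)  # persist locally, then acknowledge
        return True

def replicate_write(data: bytes, replicas: list) -> bool:
    with concurrent.futures.ThreadPoolExecutor() as pool:
        acks = list(pool.map(lambda r: r.store(data), replicas))
    return all(acks)  # the write completes only after every ack arrives

replicas = [Replica() for _ in range(3)]
assert replicate_write(b"block-0", replicas)
assert all(r.log == [b"block-0"] for r in replicas)
```

The point the sketch makes is the same as the note above: every write fans out over the network regardless of where the client sits.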


On Jun 6, 2017, at 1:58 PM, Alexis Richardson <alexis@...> wrote:

Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (e.g. k8s).


On Tue, 6 Jun 2017, 21:39 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:

- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably.
- A pool can be configured to store replicas of data (typically 3x) or erasure-coded chunks (different algorithms and factors are supported).
- When a storage node is lost, the others help re-replicate the data (i.e. data maintenance).

Now the failure cases:

- If too many nodes in the same pool are lost within a short window of time, you’ll suffer data loss: for example, all three nodes in a 3x replica are lost at the same time.
- If there are not enough resources to replicate/regenerate the data before more losses occur, you’ll also suffer data loss.

To guard against such failures, most systems (including Rook) do the following:

- Storage nodes are spread across failure domains (different hosts, racks, zones, etc.).
- Resources for “data maintenance” are prioritized over resources used for “normal” data operations.

In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones, and that you’ve sized the machines and network to support data maintenance. Prioritization schemes also help; for example, SR-IOV is a popular way to do so without massive changes to the network.
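A back-of-the-envelope calculation shows why replica count and failure-domain independence matter. The numbers below are made up for illustration, and real durability math also depends on repair time, but the shape of the argument holds:

```python
def p_all_replicas_lost(p_node_fail: float, replicas: int) -> float:
    # With replicas placed in independent failure domains, the pool
    # loses data only if every replica fails within the same repair
    # window, so the probabilities multiply.
    return p_node_fail ** replicas

# Assumed (illustrative) chance of losing any one node in a window.
p = 0.01
single = p_all_replicas_lost(p, 1)  # 0.01: one copy, one node loss is fatal
triple = p_all_replicas_lost(p, 3)  # ~1e-06: 3x replication
```

If replicas share a failure domain (same host or rack), the failures are correlated rather than independent and the multiplication no longer applies, which is exactly why the spreading described above matters.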

Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.

> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / attach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in a matter of minutes vs. up to an hour with EBS.
>>
>> Also, EBS volumes only support a single writer (ReadWriteOnce in K8S), which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySQL. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany, to support a hotter failover where the
>> app/service on top can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on top of
>> ephemeral/instance storage vs. EBS. For example, a lot of the instance
>> storage is unused in most deployments, and it offers high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1] https://gitter.im/rook/rook?at=58baff6f872fc8ce62b6ee26
>> [2] https://github.com/kubernetes/kubernetes/pull/46843
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this.  I missed the details of that.  Can
>> you say more please?
>>
>> alexis
>>
>>






_______________________________________________
cncf-toc mailing list
cncf-toc@...
https://lists.cncf.io/mailman/listinfo/cncf-toc



Re: rook.io

alexis richardson
 

I don't understand what you are doing which is better than what Amazon can do alone.


On Tue, 6 Jun 2017, 22:50 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Do you mean EBS? For S3 which has 11 9’s (over a year) of durability you’d need 10s of Rook storage nodes spread across three availability zones.

For block storage, i.e. the use case in the original thread, you can make do with a smaller number of nodes. You can run with two, but I would recommend you deploy more.


On Jun 6, 2017, at 2:43 PM, Alexis Richardson <alexis@...> wrote:

So you need a pair of rook nodes to be more reliable than S3?


On Tue, 6 Jun 2017, 22:27 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.

However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.

On Jun 6, 2017, at 2:22 PM, Alexis Richardson <alexis@...> wrote:

Ok so in the single node case you are less reliable than S3?


On Tue, 6 Jun 2017, 22:15 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
No. There are no writes to S3 at all, and that would slow down write significantly.

If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.

On Jun 6, 2017, at 2:13 PM, Alexis Richardson <alexis@...> wrote:

Does the first storage node write to S3 before or after pushing updates to the other nodes


On Tue, 6 Jun 2017, 22:10 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Sure. Assuming a pod running in K8S has already mounted a Rook volume the main flow is:

- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write the primary storage node completes the write operation.

Note there is no distinction between local and remote writes in this flow, they all result in network traffic due to replication. Its possible that one of the storage nodes is colocated with the client initiating the write, but thats inconsequential. 


On Jun 6, 2017, at 1:58 PM, Alexis Richardson <alexis@...> wrote:

Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).


On Tue, 6 Jun 2017, 21:39 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:

- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably
- A pool can be configured to store replicas of data (typically 3x) or erasure coded chunks (different algorithms and factors are supported).
- when a storage node is lost the others help re-replicate the data (i.e. data maintenance).

Now the failure cases:

- if too many nodes in the same pool are lost within a short window of time you’ll suffer data loss, for example, all three nodes in a 3x replica are lost at the same time.
- if there are not enough resources to replicate/regenerate the data before more losses occur.

To guard against such failures, most systems (including Rook) do the following:

- storage nodes are spread across failure domains (different hosts, racks, zones etc.)
- prioritize resources for “data maintenance” over resources used for “normal" data operations.

In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones. Also ensuring that you’ve sized the machines and network to support data maintenance. Prioritization schemes also help, for example, SRV-IO is a popular way to do so without massive changes to the network.

Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.

> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on-top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / detach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in less than minutes vs. up to an hour with EBS.
>>
>> Also EBS volumes only support a single writer (ReadWriteOnce in K8S) which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySql. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany to support a hotter failover where the
>> app/service ontop can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on-top of
>> ephemeral/instance storage vs. EBS. For example, a lot of the instance
>> storage is unused in most deployments and has a high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1] https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitter.im%2Frook%2Frook%3Fat%3D58baff6f872fc8ce62b6ee26&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=8dSWMXRMqtmPH5goZx4O%2BpVTesQEuS4cb21qgJmmTw0%3D&reserved=0
>> [2] https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fpull%2F46843&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=UGWvqRpP8P0sanBnGygcfwIYiU7tvKobJ7s8JtiWlFw%3D&reserved=0
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this.  I missed the details of that.  Can
>> you say more please?
>>
>> alexis
>>
>>






Re: rook.io

Bassam Tabbara <Bassam.Tabbara@...>
 

Do you mean EBS? For S3 which has 11 9’s (over a year) of durability you’d need 10s of Rook storage nodes spread across three availability zones.

For block storage, i.e. the use case in the original thread, you can make do with a smaller number of nodes. You can run with two, but I would recommend you deploy more.

On Jun 6, 2017, at 2:43 PM, Alexis Richardson <alexis@...> wrote:

So you need a pair of rook nodes to be more reliable than S3?


On Tue, 6 Jun 2017, 22:27 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.

However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.

On Jun 6, 2017, at 2:22 PM, Alexis Richardson <alexis@...> wrote:

Ok so in the single node case you are less reliable than S3?


On Tue, 6 Jun 2017, 22:15 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
No. There are no writes to S3 at all, and that would slow down write significantly.

If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.

On Jun 6, 2017, at 2:13 PM, Alexis Richardson <alexis@...> wrote:

Does the first storage node write to S3 before or after pushing updates to the other nodes


On Tue, 6 Jun 2017, 22:10 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Sure. Assuming a pod running in K8S has already mounted a Rook volume the main flow is:

- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write the primary storage node completes the write operation.

Note there is no distinction between local and remote writes in this flow, they all result in network traffic due to replication. Its possible that one of the storage nodes is colocated with the client initiating the write, but thats inconsequential. 


On Jun 6, 2017, at 1:58 PM, Alexis Richardson <alexis@...> wrote:

Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).


On Tue, 6 Jun 2017, 21:39 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:

- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably
- A pool can be configured to store replicas of data (typically 3x) or erasure coded chunks (different algorithms and factors are supported).
- when a storage node is lost the others help re-replicate the data (i.e. data maintenance).

Now the failure cases:

- if too many nodes in the same pool are lost within a short window of time you’ll suffer data loss, for example, all three nodes in a 3x replica are lost at the same time.
- if there are not enough resources to replicate/regenerate the data before more losses occur.

To guard against such failures, most systems (including Rook) do the following:

- storage nodes are spread across failure domains (different hosts, racks, zones etc.)
- prioritize resources for “data maintenance” over resources used for “normal" data operations.

In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones. Also ensuring that you’ve sized the machines and network to support data maintenance. Prioritization schemes also help, for example, SRV-IO is a popular way to do so without massive changes to the network.

Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.

> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on-top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / detach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in less than minutes vs. up to an hour with EBS.
>>
>> Also EBS volumes only support a single writer (ReadWriteOnce in K8S) which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySql. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany to support a hotter failover where the
>> app/service ontop can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on-top of
>> ephemeral/instance storage vs. EBS. For example, a lot of the instance
>> storage is unused in most deployments and has a high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1] https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitter.im%2Frook%2Frook%3Fat%3D58baff6f872fc8ce62b6ee26&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=8dSWMXRMqtmPH5goZx4O%2BpVTesQEuS4cb21qgJmmTw0%3D&reserved=0
>> [2] https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fpull%2F46843&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=UGWvqRpP8P0sanBnGygcfwIYiU7tvKobJ7s8JtiWlFw%3D&reserved=0
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this.  I missed the details of that.  Can
>> you say more please?
>>
>> alexis
>>
>>






Re: rook.io

alexis richardson
 

So you need a pair of rook nodes to be more reliable than S3?


On Tue, 6 Jun 2017, 22:27 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.

However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.

On Jun 6, 2017, at 2:22 PM, Alexis Richardson <alexis@...> wrote:

Ok so in the single node case you are less reliable than S3?


On Tue, 6 Jun 2017, 22:15 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
No. There are no writes to S3 at all, and that would slow down write significantly.

If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.

On Jun 6, 2017, at 2:13 PM, Alexis Richardson <alexis@...> wrote:

Does the first storage node write to S3 before or after pushing updates to the other nodes


On Tue, 6 Jun 2017, 22:10 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Sure. Assuming a pod running in K8S has already mounted a Rook volume the main flow is:

- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write the primary storage node completes the write operation.

Note there is no distinction between local and remote writes in this flow, they all result in network traffic due to replication. Its possible that one of the storage nodes is colocated with the client initiating the write, but thats inconsequential. 


On Jun 6, 2017, at 1:58 PM, Alexis Richardson <alexis@...> wrote:

Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).


On Tue, 6 Jun 2017, 21:39 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:

- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably
- A pool can be configured to store replicas of data (typically 3x) or erasure coded chunks (different algorithms and factors are supported).
- when a storage node is lost the others help re-replicate the data (i.e. data maintenance).

Now the failure cases:

- if too many nodes in the same pool are lost within a short window of time you’ll suffer data loss, for example, all three nodes in a 3x replica are lost at the same time.
- if there are not enough resources to replicate/regenerate the data before more losses occur.

To guard against such failures, most systems (including Rook) do the following:

- storage nodes are spread across failure domains (different hosts, racks, zones etc.)
- prioritize resources for “data maintenance” over resources used for “normal" data operations.

In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones. Also ensuring that you’ve sized the machines and network to support data maintenance. Prioritization schemes also help, for example, SRV-IO is a popular way to do so without massive changes to the network.

Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.

> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>
> On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
> <Bassam.Tabbara@...> wrote:
>> Alexis,
>>
>> Thanks! We joined the Storage WG and will work with Ben on CSI and future
>> projects.
>>
>> The use case was running Rook Block storage on-top of ephemeral/instance
>> storage on EC2 instances vs. using EBS storage. Rook would handle the
>> replication of data across instances and stripe across them for performance.
>> Pods in the cluster would see this like any other volume.
>>
>> For Pod failover, the detach / detach cycle is much faster than EBS. One of
>> our users compared EBS to Rook [1] and showed that Rook volume failover
>> happened in less than minutes vs. up to an hour with EBS.
>>
>> Also EBS volumes only support a single writer (ReadWriteOnce in K8S) which
>> makes them a poor candidate for hot failover scenarios underneath, say,
>> Postgres or MySql. With the work we’re doing on the Rook Volume Plugin [2]
>> we plan to support ReadWriteMany to support a hotter failover where the
>> app/service ontop can handle the fencing.
>>
>> Finally, there are cost and performance tradeoffs for running on-top of
>> ephemeral/instance storage vs. EBS. For example, a lot of the instance
>> storage is unused in most deployments and has a high performance.
>>
>> Happy to discuss in more detail.
>>
>> Thanks!
>> Bassam
>>
>> [1] https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitter.im%2Frook%2Frook%3Fat%3D58baff6f872fc8ce62b6ee26&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=8dSWMXRMqtmPH5goZx4O%2BpVTesQEuS4cb21qgJmmTw0%3D&reserved=0
>> [2] https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fpull%2F46843&data=02%7C01%7CBassam.Tabbara%40quantum.com%7Cac58cc3d749e453fda4e08d4ad178f30%7C322a135f14fb4d72aede122272134ae0%7C1%7C0%7C636323764076275976&sdata=UGWvqRpP8P0sanBnGygcfwIYiU7tvKobJ7s8JtiWlFw%3D&reserved=0
>>
>>
>> On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:
>>
>> Bassam
>>
>> It would be good for Rook team to join Storage WG, if you haven't done so
>> yet.
>>
>> QQ: you said that k8s use cases that run on EBS have high failover
>> times & that you can improve this.  I missed the details of that.  Can
>> you say more please?
>>
>> alexis
>>
>>





Re: rook.io

Bassam Tabbara <Bassam.Tabbara@...>
 

Definitely. You would need to have a sizable cluster before you can reach high availability and durability numbers like S3.

However, EBS volumes supposedly have an annual failure rate of 0.1-0.2%, and are rumored to run on a pair of servers for each volume.

On Jun 6, 2017, at 2:22 PM, Alexis Richardson <alexis@...> wrote:

Ok so in the single node case you are less reliable than S3?


On Tue, 6 Jun 2017, 22:15 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
No. There are no writes to S3 at all, and that would slow down write significantly.

If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.

On Jun 6, 2017, at 2:13 PM, Alexis Richardson <alexis@...> wrote:

Does the first storage node write to S3 before or after pushing updates to the other nodes


On Tue, 6 Jun 2017, 22:10 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
Sure. Assuming a pod running in K8S has already mounted a Rook volume the main flow is:

- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it to 2-3 other storage nodes (in parallel) in the same pool backing the volume
- once the other replicas acknowledge the write the primary storage node completes the write operation.

Note there is no distinction between local and remote writes in this flow, they all result in network traffic due to replication. Its possible that one of the storage nodes is colocated with the client initiating the write, but thats inconsequential. 


On Jun 6, 2017, at 1:58 PM, Alexis Richardson <alexis@...> wrote:

Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (eg k8s).


On Tue, 6 Jun 2017, 21:39 Bassam Tabbara, <Bassam.Tabbara@...> wrote:
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:

- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably
- A pool can be configured to store replicas of data (typically 3x) or erasure coded chunks (different algorithms and factors are supported).
- when a storage node is lost the others help re-replicate the data (i.e. data maintenance).

Now the failure cases:

- if too many nodes in the same pool are lost within a short window of time you’ll suffer data loss, for example, all three nodes in a 3x replica are lost at the same time.
- if there are not enough resources to replicate/regenerate the data before more losses occur.

To guard against such failures, most systems (including Rook) do the following:

- storage nodes are spread across failure domains (different hosts, racks, zones etc.)
- prioritize resources for “data maintenance” over resources used for “normal" data operations.

In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones. Also ensuring that you’ve sized the machines and network to support data maintenance. Prioritization schemes also help, for example, SRV-IO is a popular way to do so without massive changes to the network.

Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.

> On Jun 6, 2017, at 1:06 PM, Alexis Richardson <alexis@...> wrote:
>
> What are the failure cases for this ?
>





Re: rook.io

alexis richardson
 

Ok so in the single node case you are less reliable than S3?






Re: rook.io

Bassam Tabbara <Bassam.Tabbara@...>
 

No. There are no writes to S3 at all, and that would slow down writes significantly.

If you’re concerned about DR then you can take a crash-consistent snapshot of the volume(s) and send that off to S3, just as with EBS snapshots.





Re: rook.io

alexis richardson
 

Does the first storage node write to S3 before or after pushing updates to the other nodes?





Re: rook.io

Bassam Tabbara <Bassam.Tabbara@...>
 

Sure. Assuming a pod running in K8S has already mounted a Rook volume, the main flow is:

- writes to the volume (say /dev/rbd0) go through the kernel module and then out on the network
- one of the storage nodes receives the data and replicates it (in parallel) to 2-3 other storage nodes in the same pool backing the volume
- once the other replicas acknowledge the write, the primary storage node completes the write operation

Note there is no distinction between local and remote writes in this flow; they all result in network traffic due to replication. It’s possible that one of the storage nodes is colocated with the client initiating the write, but that’s inconsequential.
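
The three-step flow above can be sketched as a toy simulation (hypothetical names; the real data path is the Ceph/RBD kernel module and storage daemons, not Python):

```python
import concurrent.futures

class Replica:
    """Toy stand-in for a storage node holding one copy of the data."""
    def __init__(self):
        self.blocks = {}

    def store(self, offset, data):
        self.blocks[offset] = data
        return True  # acknowledge the write

def primary_write(primary, peers, offset, data):
    """Primary persists locally, replicates to its peers in parallel,
    and completes the write only once every replica has acknowledged."""
    primary.store(offset, data)
    with concurrent.futures.ThreadPoolExecutor() as pool:
        acks = list(pool.map(lambda r: r.store(offset, data), peers))
    return all(acks)  # completion waits for all replica acks

primary, peers = Replica(), [Replica(), Replica()]
assert primary_write(primary, peers, 0, b"hello")
```

The point of the sketch is that the client sees one write, while the primary fans it out and blocks on the slowest replica ack, which is why local vs. remote placement of the client barely matters.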




Re: rook.io

alexis richardson
 

Please can you talk through the interaction flow for a local plus remote disk write on AWS, assuming the write is initiated by a process associated with a container cluster (e.g. k8s).




Re: rook.io

Bassam Tabbara <Bassam.Tabbara@...>
 

The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:

- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably.
- A pool can be configured to store replicas of data (typically 3x) or erasure-coded chunks (different algorithms and factors are supported).
- When a storage node is lost, the others help re-replicate the data (i.e. data maintenance).

Now the failure cases:

- If too many nodes in the same pool are lost within a short window of time, you’ll suffer data loss; for example, all three nodes in a 3x replica set are lost at the same time.
- If there are not enough resources to replicate/regenerate the data before more losses occur.
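
A back-of-envelope estimate makes these two failure cases concrete. The numbers below (per-node annual failure rate, re-replication window) are illustrative assumptions, not Rook measurements:

```python
# Rough estimate of the "all replicas lost before re-replication" case.
afr = 0.02            # assumed annual failure rate per storage node
repair_hours = 1.0    # assumed time to re-replicate after a node loss
hours_per_year = 24 * 365

# Probability a given node fails during any one repair window.
p_window = afr * repair_hours / hours_per_year

# With 3x replication, data is lost only if both surviving replicas
# also fail before re-replication completes.
p_loss_given_failure = p_window ** 2

# Expected data-loss events per replica set per year.
expected_loss = afr * p_loss_given_failure
print(f"{expected_loss:.3e} expected loss events/year")
```

Note how sensitive the result is to `repair_hours`: if re-replication is starved of resources and the window stretches from hours to days, the loss probability grows quadratically, which is exactly why data maintenance is prioritized below.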

To guard against such failures, most systems (including Rook) do the following:

- storage nodes are spread across failure domains (different hosts, racks, zones, etc.)
- resources for “data maintenance” are prioritized over resources for “normal” data operations

In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones, and that you’ve sized the machines and network to support data maintenance. Prioritization schemes also help; for example, SR-IOV is a popular way to prioritize traffic without massive changes to the network.

Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.



Re: rook.io

alexis richardson
 

What are the failure cases for this?

On Tue, Jun 6, 2017 at 5:41 PM, Bassam Tabbara
<Bassam.Tabbara@...> wrote:
Alexis,

Thanks! We joined the Storage WG and will work with Ben on CSI and future
projects.

The use case was running Rook Block storage on top of ephemeral/instance
storage on EC2 instances vs. using EBS storage. Rook would handle the
replication of data across instances and stripe across them for performance.
Pods in the cluster would see this like any other volume.

For Pod failover, the detach / attach cycle is much faster than with EBS. One of
our users compared EBS to Rook [1] and showed that Rook volume failover
happened in minutes vs. up to an hour with EBS.

Also, EBS volumes only support a single writer (ReadWriteOnce in K8S), which
makes them a poor candidate for hot failover scenarios underneath, say,
Postgres or MySQL. With the work we’re doing on the Rook Volume Plugin [2]
we plan to support ReadWriteMany to enable a hotter failover where the
app/service on top can handle the fencing.

Finally, there are cost and performance tradeoffs for running on top of
ephemeral/instance storage vs. EBS. For example, much of the instance
storage in most deployments goes unused despite its high performance.

Happy to discuss in more detail.

Thanks!
Bassam

[1] https://gitter.im/rook/rook?at=58baff6f872fc8ce62b6ee26
[2] https://github.com/kubernetes/kubernetes/pull/46843


On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:

Bassam

It would be good for the Rook team to join the Storage WG, if you haven't done
so yet.

QQ: you said that k8s use cases that run on EBS have high failover
times & that you can improve this. I missed the details of that. Can
you say more please?

alexis


Re: Infrakit Questions

Ihor Dvoretskyi
 

+1 to Rob's initiative on the demo.

Rob and Digital Rebar are doing valuable work in the Kubernetes community - it would be useful to share it across the whole CNCF community.


On Tue, Jun 6, 2017 at 7:04 PM Rob Hirschfeld via cncf-toc <cncf-toc@...> wrote:
Alexis,


For InfraKit specifically, I'm interested in where this fits alongside or replaces Docker Machine.  There seem to be elements of Docker Machine in the design.

Rob

Rob
____________________________
Rob Hirschfeld, 512-773-7522
RackN CEO/Founder (rob@...)

I am in CENTRAL (-6) time
http://robhirschfeld.com
twitter: @zehicle, github: zehicle

On Tue, Jun 6, 2017 at 8:41 AM, Alexis Richardson <alexis@...> wrote:
Thanks David, Patrick et al., for Infrakit pres today!

https://docs.google.com/presentation/d/1Lzy94UNzdSXkqZCvrwjkcChKpU8u2waDqGx_Sjy5eJ8/edit#slide=id.g22ccd21963_2_0


Per Bryan's Q re Terraform, it would also be good to hear about BOSH &
Infrakit feature comparison.  And other related tech you see in the
space.
_______________________________________________
cncf-toc mailing list
cncf-toc@...
https://lists.cncf.io/mailman/listinfo/cncf-toc


Re: rook.io

Bassam Tabbara <Bassam.Tabbara@...>
 

Alexis,

Thanks! We joined the Storage WG and will work with Ben on CSI and future projects.

The use case was running Rook Block storage on top of ephemeral/instance storage on EC2 instances vs. using EBS storage. Rook would handle the replication of data across instances and stripe across them for performance. Pods in the cluster would see this like any other volume.
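To make the "like any other volume" point concrete, here is a hypothetical sketch of how a pod could consume Rook block storage through a PersistentVolumeClaim. The StorageClass name `rook-block` and the image/paths are illustrative assumptions, not the exact Rook API of the time:

```yaml
# Hypothetical example: a PVC provisioned by an assumed "rook-block"
# StorageClass, mounted by a pod exactly like any other volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rook-data
spec:
  storageClassName: rook-block   # assumed StorageClass name
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: nginx              # illustrative workload
      volumeMounts:
        - name: data
          mountPath: /var/lib/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: rook-data
```

The pod spec is unaware that the volume is replicated and striped across instance storage; that is handled underneath by the provisioner.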

For Pod failover, the attach / detach cycle is much faster than EBS. One of our users compared EBS to Rook [1] and showed that Rook volume failover took minutes vs. up to an hour with EBS.

Also, EBS volumes only support a single writer (ReadWriteOnce in K8S), which makes them a poor candidate for hot failover scenarios underneath, say, Postgres or MySQL. With the work we’re doing on the Rook Volume Plugin [2] we plan to support ReadWriteMany, enabling a hotter failover where the app/service on top can handle the fencing.
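The access mode in question is just the claim's `accessModes` field. A shared claim under the planned ReadWriteMany support (names hypothetical) would look like this, with the caveat stated above that the application itself must handle fencing:

```yaml
# Hypothetical: a shared claim that two pods could mount simultaneously
# once ReadWriteMany is supported; the app/service must handle fencing.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-shared
spec:
  storageClassName: rook-block   # assumed StorageClass name
  accessModes:
    - ReadWriteMany              # vs. EBS's ReadWriteOnce limit
  resources:
    requests:
      storage: 50Gi
```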

Finally, there are cost and performance tradeoffs for running on top of ephemeral/instance storage vs. EBS. For example, much of the instance storage in most deployments sits unused, and it offers high performance.

Happy to discuss in more detail.

Thanks!
Bassam



On Jun 6, 2017, at 9:03 AM, Alexis Richardson <alexis@...> wrote:

Bassam

It would be good for the Rook team to join the Storage WG, if you haven't done so yet.

QQ: you said that k8s use cases that run on EBS have high failover
times & that you can improve this.  I missed the details of that.  Can
you say more please?

alexis


Re: Infrakit Questions

Rob Hirschfeld
 

Alexis,


For InfraKit specifically, I'm interested in where this fits alongside or replaces Docker Machine.  There seem to be elements of Docker Machine in the design.

Rob

Rob
____________________________
Rob Hirschfeld, 512-773-7522
RackN CEO/Founder (rob@...)

I am in CENTRAL (-6) time
http://robhirschfeld.com
twitter: @zehicle, github: zehicle

On Tue, Jun 6, 2017 at 8:41 AM, Alexis Richardson <alexis@...> wrote:
Thanks David, Patrick et al., for Infrakit pres today!

https://docs.google.com/presentation/d/1Lzy94UNzdSXkqZCvrwjkcChKpU8u2waDqGx_Sjy5eJ8/edit#slide=id.g22ccd21963_2_0


Per Bryan's Q re Terraform, it would also be good to hear about BOSH &
Infrakit feature comparison.  And other related tech you see in the
space.


rook.io

alexis richardson
 

Bassam

It would be good for the Rook team to join the Storage WG, if you haven't done so yet.

QQ: you said that k8s use cases that run on EBS have high failover
times & that you can improve this. I missed the details of that. Can
you say more please?

alexis


Re: Infrakit Questions

alexis richardson
 

Rob

That would be interesting & could also be good material for the CNCF website / blog.

a


On Tue, Jun 6, 2017 at 8:51 AM, Rob Hirschfeld <rob@...> wrote:
All,

I'd be happy to present / demo Digital Rebar to provide another cloud native perspective on how to address hybrid infrastructure automation.  I believe that would offer a helpful perspective on operational concerns and how to address them in a way that fits the CNCF community.  As you know, we've been heavily involved in the Kubernetes community and have been showing an approach that uses the community Ansible for Kubernetes.  We've also done demos showing LinuxKit integration.

Rob

Rob
____________________________
Rob Hirschfeld, 512-773-7522
RackN CEO/Founder (rob@...)

I am in CENTRAL (-6) time
http://robhirschfeld.com
twitter: @zehicle, github: zehicle

On Tue, Jun 6, 2017 at 8:41 AM, Alexis Richardson <alexis@...> wrote:
Thanks David, Patrick et al., for Infrakit pres today!

https://docs.google.com/presentation/d/1Lzy94UNzdSXkqZCvrwjkcChKpU8u2waDqGx_Sjy5eJ8/edit#slide=id.g22ccd21963_2_0


Per Bryan's Q re Terraform, it would also be good to hear about BOSH &
Infrakit feature comparison.  And other related tech you see in the
space.



Re: Infrakit Questions

Alex Baretto
 

+1 to Alexis and Rob.

I'd really like to see a good breakdown comparison between InfraKit and Digital Rebar, BOSH, CloudFormation, Fog, and others.

Alex Baretto



On Tue, Jun 06, 2017 at 08:51 Rob Hirschfeld via cncf-toc <cncf-toc@...> wrote:
All,

I'd be happy to present / demo Digital Rebar to provide another cloud native perspective on how to address hybrid infrastructure automation.  I believe that would offer a helpful perspective on operational concerns and how to address them in a way that fits the CNCF community.  As you know, we've been heavily involved in the Kubernetes community and have been showing an approach that uses the community Ansible for Kubernetes.  We've also done demos showing LinuxKit integration.

Rob

_______________________________________________
cncf-toc mailing list
cncf-toc@...
https://lists.cncf.io/mailman/listinfo/cncf-toc


Rob
____________________________
Rob Hirschfeld, 512-773-7522
RackN CEO/Founder (rob@...)

I am in CENTRAL (-6) time
http://robhirschfeld.com
twitter: @zehicle, github: zehicle

On Tue, Jun 6, 2017 at 8:41 AM, Alexis Richardson <alexis@...> wrote:
Thanks David, Patrick et al., for Infrakit pres today!

https://docs.google.com/presentation/d/1Lzy94UNzdSXkqZCvrwjkcChKpU8u2waDqGx_Sjy5eJ8/edit#slide=id.g22ccd21963_2_0


Per Bryan's Q re Terraform, it would also be good to hear about BOSH &
Infrakit feature comparison.  And other related tech you see in the
space.



Re: Infrakit Questions

Rob Hirschfeld
 

All,

I'd be happy to present / demo Digital Rebar to provide another cloud native perspective on how to address hybrid infrastructure automation.  I believe that would offer a helpful perspective on operational concerns and how to address them in a way that fits the CNCF community.  As you know, we've been heavily involved in the Kubernetes community and have been showing an approach that uses the community Ansible for Kubernetes.  We've also done demos showing LinuxKit integration.

Rob

Rob
____________________________
Rob Hirschfeld, 512-773-7522
RackN CEO/Founder (rob@...)

I am in CENTRAL (-6) time
http://robhirschfeld.com
twitter: @zehicle, github: zehicle

On Tue, Jun 6, 2017 at 8:41 AM, Alexis Richardson <alexis@...> wrote:
Thanks David, Patrick et al., for Infrakit pres today!

https://docs.google.com/presentation/d/1Lzy94UNzdSXkqZCvrwjkcChKpU8u2waDqGx_Sjy5eJ8/edit#slide=id.g22ccd21963_2_0


Per Bryan's Q re Terraform, it would also be good to hear about BOSH &
Infrakit feature comparison.  And other related tech you see in the
space.


Continued InfraKit Discussion

Chris Aniszczyk
 

From today's CNCF TOC call, there was some discussion on how InfraKit compares to Terraform, BOSH and Digital Rebar. Thanks again to David for taking the time to present.

Let's use this thread to have that discussion.

--
Chris Aniszczyk (@cra) | +1-512-961-6719
