Re: rook.io
Bassam Tabbara <Bassam.Tabbara@...>
The most severe failure cases are ones that could lead to permanent data loss. There are a few, but first some background:
- Volumes are backed by virtual storage pools.
- Virtual pools are made up of a number of storage nodes that work together to ensure that data is stored reliably.
- A pool can be configured to store replicas of data (typically 3x) or erasure-coded chunks (different algorithms and factors are supported); see the pool sketch below.
- When a storage node is lost, the others help re-replicate the data (i.e., data maintenance).

Now the failure cases:

- If too many nodes in the same pool are lost within a short window of time, you'll suffer data loss. For example, all three nodes holding a 3x replica are lost at the same time.
- If there are not enough resources to replicate/regenerate the data before more losses occur.

To guard against such failures, most systems (including Rook) do the following:

- Spread storage nodes across failure domains (different hosts, racks, zones, etc.).
- Prioritize resources for "data maintenance" over resources used for "normal" data operations.

In the context of running Rook in AWS, this means ensuring that the Rook storage pods are spread across the nodes in the cluster and across availability zones (see the scheduling sketch below), and ensuring that you've sized the machines and network to support data maintenance. Prioritization schemes also help; for example, SR-IOV is a popular way to prioritize traffic without massive changes to the network.

Finally, this is a good example of why building a control plane that can automate such decisions/tradeoffs helps ensure success with storage.
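For the record, here is roughly what a pool definition looks like as a Kubernetes resource. Field names have shifted across Rook releases, so treat this as an illustrative sketch (the CephBlockPool form) rather than something version-exact:

```yaml
# Replicated pool: every object stored 3x, with copies spread across hosts.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host      # place replicas on distinct hosts
  replicated:
    size: 3                # 3 copies; survives the loss of any 2
---
# Erasure-coded pool: 2 data chunks + 1 coding chunk per object (1.5x overhead).
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ecpool
  namespace: rook-ceph
spec:
  failureDomain: host
  erasureCoded:
    dataChunks: 2
    codingChunks: 1        # survives the loss of any 1 chunk
```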
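The tradeoff between replication and erasure coding comes down to simple arithmetic: how many simultaneous losses a pool tolerates versus how much raw space it consumes. A quick sketch (plain Python, no Rook APIs involved):

```python
# Fault tolerance and storage overhead for the two pool types above.

def replicated(size: int) -> dict:
    """n-way replication: tolerates size-1 lost copies, costs size x raw space."""
    return {"tolerates": size - 1, "overhead": float(size)}

def erasure_coded(data_chunks: int, coding_chunks: int) -> dict:
    """k+m erasure coding: tolerates m lost chunks, costs (k+m)/k x raw space."""
    return {
        "tolerates": coding_chunks,
        "overhead": (data_chunks + coding_chunks) / data_chunks,
    }

if __name__ == "__main__":
    print("3x replica:", replicated(3))        # tolerates 2 losses, 3.0x overhead
    print("EC 4+2:    ", erasure_coded(4, 2))  # tolerates 2 losses, 1.5x overhead
```

Note that equal fault tolerance does not mean equal recovery cost: regenerating a lost erasure-coded chunk requires reading from the surviving data chunks, which is part of why sizing the network for data maintenance matters.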
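And for the spreading point: in Kubernetes this is expressed with pod anti-affinity. A sketch with made-up pod and label names; the zone label shown is the one Kubernetes applied to AWS nodes at the time (topology.kubernetes.io/zone today):

```yaml
# Schedule storage pods so no two share a host, and prefer spreading across AZs.
apiVersion: v1
kind: Pod
metadata:
  name: storage-node
  labels:
    app: rook-storage            # hypothetical label for the storage pods
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: rook-storage
        topologyKey: kubernetes.io/hostname   # at most one storage pod per host
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: rook-storage
          topologyKey: failure-domain.beta.kubernetes.io/zone  # prefer distinct AZs
  containers:
  - name: storage
    image: rook/rook:v0.5.0      # illustrative image tag
```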