Volunteer data archival
Volunteer data archival means using disk space on volunteered home computers to store large data files. This document describes the design of a VDAB, a system to provide volunteer data archival on BOINC. The goals of VDAB include:
- Storing large (e.g. petabyte) files. Files may be thousands of times larger than the amount of space available on individual computers.
- Store files are long periods.
- Be able to reduce the probability of data loss to arbitrarily small levels.
Properties of the volunteer host population include:
- A host may be sporadically available because it is turned off, or because the user has suspended network activity. Unavailable periods may range from minutes to several days.
- The upload and download speeds of hosts vary widely, and can be fairly low (e.g. 1 Mbps) in some cases.
- The amount of disk space available to a project on a given host may fluctuate over time, because of the user's own disk usage or disk usage by other BOINC projects to which the host is attached.
- The population is dynamic: hosts are constantly arriving and leaving. The mean lifetime of a host may be fairly small (on the order of 100 days).
- Many hosts are behind firewalls. We assume that all communication is initiated by the BOINC client, and involves HTTP requests to trusted project servers. We don't consider direct client-to-client communication.
Recovering from the failure of a host, using techniques like replication, involves uploading data from a 2nd host, then downloading it to a 3rd host. Each of these steps may take days. Thus, for volunteer storage the ratio
average time to failure / average time to recover
may be fairly small (like 100). In other distributed storage systems (such as RAIDs) this ratio may be on the order of 100,000. Thus, these systems can modeled as a sequence of individual failures and recoveries.
Volunteer data archival, on the other hand, must be modeled as process in which multiple recoveries may be in progress at the same time, and new failures may occur during these recoveries.
The need for server storage
Initially a file is stored in its entirety on the server. It is downloaded to volunteer hosts. Eventually it is retrieved, i.e. uploaded to the server again, and perhaps deleted from volunteer hosts.
However, server storage must be used even while the file is being stored on volunteer hosts. This is because the mechanisms to handle host failures (see below) involve uploading parts of the file to the server, then downloading them to other hosts.
One of the goals of VDAB is to minimize the average amount of server storage required to maintain reliability.
Also, note that data recovery uses network bandwidth. It's conceivable that the capacity of the system is limited not by client disk space, but by network bandwidth at the server.
There are two basic techniques for achieving reliable storage using unreliable resources:
With this technique, a file is divided into N chunks, and each chunk is stored on M hosts. If a replica is lost, and there another replica, that replica is uploaded to the server, then downloaded to another host. By increasing M, reliability can be made arbitrarily high.
Replication has advantages:
- Recovery from a failure is fast, since only one upload and download is done. This minimizes the chances of another failure occurring during recovery.
- By making N large, the server storage needed for a recovery can be made arbitrarily small.
- It has an extremely high space overhead, since M in general must be made large to provide reliability.
- Even if individual chunks are made reliabile, the failure rate for the file as a whole increases exponentially with N
With Reed-Solomon coding, a file is divided into N 'packets', and an additional K checksum packets are generated. The original data can be reconstructed from any N of these N+K packets.
Coding has advantages:
- It can provide high reliability without high space overhead. For example, if N=40 and K=20, we can tolerate 20 simultaneous host failures with a space overhead of only 50%; with replication the overhead would be 2000%.
- Regenerating a chunk requires reassembling the entire file on the server, imposing a high storage and network communication overhead.
Hybrid reliability mechanisms
Because of the above disadvantages, neither replication nor coding alone is sufficient for volunteer data archival. However, we can combine them in various ways that reduce the disadvantages.
One way to reduce the reconstruction overhead of coding is to divide the file into M parts, and encode each part separately. That way, if a packet is lost, only 1/M of the file needs to be reconstructed on the server.
However, if one of these M parts is lost, the file is lost. To remedy this, we can use coding at the top level as well: in addition to the M parts, generate an additional K "checksum parts", and encode these parts in the same way.
If we use this 2-level encoding scheme with parameters M=40 and N=20, we can recover from any 400 simultaneous host failures, with a space overhead of 125%.
The scheme can be extended to any number of levels of encoding.
Coding plus replication
To achieve high reliability, we need to use fairly large values of coding's N and K parameters, like 10-50. This means that recovering from a packet loss requires uploading and downloading 10-50 packets, which is a large overhead.
We can potentially use replication at the bottom level to reduce this overhead. Suppose, for example, that we use 2-fold replication for the bottom-level packets of multi-level encoding. Then, in many (and maybe even almost all) cases we'll just have to do 1 upload and 2 download to restore the packet. Although this doubles the client storage requirement, it could potentially increase system capacity by reducing network bandwidth at the server.
The VDAB simulator
We have developed a simulator for VDAB. The simulator models a set of hosts. The parameters of the host population include:
- Arrival rate of hosts
- Distribution of host lifetimes (currently exponential, with adjustable mean)
- Distribution of upload and download network bandwidth
- Distribution of amount of free disk space
The simulator models the arrival of one or more files, each with a given size.
The simulator is able to model the following storage policies:
- M-level coding
- Different values of N and K at each level of coding
- R-fold replication at the bottom level
The simulator outputs:
- statistics of server disk space usage
- statistics of network bandwidth usage
- statistics of "vulnerability": how many host failures would be needed to cause the loss of each file.