Ha, "replicate" is certainly too strong a word. As I wrote in the proposal, and later when I clarified things, I chose to compare how `sia_vbd` structures data to how Git works because I expect most technical people, especially developers, to be familiar with Git. This should help convey the concept more clearly. A commit in Git is very similar to what I referred to as `State` earlier. I even thought about naming it `Commit`, but felt that might lead to confusion. I'm still on the lookout for a good name to capture this concept.
Just like a Git commit refers to a specific state of the repository, a `sia_vbd` `State` refers to a specific state of the virtual block device, complete with an ID and everything. It offers similar benefits, such as deduplication, snapshotting (tagging), and branching. Importantly, nothing is overwritten; we just clean up unused data later during garbage collection, which is ideal for how Sia's object storage works.
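To make the analogy a bit more concrete, here is a rough sketch of what such a commit-like `State` could look like. The type and field names are my own, purely for illustration; they are not `sia_vbd`'s actual data structures.

```rust
/// 32-byte, content-derived identifier (e.g. a Blake3 hash or Merkle root).
type Id = [u8; 32];

/// An immutable, content-addressed snapshot of the whole virtual block
/// device, playing the same role a commit plays in Git.
struct State {
    /// Derived from the content below, much like a commit hash.
    id: Id,
    /// The `State` this one was derived from, like a commit's parent.
    /// Two `State`s sharing the same parent is effectively a branch.
    parent: Option<Id>,
    /// Ordered references to the data making up the device at this point.
    clusters: Vec<Id>,
}

/// Snapshots ("tags") are then just named pointers to a `State`,
/// exactly like Git tags pointing at commits.
struct Tag {
    name: String,
    state: Id,
}
```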
Being aware of where data is located and ensuring quick access when needed is one of the core competencies of `sia_vbd`. For example, a block could be in any of the following places (a rough sketch follows the list):
- Already buffered in our heap, ready to use
- In the local disk cache
- In the local write-ahead log (WAL)
- Packed in one or more chunks, accessible via `renterd`
- Nonexistent
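Internally, this kind of knowledge could be modeled roughly like the enum below. The names and variants are illustrative assumptions on my part, not `sia_vbd`'s actual types.

```rust
use std::path::PathBuf;

/// Where a given block can currently be found, ordered roughly from
/// cheapest to most expensive to access.
enum BlockLocation {
    /// Already sitting in memory, ready to serve.
    Buffered,
    /// Present in the local on-disk cache.
    DiskCache { path: PathBuf },
    /// Recorded in the local write-ahead log but not yet uploaded.
    Wal { offset: u64 },
    /// Packed into one or more chunks stored as objects via renterd.
    Chunked { chunk_ids: Vec<[u8; 32]> },
    /// Never written; reads are served as all zeroes.
    Nonexistent,
}
```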
`sia_vbd` sees the big picture and does its best to retrieve blocks in the most efficient manner possible. This fairly comprehensive understanding is crucial for handling the biggest challenges in making `sia_vbd` usable in practice. Let me explain further:
A naive approach could look like this:
Reading:
- A read request comes in to read 1200 bytes at offset 47492749221.
- We calculate the block number(s) and relative offset(s), then request the data directly from `renterd` (the offset math is sketched after this list).
- The data is streamed directly to the `nbd` client.
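For illustration, here is the offset math from the second step, assuming the 256 KiB block size used in the scaling examples further down (the actual block size is a design parameter, not a given):

```rust
const BLOCK_SIZE: u64 = 256 * 1024; // assumed block size: 256 KiB

/// Maps a position on the virtual device to (block number, offset within
/// that block, number of requested bytes that fall into that block).
fn locate(offset: u64, len: u64) -> (u64, u64, u64) {
    let block = offset / BLOCK_SIZE;
    let rel = offset % BLOCK_SIZE;
    let in_block = len.min(BLOCK_SIZE - rel);
    (block, rel, in_block)
}

fn main() {
    // The read request from the example: 1200 bytes at offset 47492749221.
    let (block, rel, n) = locate(47_492_749_221, 1200);
    println!("block {block}, relative offset {rel}, {n} bytes in this block");
    // -> block 181170, relative offset 120741, 1200 bytes in this block
}
```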
Writing:
- A write request is received to write [0x4e, 0xab, 0x01 …] to offset 47492749221.
- We calculate the affected block(s) and download the associated object(s) from `renterd`.
- The block(s) are modified based on the data from the write request.
- We delete the object(s) downloaded in step 2 via `renterd`'s API, as they are now outdated.
- The new block(s) are uploaded and stored as new object(s) with the same name as the ones we just deleted (see the sketch after this list).
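And a deliberately naive sketch of the write path just described. `ObjectStore` is a hypothetical stand-in for renterd's object API (not a real interface), and the per-block object naming is made up; the point is simply to make the read-modify-write overhead visible: even a 3-byte write downloads, deletes, and re-uploads a whole block.

```rust
const BLOCK_SIZE: u64 = 256 * 1024; // assumed block size: 256 KiB

/// Hypothetical stand-in for renterd's object API.
trait ObjectStore {
    fn download(&self, name: &str) -> std::io::Result<Vec<u8>>;
    fn delete(&self, name: &str) -> std::io::Result<()>;
    fn upload(&self, name: &str, data: &[u8]) -> std::io::Result<()>;
}

fn naive_write(store: &dyn ObjectStore, offset: u64, data: &[u8]) -> std::io::Result<()> {
    let mut remaining = data;
    let mut pos = offset;
    while !remaining.is_empty() {
        // Steps 1 + 2: work out the affected block and download the whole object.
        let block_no = pos / BLOCK_SIZE;
        let rel = (pos % BLOCK_SIZE) as usize;
        let name = format!("block-{block_no}");
        let mut block = store.download(&name)?;

        // Step 3: patch in the new bytes.
        let n = remaining.len().min(BLOCK_SIZE as usize - rel);
        block[rel..rel + n].copy_from_slice(&remaining[..n]);

        // Steps 4 + 5: delete the outdated object and upload the replacement
        // under the same name.
        store.delete(&name)?;
        store.upload(&name, &block)?;

        remaining = &remaining[n..];
        pos += n as u64;
    }
    Ok(())
}
```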
This approach is certainly enticing—it’s easy to understand and straightforward to implement. However, while this method would work technically, it quickly collapses under real-world conditions.
Here’s why:
Latency and Throughput
Reading from (and, to a smaller extent, writing to) the Sia storage network is a high-latency affair that can vary wildly; it's the nature of the beast. This is especially pronounced when reading lots of small objects: the Time to First Byte can sometimes run into seconds if you're unlucky. We would end up with a block device whose throughput is measured in KiB per second, making it impractical for real use.
`sia_vbd` will do its best to avoid this by trading off implementation simplicity for lower latency and higher throughput. The main aspects to achieve this are:
- Blocks are not tightly coupled to their “location”; instead, they are identified by their content (hash). A sketch of this idea follows the list.
- Blocks are heavily cached locally.
- New, previously unknown blocks are first committed to the local Write-Ahead Log (WAL) before being batch-written to `renterd`, packed into `Chunk`s.
- Because `sia_vbd` has the full picture, it can anticipate the need for a certain block before it is requested and prepare it ahead of time (e.g. read-ahead).
- Again, because of this full understanding, `sia_vbd` can rearrange the read queue and serve requests for blocks we have available locally.
- Further, blocks can be prepared in the background while read requests are waiting in the queue and then served in order of availability.
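The first point, content addressing, underpins most of the others. A minimal sketch of the idea, using the `blake3` crate (the IDs further down are Blake3 hashes); the little cache around it is purely illustrative:

```rust
use std::collections::HashMap;

/// A block is identified purely by the hash of its content,
/// not by where it lives on the device or on the network.
type BlockId = [u8; 32];

fn block_id(content: &[u8]) -> BlockId {
    *blake3::hash(content).as_bytes()
}

/// Because IDs are derived from content, identical blocks collapse into a
/// single entry no matter how many device offsets reference them.
#[derive(Default)]
struct BlockCache {
    blocks: HashMap<BlockId, Vec<u8>>,
}

impl BlockCache {
    /// Stores a block and returns its content-derived ID; duplicates are free.
    fn insert(&mut self, content: Vec<u8>) -> BlockId {
        let id = block_id(&content);
        self.blocks.entry(id).or_insert(content);
        id
    }

    fn get(&self, id: &BlockId) -> Option<&[u8]> {
        self.blocks.get(id).map(|b| b.as_slice())
    }
}
```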
There are limits, of course, and we cannot make the latency-related limitations go away completely. A significant part of the development time will be dedicated to this; it will require a lot of testing and fine-tuning to get to the point where it works well enough for at least the most typical workloads. In the paper I linked to above, latency is specifically mentioned as the most time-consuming aspect of their implementation, and they mention “nearly 6 ms” when testing using S3. `sia_vbd` has to deal with latencies that are at least one or even two orders of magnitude higher, with much worse edge cases! Aggressive methods are required to get this to work.
Scaling
`sia_vbd` needs to be able to handle multi-TiB-sized block devices without breaking much of a sweat; a TiB is not as big as it used to be.
A naive 1-object-per-block approach would quickly lead to millions of tiny objects that need to be managed by `renterd`. The overhead would soon become overwhelming:
Example:
- Block Size: 256 KiB
- Blocks needed per TiB: 4,194,304
- Sia Objects per TiB: 4,194,304
An even more naive approach could use a block size identical to the advertised sector size of the virtual block device. This would make it even easier to implement because every read/write request would exactly map to a single block. However, it would look even more extreme on the backend:
- Block Size: 4 KiB
- Blocks needed per TiB: 268,435,456
- Sia Objects per TiB: 268,435,456
So, the most direct approach (`1 sector == 1 vbd block == 1 sia object`) would require a whopping 268 million objects to store a single TiB!
Clearly, this is not going to scale very far. That's why the design of `sia_vbd` stores multiple blocks packed together into `Chunk`s. Here is how the above looks with `Chunk`s:
- Block Size: 256 KiB
- Chunk Size: 256 blocks
- Blocks needed per TiB: 4,194,304
- Chunks needed per TiB: 16,384
Approximately 16,000 objects per TiB are much more manageable than the numbers we saw with the simpler approaches. This design trade-off allows `sia_vbd` to scale to actually usable block device sizes at the cost of needing the `Chunk` indirection.
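For reference, the arithmetic behind these figures, using the block size and packing factor from the examples above:

```rust
fn main() {
    const TIB: u64 = 1 << 40;            // 1 TiB in bytes
    const BLOCK_SIZE: u64 = 256 * 1024;  // 256 KiB blocks
    const SECTOR_SIZE: u64 = 4 * 1024;   // 4 KiB sectors
    const BLOCKS_PER_CHUNK: u64 = 256;   // packing factor

    let blocks_per_tib = TIB / BLOCK_SIZE;                   // 4_194_304
    let sectors_per_tib = TIB / SECTOR_SIZE;                 // 268_435_456
    let chunks_per_tib = blocks_per_tib / BLOCKS_PER_CHUNK;  // 16_384

    println!("objects per TiB, 1 object per 256 KiB block:  {blocks_per_tib}");
    println!("objects per TiB, 1 object per 4 KiB sector:   {sectors_per_tib}");
    println!("objects per TiB, 256 blocks packed per chunk: {chunks_per_tib}");
}
```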
Initial Storage Size
When creating a new virtual block device, `sia_vbd` needs to initialize the whole structure. Without the deduplication properties of `sia_vbd`'s design, a naive approach would require writing the full amount of data, even if it's all the same, like `0x00`. For instance, if we create a new 1 TiB device, we would need to write a full TiB of blocks containing nothing but `[0x00, 0x00, ...]`. This would not only be very slow but also very wasteful, as a full 1 TiB of data would need to be written to and stored on the Sia network.
By making blocks content-addressable, immutable, and not tightly bound to their location, `sia_vbd` gains deduplication ability. When creating a new device, we end up with a structure that looks somewhat like this for a 1 TiB device (a rough sketch in code follows the list):

- 1 `Block` (256 KiB of `0x00`) with ID `86bb2b521a10612d5a1d38204fac4fa632466d1866144d8a6a7e3afc050ce7ae` (Blake3 hash)
- 1 `Cluster` (256 references to the block ID above) with ID `cac35ec206d868b7d7cb0b55f31d9425b075082b` (Merkle root of block IDs)
- 1 `State` (16384 references to the cluster ID above) with ID `afe04867ec7a3845145579a95f72eca7` (Merkle root of cluster IDs)
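A rough sketch of how that empty device hangs together. The helper code is illustrative only; in particular, hashing the concatenated child IDs with Blake3 merely stands in for the actual Merkle root construction:

```rust
fn main() {
    const BLOCK_SIZE: usize = 256 * 1024; // 256 KiB
    const BLOCKS_PER_CLUSTER: usize = 256;
    const CLUSTERS_PER_TIB: usize = 16_384;

    // 1 Block: 256 KiB of zeroes, identified by the Blake3 hash of its content.
    let zero_block = vec![0u8; BLOCK_SIZE];
    let block_id = blake3::hash(&zero_block);

    // 1 Cluster: 256 references, all pointing at the same block ID.
    let cluster: Vec<blake3::Hash> = vec![block_id; BLOCKS_PER_CLUSTER];
    let mut hasher = blake3::Hasher::new();
    for id in &cluster {
        hasher.update(id.as_bytes());
    }
    let cluster_id = hasher.finalize();

    // 1 State: 16384 references, all pointing at the same cluster ID.
    let state: Vec<blake3::Hash> = vec![cluster_id; CLUSTERS_PER_TIB];

    // Metadata sizes: 32 bytes per referenced ID, plus headers.
    println!("cluster metadata ~ {} KiB", 32 * cluster.len() / 1024); // ~ 8 KiB
    println!("state metadata   ~ {} KiB", 32 * state.len() / 1024);   // ~ 512 KiB
}
```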
The `Block` will be stored in a single `Chunk`, taking up only a few bytes due to compression. There will be a single `Cluster` metadata object taking roughly 8 KiB of storage (32 bytes per `Block` ID * the number of blocks, plus headers). A single `State` metadata object will take about 512 KiB of storage (32 bytes per `Cluster` ID * the number of clusters, plus headers).
Compared to the naive approach, `sia_vbd` can initialize a new 1 TiB block device in a few milliseconds, and the empty device only requires about 530 KiB of active storage, compared to the full 1 TiB the naive approach would consume.
I hope this makes it clear that the approach `sia_vbd` takes was chosen with care and that the trade-offs are well worth it compared to a simpler approach. The naive approach would just not be very practical in real-world situations.