Small Grant: Sia Virtual Block Device (sia_vbd)

Project Name:

Sia Virtual Block Device (sia_vbd)

Name of the organization or individual submitting the proposal:

Roland Rauch

Describe your project:

sia_vbd implements virtual block devices on top of renterd. Essentially, it provides users with virtual disks that are location-independent, can grow to almost any size, support snapshots and branching, are deduplicated and compressed, and are fully backed by Sia objects.

How It Works

sia_vbd organizes data into Blocks, which are fixed-size units of data addressed by their cryptographic hash. These blocks are grouped into larger structures called Clusters, which are collections of block hashes forming a Merkle tree. Multiple clusters together form the state of the block device, again arranged as a Merkle tree. This design makes sia_vbd virtual disks similar in nature to Git repositories.

Blocks are compressed and saved in Chunks, which are stored as regular Sia objects with additional user metadata indicating the contained blocks and their offsets.

The virtual disks are exported to the user over the network, initially via NBD (Network Block Device), with the ultimate goal of also supporting iSCSI. Once connected, the virtual disk looks like any regular disk to the user, allowing formatting, partitioning, and other standard disk operations.

Under The Hood

In the background, sia_vbd maintains a block cache and a Write-Ahead Log (WAL):

  • Read Requests: These are mapped to the corresponding block and served either directly from the cache if available, or by fetching the block from renterd if not.
  • Write Requests: These are handled by first updating the affected blocks locally, recalculating their hashes, and committing any new blocks to the local WAL. Once the WAL reaches a certain size, the contained blocks are compressed, written to a new Chunk, and uploaded to renterd, making the current state permanent.
  • Garbage Collection: Periodically, a garbage collection task identifies Chunks that contain many unused Blocks. The task consolidates the still-referenced Blocks into new Chunks and then deletes the old, now obsolete Chunks.

As with my previous project, sia_vbd will be implemented in Rust and will be made available as a standalone binary and a Docker image, with no dependencies besides renterd and common system libraries.

This project proposal is in response to an RFP found at Sia - Grants.

How does the projected outcome serve the Foundation’s mission of user-owned data?

Sia natively provides an Object Storage interface. My previous project, sia_nfs, added a virtual file system accessible over NFS. Now, with sia_vbd, my aim is to implement a virtual block device on top of Sia’s object storage, providing the missing piece to make Sia a unified storage solution.

sia_vbd allows:

  • Use cases that are not served by Object Storage or File System access
  • Users to have fully decentralized, globally distributed virtual disks that they can attach, detach, and move around at will
  • Virtual disks to be used as native disks for VMs
  • A single sia_vbd server to serve an entire network
  • Better enterprise integration with workloads that do not fit with the other two storage types

With all three storage types available, users have the flexibility to choose the most suitable storage type for their needs, whether it’s Object Storage, File System, or Block Storage.

Grant Specifics

Amount of money requested and justification with a reasonable breakdown of expenses:

The total amount requested is USD 8,000, which covers:

  • 8 weeks of full-time work (320 hours @ USD 25/hour).

No additional equipment is required. During development, the testnet will be used, so no SC are required.

What are the goals of this small grant?

The goal of the grant is to provide sufficient funding for the development of sia_vbd. The time estimate is based on my previous experience building sia_nfs, the work I can reuse from that project (specifically renterd_client), and my prior experience creating a virtual block device with an NBD interface.

Development Timeline:

Two milestones are planned:

  • Milestone 1: Version 0.1.0 at the end of week 4. This version will be very basic and largely untested. Basic functionality will be there, but performance is expected to be slow. I/O scheduling will not be optimized, and non-core functions, such as resizing and snapshots, will be absent. Only the basic functionality will undergo testing at this stage.
  • Milestone 2: Version 0.2.0 at the end of week 8. I/O scheduling will be optimized, resulting in improved performance. Many use cases will have been tested, including on Windows. Missing functions, such as resizing and snapshots, will be included. Usage documentation and a Docker image will be available. This will be the first generally usable release.

Features & Scope:

  • A single, standalone program that runs on every major platform where renterd is available.
  • Block Cache & WAL (Write-Ahead Log)
  • Basic functionality to create, resize, and delete a block device
  • Snapshots and Branching
  • nbd support
  • Fully open source (Apache-2.0 & MIT licenses), with a public repository on GitHub.
  • Basic usage documentation and example configurations.
  • A small, standalone Docker image for a simplified user experience

Potential risks that may affect the outcome of the project:

  • My previous experience building sia_nfs has shown that data access latencies can vary significantly when reading object data from renterd. I have observed latencies in the 400-500ms range, but also in the 5000ms range, and occasionally even higher. This is likely partly because I was working on the testnet, but it also reflects the inherent nature of a completely decentralized, globally distributed storage network. Many applications are not designed for these latencies, which can seriously limit the practicality of solutions like this one. Furthermore, access patterns such as out-of-order reading/writing, read-ahead, or frequent seeking (especially backwards) have caused major issues when implementing sia_nfs. I spent considerable time developing, implementing, and testing strategies to mitigate these issues and eventually came up with a solution that works well enough in most cases. I have incorporated these lessons into the design of sia_vbd and will implement similar strategies to work around these limitations. However, these fundamental issues exist, and not every use case will work well with sia_vbd.

  • NBD does not enjoy the same level of support as iSCSI. NBD is natively available on Linux, can be installed on Windows via the cloudbase/wnbd driver from the Ceph for Windows project (with some limitations: the driver is not signed, which makes it cumbersome to use on Windows 11 and up, see cloudbase/wnbd issue #89), and has very limited macOS support (the elsteveogrande/osx-nbd client driver). Interestingly, Apple supports NBD natively in its Virtualization Framework (VZNetworkBlockDeviceStorageDeviceAttachment), but I don’t believe this is useful for most users.

  • Compatibility: Although, in theory, any block device should work with any filesystem, this might not always be the case in practice. When I previously implemented a virtual block device several years ago, I developed it for a specific filesystem. The first time I tested it with a different filesystem, it caused an immediate kernel panic. Sometimes implementations rely on subtle details that should not matter in theory but do in practice. Additionally, users are free to use the virtual disk as they please. They can partition it in various ways, build a software RAID, use it with lvm, and much more. I cannot guarantee 100% compatibility in all cases. That said, I will certainly test it against what I believe to be the most common cases—and some uncommon ones—and I am fairly confident it will not have too many compatibility issues in practice. However, this is a risk that needs to be acknowledged.

A Word on iscsi

Initially, this project was supposed to be called sia_iscsi and was meant to include support for both iscsi and nbd as access protocols. However, I decided to change my proposal for two reasons:

  • A project with a very similar name has been proposed recently. To avoid confusion, I decided to change the name of my project to sia_vbd.
  • Risks and scope: The network protocol for iscsi is significantly more complex than nbd. Additionally, I need to emulate a virtual SCSI device (the scsi part in iscsi). SCSI is extensive—the command reference manual alone is over 500 pages long. This essentially constitutes its own project and will require a lot of testing against numerous iscsi initiators, operating systems, file systems, etc.

To keep the scope clear and the risks manageable, I decided to split my initial project into two parts. For now, I am focusing on implementing a working, Sia-backed virtual disk, as described above, and making it available via nbd. Once I have delivered on that, I will submit a proposal for part two—iscsi support. By then, I will also have a better idea of how this needs to be approached than I do now.

Development Information

Will all of your project’s code be open-source?

Yes, the code will be fully open source and will be made available on GitHub (Apache-2.0 & MIT licenses). Furthermore, all libraries used are also open source.

Leave a link where the code will be accessible for review.

A repository will be created on GitHub once the grant is approved.

Do you agree to submit monthly progress reports?

Of course

Contact Info

Email: [email protected]

Previous Related Projects: sia_nfs and renterd_client (see above)

As a developer and long-time Linux-heavy user, I have many questions about your approach.
iSCSI or VBD devices operate at the block level, while SIA works at the file level. This is one of the main reasons why it is very easy to have NFS, FUSE and S3 support for SIA.

How will you do the “mapping” (I can’t find a better word) of a file to a block request? This is a very technical question, don’t hesitate to be technical.

Will this be coded in C, hence a kernel module? If so, how will you deal with the “magic number”? Do you have plans to deliver a DKMS package?

How are you going to deal with partial writes? SIA (renterd) doesn’t have that feature.

Thanks for your comment. I’ll try to explain everything in more detail.

Apologies if some of this seems too basic; I just want to make things clearer for everyone who is reading this.

To begin, I want to briefly mention the three main types of storage:

  • Block storage
  • File system access
  • Object storage, sometimes called blob storage

Natively, Sia provides an object storage interface to its data via renterd. However, this does not mean we cannot build middleware on top of it that provides the other two types. Several existing projects, including yours and mine, have implemented file system access on top of renterd. However, block storage is still missing, hence sia_vbd.

Regarding your question about how this is actually done: this was addressed in the proposal, mainly in the “How It Works” and “Under The Hood” sections above. However, I had to keep it brief and not too technical, so it might not have been as clear as it could have been.

Here is a more detailed explanation:

Introduction

What is a Block Device?

Essentially, a block device is a large, addressable blob. Simplified, it handles requests like:

  • Reading n bytes from offset o
  • Writing bytes [0xaa, 0xbb, 0xcc, ...] to offset o

That’s the gist of it. Traditionally, users would create a partition table on it, then create file systems on a partition, and then read/write files and directories within it, etc. All these actions occur at a higher level. The block device itself just deals with these read and write requests (and some others of course, but they are omitted here for brevity) and is blissfully unaware of what the data represents or how it is used. It operates on a lower level.
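
To make this concrete, here is a minimal Rust sketch of the request interface a block device exposes. The trait and method names are purely illustrative and are not sia_vbd's actual API:

  // Minimal sketch of the request interface a block device exposes.
  // Trait and method names are illustrative only, not sia_vbd's actual API.
  use std::io;

  trait BlockDevice {
      /// Total size of the device in bytes.
      fn size(&self) -> u64;

      /// Read `buf.len()` bytes starting at `offset`.
      fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<()>;

      /// Write `data` starting at `offset`.
      fn write_at(&mut self, offset: u64, data: &[u8]) -> io::Result<()>;

      /// Ensure all pending writes are durable.
      fn flush(&mut self) -> io::Result<()>;
  }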

How to Interface with Block Devices

A Brief History

Historically, block devices such as hard disks, SSDs, USB sticks, etc., have usually been directly attached to a computer, and interfaces such as ATA, SCSI, and SATA were used to communicate with them. This is still the most common way today. At one point, people considered separating compute and storage. It made a lot of sense: data would not be bound to a specific compute node, and storage and compute could scale independently.

In the end, interfacing with a block device is just sending and receiving requests, so there is no reason why it has to be limited to a local bus and can’t go over a network.

Initially, people built expensive, dedicated Fibre Channel networks with special hardware that handled only storage-related traffic - the SAN was born. Later, people realized it could be more flexible and cost-effective to route storage traffic over the same infrastructure as all other traffic.

This led to:

Network Protocols

Several protocols emerged to address this need, most notably:

  • iSCSI: Essentially SCSI tunneled over IP. Although the network protocol itself is moderately complicated, it comes with decades of legacy as its functioning depends on the tunneled SCSI commands. It makes a lot of sense for what it was designed for: exporting an existing SCSI device over IP. However, if you want to make other, non-SCSI block devices available over the network, it is much more difficult - essentially, you need to emulate a SCSI device. And the SCSI command reference alone is over 500 pages long!

  • ATA over Ethernet (AoE): It was somewhat popular for a brief moment, but being Layer 2 (and thus not routable) and also bound to ATA commands, it was rather short-lived.

  • NBD: NBD is much simpler than iSCSI and not tightly coupled to an underlying storage protocol. It only has a few commands and is easy to implement. I have previously written a virtual block device with an NBD interface, and it was a mostly straightforward task. NBD is still around and in use, but it never reached the broad popularity of iSCSI, even though it actually predates iSCSI.

Virtual Block Devices

After clarifying what a block device is, it should now be clear that there is nothing stopping us from implementing one in software. In fact, most modern hardware block devices are implemented in software to a large degree. Every SSD is basically emulating a block device in its firmware. Every modern storage controller is essentially a dedicated embedded system running custom software (often on Linux) to provide block device interfaces to the underlying data. A whole industry has emerged around the term Software-defined Storage - software has really been eating the (storage) world.

How sia_vbd Works

I tried to answer this briefly in the proposal above, aiming to convey the core principles. However, due to the need for brevity, some details have been omitted. Here is a more detailed description:

Bird’s Eye View

sia_vbd implements Sia Object storage-backed Virtual Block Devices in software and exports them via nbd (and potentially iscsi in the future). It is a single process, written in Rust, as mentioned in the proposal, and runs entirely in userspace. It is supposed to run on all major platforms (Linux, macOS, Windows).

sia_vbd only implements the server (referred to as target in iSCSI terminology). Clients (referred to as initiators in iSCSI) connect to it over the network or localhost.

Once connected, the client (which typically runs in kernel space) makes the remote virtual block device appear as a local one. On Linux, when using nbd, these devices can usually be found under /dev/nbdX. These devices behave largely like local block devices. However, reads and writes are forwarded to sia_vbd over the network, where they are processed, and a response is returned. The internal workings of sia_vbd are abstracted away from the client. Internally, sia_vbd uses renterd, caching, and a local WAL (Write-Ahead Log) to manage operations. To the user, however, it appears as just a regular block device.

The Nitty Gritty

Firstly, sia_vbd is specifically designed for renterd and its properties and limitations. More specifically:

  • The fact that Sia Objects are immutable (I believe this is what you are referring to when you mentioned ‘partial writes’).
  • That latency when reading object data from renterd can be very high and is highly variable in practice.

It also tries to avoid unnecessary writes by first storing them in a local WAL and only uploading the new data to renterd once a certain amount has accumulated. These points have been mentioned in the proposal; the latency issue in particular is discussed under Risks, and I have written about it extensively in my previous project.

Now, the technical details - how can sia_vbd actually provide a block device on top of Sia Object Storage:

Data Structures

The smallest unit of data is called a Block. A Block has a fixed size (256KiB for now, though this might change during development), and its id is the hash of its content (probably Blake3). This means blocks are immutable. Writing to a block will result in a new block if the content has changed, as its hash and therefore its id will change.

Blocks are compressed and stored together in Chunks, which are just regular Sia objects. However, user metadata is used to indicate the blocks stored in the Chunk at their respective offsets. This way, we can determine the contents of a chunk via a quick (and inexpensive) HEAD request to renterd without needing to read the entire Chunk. On startup, sia_vbd will call renterd to quickly scan all the Chunk objects available at a specific bucket/prefix and then build an internal database of all known blocks and their locations.
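
To illustrate, the per-Chunk metadata could look roughly like the following Rust sketch; the field names and encoding are assumptions for illustration, not the final format:

  // Hypothetical sketch of the per-Chunk manifest kept in the Sia object's
  // user metadata, so that a cheap HEAD request is enough to learn which
  // blocks a Chunk contains. Field names and encoding are not final.
  struct ChunkManifest {
      /// Identifier of the Chunk (e.g. the UUID used in the object key).
      chunk_id: String,
      /// Blocks contained in this Chunk, in storage order.
      blocks: Vec<ChunkEntry>,
  }

  struct ChunkEntry {
      /// Content hash of the block (its id).
      block_id: [u8; 32],
      /// Byte offset of the compressed block within the Chunk object.
      offset: u64,
      /// Compressed length of the block in bytes.
      len: u64,
  }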

Multiple blocks (256 for now, but subject to change) are grouped into a larger structure called a Cluster. A Cluster is a Merkle tree with its leaves being the ids (hashes) of the underlying blocks. The id of the Cluster is derived from the Merkle root of this tree. This has the same effect seen with Blocks—every write creates a new version of the affected Blocks and thus automatically creates a new version of the Cluster because it affects the Merkle tree and consequently the Merkle root.

Multiple clusters together form the State (still looking for a better name) of the block device. This is constructed in the same way as a Cluster, but instead of having Block ids as leaves, it is composed of Cluster ids. Here is some ASCII art to illustrate:

Block Device State (acac3b00518f.. | id = Root of Cluster Merkle Tree)
 ├─...
 ├─Cluster 6 (e882e6f6f8.. | id = Root of Block Merkle Tree)
 │  ├─Block 0   (e4fa1555ad.. | id = Content Hash)
 │  ├─Block 1   (b4b9b02e6f..)
 │  ├─...
 │  └─Block 255 (afe04867ec..)
 ├─Cluster 7 (200eee8955..)
 │  ├─Block 0   (9e1e0aba56..)
 │  ├─Block 1   (7cce966f55..)
 │  ├─...
 │  └─Block 255 (60cf670d17..)
 ├─...
 └─Cluster 15
    └─...
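
In Rust terms, the hierarchy could be modeled roughly like this (a sketch only; names, sizes, and the exact hash function are not final, as noted above):

  // Rough sketch of the Block / Cluster / State hierarchy, assuming
  // 256 KiB blocks, 256 blocks per cluster, and a 32-byte hash as id.
  type Id = [u8; 32]; // content hash or Merkle root (probably Blake3)

  struct Block {
      id: Id,        // hash of `data`
      data: Vec<u8>, // up to 256 KiB of device data
  }

  struct Cluster {
      id: Id,             // Merkle root over `block_ids`
      block_ids: Vec<Id>, // 256 leaves, one per Block
  }

  struct State {
      id: Id,               // Merkle root over `cluster_ids`
      cluster_ids: Vec<Id>, // one per Cluster of the device
  }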

This structure allows us to efficiently manage the State of large block devices, e.g.:

  • Block size: 256 KiB
  • Block Id: 32 bytes (256 bit)
  • Cluster size: 256 Blocks
  • Clusters per GiB: 16
  • Cluster Id: 32 bytes (256 bit)
  • State data size: 32 bytes * 16 = 512 bytes

Given this structure, we can keep track of the State using only 512 bytes per GiB.

Reading

Here is an example of how a Read Request would be handled:

  1. A Read Request is received with the details: read 1024 bytes at offset 536871661.
  2. First, we map the offset to the corresponding cluster: offset 536871661 is in cluster 8 (ID: ae3a3358b3459c761a3), with a relative offset of 749 (see the mapping sketch after this list).
  3. We locate the corresponding block(s): cluster offset 749 is in Block 0, with a relative offset of 749.
  4. We know that Block 0 in Cluster 8 has the ID ce966f5503e292a5138 (Hash of its content).
  5. First, we check if ce966f5503e292a5138 is in our local cache. If it is, we retrieve it from the cache and skip to step 10. If not:
  6. We look up the Chunk(s) that contain Block ce966f5503e292a5138 in our database.
  7. We find it in Chunk 01924cfb-83ce-7ff9-88f7-b2b739f77019 at offset 849067, with a length of 1828502 bytes.
  8. We send a download request to renterd for 1828502 bytes at offset 849067 for the object /chunks/01924cfb-83ce-7ff9-88f7-b2b739f77019.chunk.
  9. The data is downloaded, decompressed, verified, and stored in the local cache.
  10. 1024 bytes at offset 749 are read from Block ce966f5503e292a5138 and sent to the client.
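
In code, the mapping in steps 2 and 3 boils down to simple integer arithmetic. A minimal sketch, assuming the 256 KiB block size and 256 blocks per cluster mentioned earlier:

  // Map a device offset to (cluster index, block index, offset within block).
  const BLOCK_SIZE: u64 = 256 * 1024;                        // 262,144 bytes
  const BLOCKS_PER_CLUSTER: u64 = 256;
  const CLUSTER_SIZE: u64 = BLOCK_SIZE * BLOCKS_PER_CLUSTER; // 64 MiB

  fn map_offset(offset: u64) -> (u64, u64, u64) {
      let cluster = offset / CLUSTER_SIZE;
      let in_cluster = offset % CLUSTER_SIZE;
      (cluster, in_cluster / BLOCK_SIZE, in_cluster % BLOCK_SIZE)
  }

  fn main() {
      // The example above: offset 536871661 -> cluster 8, block 0, offset 749.
      assert_eq!(map_offset(536_871_661), (8, 0, 749));
  }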

In practice, this will require strategies to minimize download latencies as much as possible if we need to request the Block from renterd. I worked a lot on this when optimizing sia_nfs.

Writing

Writes are somewhat more involved:

  1. A write request is received. Details: write [0x3f, 0xff, 0xa1, …] at offset 536871661.
  2. We look up the Cluster and Block as before: Cluster 8 (id ae3a3358b3459c761a3), Block 0, Block id: ce966f5503e292a5138.
  3. At this point, we essentially perform a copy-on-write by copying the Block first, modifying the data at offset 749, and recalculating the Block hash (sketched in code after this list).
  4. The new Block hash is now: 8a7a08d7939550.
  5. Next, we append the new Block 8a7a08d7939550 and its data to the local WAL.
  6. We then update Cluster 8 and change Block 0’s id to 8a7a08d7939550. This triggers a recalculation of the Merkle root, thereby changing the id of the Cluster.
  7. Cluster 8’s id is now fb03f471c35ba13e, so we update it in the State, which leads to the recalculation of the State Merkle tree and changes the State Merkle root.
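
The copy-on-write in steps 3 to 5 could look roughly like the sketch below. The WAL type and the hash function are placeholders for illustration only:

  // Copy-on-write sketch for a write that lands inside a single existing block.
  type Id = [u8; 32];

  struct Wal;
  impl Wal {
      fn append(&mut self, _id: Id, _data: &[u8]) {
          // Placeholder: append the new block to the local WAL file.
      }
  }

  fn hash(data: &[u8]) -> Id {
      // Placeholder, NOT cryptographic: the real implementation would use
      // something like Blake3 over `data`.
      let mut id = [0u8; 32];
      for (i, b) in data.iter().enumerate() {
          id[i % 32] ^= *b;
      }
      id
  }

  fn write_in_block(old_data: &[u8], in_block_off: usize, payload: &[u8], wal: &mut Wal) -> Id {
      // 1. Copy the current block content (copy-on-write).
      let mut new_data = old_data.to_vec();
      // 2. Patch the affected byte range.
      new_data[in_block_off..in_block_off + payload.len()].copy_from_slice(payload);
      // 3. The hash of the new content becomes the new block id.
      let new_id = hash(&new_data);
      // 4. Persist the new block locally; it is uploaded when the WAL is flushed.
      wal.append(new_id, &new_data);
      // The caller then swaps the old block id for `new_id` in the Cluster's
      // leaves, which changes the Cluster's Merkle root and, in turn, the State root.
      new_id
  }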

In practice, sia_vbd will not recalculate the state on every single write. For efficiency reasons, it will do so after a number of writes or when it receives an explicit flush command.

This is why I mentioned in the proposal that sia_vbd block devices are structured somewhat like Git repositories. Essentially, every change creates a new, addressable state—much like a commit in a Git repository. This design works even though Sia’s object store is immutable, because we never overwrite anything. Instead, every modification creates a new state based on the previous one. This approach also gives us deduplication, snapshotting, and branching—essentially for free.

WAL Flushing

As mentioned above, new blocks created from writes are first appended to the local WAL. Once the WAL reaches a certain size, we flush the WAL and start a new one. Flushing means extracting the contained blocks, compressing them, and writing them to a new Chunk. We then upload this Chunk to renterd for permanent storage.

By flushing the WAL, we effectively perform “batch writes” instead of sending every single block to renterd, which would not be practical. However, this process introduces a risk of data loss if we lose an “unflushed” WAL due to hardware failure or other issues. This is a necessary tradeoff to make the system work efficiently.
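
A rough sketch of what a flush does, assuming hypothetical compress and upload helpers (the actual compression codec and the renterd upload call are not shown here):

  // Turn the blocks accumulated in the WAL into a single new Chunk.
  type Id = [u8; 32];

  struct ChunkEntry { block_id: Id, offset: u64, len: u64 }

  fn build_chunk(blocks: Vec<(Id, Vec<u8>)>) -> (Vec<u8>, Vec<ChunkEntry>) {
      let mut body = Vec::new();     // payload of the new Chunk object
      let mut manifest = Vec::new(); // goes into the object's user metadata
      for (block_id, data) in blocks {
          let compressed = compress(&data);
          manifest.push(ChunkEntry {
              block_id,
              offset: body.len() as u64,
              len: compressed.len() as u64,
          });
          body.extend_from_slice(&compressed);
      }
      // The caller uploads `body` as a new /chunks/<uuid>.chunk object,
      // attaches `manifest` as user metadata, and then starts a fresh WAL.
      (body, manifest)
  }

  fn compress(data: &[u8]) -> Vec<u8> {
      // Placeholder: a real implementation would use a codec such as zstd or lz4.
      data.to_vec()
  }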

Garbage Collection

You might have noticed that we always append data but never overwrite (which isn’t possible anyway) or delete it. Instead, sia_vbd periodically runs a garbage collector. Its job is to find Chunks that contain a significant number of orphaned Blocks. Orphaned Blocks are blocks that are not currently referenced in any active State.

When the garbage collector identifies such Chunks, it consolidates them by first extracting the Blocks we still need to keep. These Blocks are then written to a new Chunk and uploaded to renterd. Once this process is complete, it is safe to delete the now obsolete Chunks.

So, old data does get deleted eventually, but not immediately.
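
A simplified sketch of the consolidation logic; the names and the threshold are illustrative, and the actual rewrite, upload, and delete calls are only hinted at in comments:

  // Find Chunks whose share of still-referenced blocks has dropped below a
  // threshold and consolidate them.
  use std::collections::HashSet;

  type Id = [u8; 32];

  struct ChunkInfo {
      object_key: String, // e.g. /chunks/<uuid>.chunk
      block_ids: Vec<Id>,
  }

  fn collect_garbage(chunks: &[ChunkInfo], live: &HashSet<Id>, min_live_ratio: f64) {
      for chunk in chunks {
          if chunk.block_ids.is_empty() {
              continue;
          }
          let live_blocks: Vec<&Id> =
              chunk.block_ids.iter().filter(|id| live.contains(*id)).collect();
          let ratio = live_blocks.len() as f64 / chunk.block_ids.len() as f64;
          if ratio < min_live_ratio {
              // 1. Re-pack `live_blocks` into a new Chunk and upload it to renterd.
              // 2. Only after the upload succeeds, delete `chunk.object_key`.
          }
      }
  }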

That’s it for now

I hope this makes things clearer. Apologies for the lengthy post, but I wanted to make sure I explained everything in detail.

If you’re interested in reading more about this, there is an interesting paper on a similar system:

While it doesn’t work exactly the same way as sia_vbd, many of the core concepts are the same or at least very similar.

Hope this makes the inner workings of sia_vbd a bit clearer. Please keep asking if something is unclear, I appreciate the interest!

Ok, so you are going to enumerate blocks and save 1 block = 1 file. For example:
Block 0x320593 would be /sia_bucket/0x320593 file?

You mentioned you are going to use an NBD, so I am guessing you are going to write the NBD client (user space software) and server (this is where renterd is) and use the NBD kernel module to do the interface.

I found a very easy explanation of how NBD works here: linux NBD: Introduction to Linux Network Block Devices | by Kozanoglu Aysad | Medium

I also found this: nbdkit / nbdkit · GitLab

It seems it is an NBD server with a plugin architecture.

No, this is not how it works. I’ve explained it in detail above.

No, sia_vbd implements the server side; the nbd kernel module IS the client. It’s all been explained.

I suggest reviewing the proposal and my previous response, it should clarify everything.

Ok, so you mention you are going to replicate the way Git works. But at the end of the day, what I will see on the web interface of my renterd will be files that represent the cluster-block of data that NBD uses, such as:

/chunks/01924cfb-83ce-7ff9-88f7-b2b739f77019.chunk .

6. We look up the Chunk(s) that contain Block *ce966f5503e292a5138* in our database.
This is kind of confusing to me: why use a DB instead of a directory hierarchy? Renterd returns 404 when an object doesn’t exist, so there is less room for errors. Not to mention it will simplify your coding. Just create a function f(offset, len) = [(SIA_FILE_TO_READ, offset, len),]. If you read my siafs code (the write functions) you will see how I did it.

For example:

\sia_nbd_bucket
  \State X
    \Cluster Y
      - Block X

Or if you want to keep it simpler:
\sia_nbd_bucket
  \chunk-X-offset-y

This also has the advantage that you have the requested data together (from the wording of point number 6, I am assuming it could return more than one chunk). Renterd (by design) will already split objects into chunks and spread them across different Hostd servers, so I think there is no need to split on top of that.