Small Grant: Sia Virtual Block Device (sia_vbd)

Project Name:

Sia Virtual Block Device (sia_vbd)

Name of the organization or individual submitting the proposal:

Roland Rauch

Describe your project:

sia_vbd implements virtual block devices on top of renterd. Essentially, it provides users with virtual disks that are location-independent, can grow to almost any size, support snapshots and branching, are deduplicated and compressed, and are fully backed by Sia objects.

How It Works

sia_vbd organizes data into Blocks, which are fixed-size units of data addressed by their cryptographic hash. These blocks are grouped into larger structures called Clusters, which are collections of block hashes forming a Merkle tree. Multiple clusters together form the state of the block device, again arranged as a Merkle tree. This design makes sia_vbd virtual disks similar in nature to Git repositories.

Blocks are compressed and saved in Chunks, which are stored as regular Sia objects with additional user metadata indicating the contained blocks and their offsets.

The virtual disks are exported to the user over the network, initially via NBD (Network Block Device), with the ultimate goal of also supporting iSCSI. Once connected, the virtual disk looks like any regular disk to the user, allowing formatting, partitioning, and other standard disk operations.

Under The Hood

In the background, sia_vbd maintains a block cache and a Write-Ahead Log (WAL):

  • Read Requests: These are mapped to the corresponding block and served either directly from the cache if available, or by fetching the block from renterd if not.
  • Write Requests: These are handled by first updating the affected blocks locally, recalculating their hashes, and committing any new blocks to the local WAL. Once the WAL reaches a certain size, the contained blocks are compressed, written to a new Chunk, and uploaded to renterd, making the current state permanent.
  • Garbage Collection: Periodically, a garbage collection task identifies Chunks that contain many unused Blocks. The task consolidates the still-referenced Blocks into new Chunks and then deletes the old, now obsolete Chunks.

As with my previous project, sia_vbd will be implemented in Rust and will be made available as a standalone binary and a Docker image, with no dependencies besides renterd and common system libraries.

This project proposal is in response to an RFP found at Sia - Grants.

How does the projected outcome serve the Foundation’s mission of user-owned data?

Sia natively provides an Object Storage interface. My previous project, sia_nfs, added a virtual file system accessible over NFS. Now, with sia_vbd, my aim is to implement a virtual block device on top of Sia’s object storage, providing the missing piece to make Sia a unified storage solution.

sia_vbd allows:

  • Use cases that are not served by Object Storage or File System access
  • Users to have fully decentralized, globally distributed virtual disks that they can attach, detach, and move around at will
  • Virtual disks to be used as native disks for VMs
  • A single sia_vbd server to serve an entire network
  • Better enterprise integration with workloads that do not fit with the other two storage types

With all three storage types available, users have the flexibility to choose the most suitable storage type for their needs, whether it’s Object Storage, File System, or Block Storage.

Grant Specifics

Amount of money requested and justification with a reasonable breakdown of expenses:

The total amount requested is USD 8,000, which covers:

  • 8 weeks of full-time work (320 hours @ USD 25/hour).

No additional equipment is required. During development, the testnet will be used, so no SC are required.

What are the goals of this small grant?

The goal of the grant is to provide sufficient funding for the development of sia_vbd. The time estimate is based on my previous experience building sia_nfs, the work I can reuse from that project (specifically renterd_client), and my prior experience creating a virtual block device with an NBD interface.

Development Timeline:

Two milestones are planned:

  • Milestone 1: Version 0.1.0 at the end of week 4. This version will be very basic and largely untested. Basic functionality will be there, but performance is expected to be slow. I/O scheduling will not be optimized, and non-core functions, such as resizing and snapshots, will be absent. Only the basic functionality will undergo testing at this stage.
  • Milestone 2: Version 0.2.0 at the end of week 8. I/O scheduling will be optimized, resulting in improved performance. Many use cases will have been tested, including on Windows. Missing functions, such as resizing and snapshots, will be included. Usage documentation and a Docker image will be available. This will be the first generally usable release.

Features & Scope:

  • A single, standalone program that runs on every major platform where renterd is available.
  • Block Cache & WAL (Write-Ahead Log)
  • Basic functionality to create, resize, and delete a block device
  • Snapshots and Branching
  • nbd support
  • Fully open source (Apache-2.0 & MIT licenses), with a public repository on GitHub.
  • Basic usage documentation and example configurations.
  • A small, standalone Docker image for a simplified user experience

Potential risks that may affect the outcome of the project:

  • My previous experience building sia_nfs has shown that data access latencies can vary significantly when reading object data from renterd. I have observed latencies in the 400-500ms range, but also in the 5000ms range, and occasionally even higher. This is likely partly because I was working on the testnet, but it also reflects the inherent nature of a completely decentralized, globally distributed storage network. Many applications are not designed for these latencies, which can seriously limit the practicality of solutions like this one. Furthermore, access patterns such as out-of-order reading/writing, read-ahead, or frequent seeking (especially backwards) have caused major issues when implementing sia_nfs. I spent considerable time developing, implementing, and testing strategies to mitigate these issues and eventually came up with a solution that works well enough in most cases. I have incorporated these lessons into the design of sia_vbd and will implement similar strategies to work around these limitations. However, these fundamental issues exist, and not every use case will work well with sia_vbd.

  • NBD does not enjoy the same level of support as iSCSI. NBD is natively available on Linux, can be installed on Windows via the cloudbase/wnbd driver from the Ceph for Windows project (with some limitations: the driver is not signed, which makes it cumbersome to use on Windows 11 and up, see cloudbase/wnbd issue #89), and has very limited macOS support (the elsteveogrande/osx-nbd client driver). Interestingly, Apple supports NBD natively in its Virtualization Framework (VZNetworkBlockDeviceStorageDeviceAttachment), but I don’t believe this is useful for most users.

  • Compatibility: Although, in theory, any block device should work with any filesystem, this might not always be the case in practice. When I previously implemented a virtual block device several years ago, I developed it for a specific filesystem. The first time I tested it with a different filesystem, it caused an immediate kernel panic. Sometimes implementations rely on subtle details that should not matter in theory but do in practice. Additionally, users are free to use the virtual disk as they please. They can partition it in various ways, build a software RAID, use it with lvm, and much more. I cannot guarantee 100% compatibility in all cases. That said, I will certainly test it against what I believe to be the most common cases—and some uncommon ones—and I am fairly confident it will not have too many compatibility issues in practice. However, this is a risk that needs to be acknowledged.

A Word on iscsi

Initially, this project was supposed to be called sia_iscsi and was meant to include support for both iscsi and nbd as access protocols. However, I decided to change my proposal for two reasons:

  • A project with a very similar name has been proposed recently. To avoid confusion, I decided to change the name of my project to sia_vbd.
  • Risks and scope: The network protocol for iscsi is significantly more complex than nbd. Additionally, I need to emulate a virtual SCSI device (the scsi part in iscsi). SCSI is extensive—the command reference manual alone is over 500 pages long. This essentially constitutes its own project and will require a lot of testing against numerous iscsi initiators, operating systems, file systems, etc.

To keep the scope clear and the risks manageable, I decided to split my initial project into two parts. For now, I am focusing on implementing a working, Sia-backed virtual disk, as described above, and making it available via nbd. Once I have delivered on that, I will submit a proposal for part two—iscsi support. By then, I will also have a better idea of how this needs to be approached than I do now.

Development Information

Will all of your project’s code be open-source?

Yes, the code will be fully open source and will be made available on GitHub (Apache-2.0 & MIT licenses). Furthermore, all libraries used are also open source.

Leave a link where the code will be accessible for review.

A repository will be created on GitHub once the grant is approved.

Do you agree to submit monthly progress reports?

Of course

Contact Info

Email: [email protected]

Previous Related Projects: sia_nfs and renterd_client (see above)

As a developer and long-time Linux-heavy user, I have many questions about your approach.
iSCSI or VBD devices operate at the block level, while SIA works at the file level. This is one of the main reasons why it is very easy to have NFS, FUSE and S3 support for SIA.

How will you do the “mapping” (I can’t find a better word) of a file to a block request? This is a very technical question, don’t hesitate to be technical.

Will this be coded in C, hence a kernel module? If so, how will you deal with the “magic number”? Do you have plans to deliver a DKMS package?

How are you going to deal with partial writes? SIA (renterd) doesn’t have that feature.

Thanks for your comment. I’ll try to explain everything in more detail.

Apologies if some of this seems too basic; I just want to make things clearer for everyone who is reading this.

To begin, I want to briefly mention the three main types of storage:

  • Block storage
  • File system access
  • Object storage, sometimes called blob storage

Natively, Sia provides an object storage interface to its data via renterd. However, this does not mean we cannot build middleware on top of it that provides the other two types. Several existing projects, including yours and mine, have implemented file system access on top of renterd. However, block storage is still missing, hence sia_vbd.

Regarding your question about how this is actually done: this was addressed in the proposal, mainly in the “How It Works” and “Under The Hood” sections above. However, I had to keep it brief and not too technical, so it might not have been as clear as it could have been.

Here is a more detailed explanation:

Introduction

What is a Block Device?

Essentially, a block device is a large, addressable blob. Simplified, it handles requests like:

  • Reading n bytes from offset o
  • Writing bytes [0xaa, 0xbb, 0xcc, ...] to offset o

That’s the gist of it. Traditionally, users would create a partition table on it, then create file systems on a partition, and then read/write files and directories within it, etc. All these actions occur at a higher level. The block device itself just deals with these read and write requests (and some others of course, but they are omitted here for brevity) and is blissfully unaware of what the data represents or how it is used. It operates on a lower level.
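
To make this concrete, here is a minimal Rust sketch of the request interface a block device exposes. The trait and method names are purely illustrative and are not sia_vbd's actual API:

  // Minimal sketch of the request interface a block device exposes.
  // Trait and method names are illustrative only, not sia_vbd's actual API.
  use std::io;

  trait BlockDevice {
      /// Total size of the device in bytes.
      fn size(&self) -> u64;

      /// Read `buf.len()` bytes starting at `offset`.
      fn read_at(&self, offset: u64, buf: &mut [u8]) -> io::Result<()>;

      /// Write `data` starting at `offset`.
      fn write_at(&mut self, offset: u64, data: &[u8]) -> io::Result<()>;

      /// Ensure all pending writes are durable.
      fn flush(&mut self) -> io::Result<()>;
  }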

How to Interface with Block Devices

A Brief History

Historically, block devices such as hard disks, SSDs, USB sticks, etc., have usually been directly attached to a computer, and interfaces such as ATA, SCSI, and SATA were used to communicate with them. This is still the most common way today. At one point, people considered separating compute and storage. It made a lot of sense: data would not be bound to a specific compute node, and storage and compute could scale independently.

In the end, interfacing with a block device is just sending and receiving requests, so there is no reason why it has to be limited to a local bus and can’t go over a network.

Initially, people built expensive, dedicated Fibre Channel networks with special hardware that handled only storage-related traffic - the SAN was born. Later, people realized it could be more flexible and cost-effective to route storage traffic over the same infrastructure as all other traffic.

This led to:

Network Protocols

Several protocols emerged to address this need, most notably:

  • iSCSI: Essentially SCSI tunneled over IP. Although the network protocol itself is moderately complicated, it comes with decades of legacy as its functioning depends on the tunneled SCSI commands. It makes a lot of sense for what it was designed for: exporting an existing SCSI device over IP. However, if you want to make other, non-SCSI block devices available over the network, it is much more difficult - essentially, you need to emulate a SCSI device. And the SCSI command reference alone is over 500 pages long!

  • ATA over Ethernet (AoE): It was somewhat popular for a brief moment, but being Layer 2 (and thus not routable) and also bound to ATA commands, it was rather short-lived.

  • NBD: NBD is much simpler than iSCSI and not tightly coupled to an underlying storage protocol. It only has a few commands and is easy to implement. I have previously written a virtual block device with an NBD interface, and it was a mostly straightforward task. NBD is still around and in use, but it never reached the broad popularity of iSCSI, even though it actually predates iSCSI.

Virtual Block Devices

After clarifying what a block device is, it should now be clear that there is nothing stopping us from implementing one in software. In fact, most modern hardware block devices are implemented in software to a large degree. Every SSD is basically emulating a block device in its firmware. Every modern storage controller is essentially a dedicated embedded system running custom software (often on Linux) to provide block device interfaces to the underlying data. A whole industry has emerged around the term Software-defined Storage - software has really been eating the (storage) world.

How sia_vbd Works

I tried to answer this briefly in the proposal above, aiming to convey the core principles. However, due to the need for brevity, some details have been omitted. Here is a more detailed description:

Bird’s Eye View

sia_vbd implements Sia Object storage-backed Virtual Block Devices in software and exports them via nbd (and potentially iscsi in the future). It is a single process, written in Rust, as mentioned in the proposal, and runs entirely in userspace. It is supposed to run on all major platforms (Linux, macOS, Windows).

sia_vbd only implements the server (referred to as target in iSCSI terminology). Clients (referred to as initiators in iSCSI) connect to it over the network or localhost.

Once connected, the client (which typically runs in kernel space) makes the remote virtual block device appear as a local one. On Linux, when using nbd, these devices can usually be found under /dev/nbdX. These devices behave largely like local block devices. However, reads and writes are forwarded to sia_vbd over the network, where they are processed, and a response is returned. The internal workings of sia_vbd are abstracted away from the client. Internally, sia_vbd uses renterd, caching, and a local WAL (Write-Ahead Log) to manage operations. To the user, however, it appears as just a regular block device.

The Nitty Gritty

Firstly, sia_vbd is specifically designed for renterd and its properties and limitations. More specifically:

  • The fact that Sia Objects are immutable (I believe this is what you are referring to when you mentioned ‘partial writes’).
  • That latency when reading object data from renterd can be very high and is highly variable in practice.

It also tries to avoid unnecessary writes by first storing them in a local WAL and only uploading the new data to renterd once a certain amount has accumulated. These points have been mentioned in the proposal; the latency issue in particular is discussed under Risks, and I have written about it extensively in my previous project.

Now, the technical details - how can sia_vbd actually provide a block device on top of Sia Object Storage:

Data Structures

The smallest unit of data is called a Block. A Block has a fixed size (256KiB for now, though this might change during development), and its id is the hash of its content (probably Blake3). This means blocks are immutable. Writing to a block will result in a new block if the content has changed, as its hash and therefore its id will change.

Blocks are compressed and stored together in Chunks, which are just regular Sia objects. However, user metadata is used to indicate the blocks stored in the Chunk at their respective offsets. This way, we can determine the contents of a chunk via a quick (and inexpensive) HEAD request to renterd without needing to read the entire Chunk. On startup, sia_vbd will call renterd to quickly scan all the Chunk objects available at a specific bucket/prefix and then build an internal database of all known blocks and their locations.
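
To illustrate, the per-Chunk metadata could look roughly like the following Rust sketch; the field names and encoding are assumptions for illustration, not the final format:

  // Hypothetical sketch of the per-Chunk manifest kept in the Sia object's
  // user metadata, so that a cheap HEAD request is enough to learn which
  // blocks a Chunk contains. Field names and encoding are not final.
  struct ChunkManifest {
      /// Identifier of the Chunk (e.g. the UUID used in the object key).
      chunk_id: String,
      /// Blocks contained in this Chunk, in storage order.
      blocks: Vec<ChunkEntry>,
  }

  struct ChunkEntry {
      /// Content hash of the block (its id).
      block_id: [u8; 32],
      /// Byte offset of the compressed block within the Chunk object.
      offset: u64,
      /// Compressed length of the block in bytes.
      len: u64,
  }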

Multiple blocks (256 for now, but subject to change) are grouped into a larger structure called a Cluster. A Cluster is a Merkle tree with its leaves being the ids (hashes) of the underlying blocks. The id of the Cluster is derived from the Merkle root of this tree. This has the same effect seen with Blocks—every write creates a new version of the affected Blocks and thus automatically creates a new version of the Cluster because it affects the Merkle tree and consequently the Merkle root.

Multiple clusters together form the State (still looking for a better name) of the block device. This is constructed in the same way as a Cluster, but instead of having Block ids as leaves, it is composed of Cluster ids. Here is some ASCII art to illustrate:

Block Device State (acac3b00518f.. | id = Root of Cluster Merkle Tree)
 ├─...
 ├─Cluster 6 (e882e6f6f8.. | id = Root of Block Merkle Tree)
 │  ├─Block 0   (e4fa1555ad.. | id = Content Hash)
 │  ├─Block 1   (b4b9b02e6f..)
 │  ├─...
 │  └─Block 255 (afe04867ec..)
 ├─Cluster 7 (200eee8955..)
 │  ├─Block 0   (9e1e0aba56..)
 │  ├─Block 1   (7cce966f55..)
 │  ├─...
 │  └─Block 255 (60cf670d17..)
 ├─...
 └─Cluster 15
    └─...
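
In Rust terms, the hierarchy could be modeled roughly like this (a sketch only; names, sizes, and the exact hash function are not final, as noted above):

  // Rough sketch of the Block / Cluster / State hierarchy, assuming
  // 256 KiB blocks, 256 blocks per cluster, and a 32-byte hash as id.
  type Id = [u8; 32]; // content hash or Merkle root (probably Blake3)

  struct Block {
      id: Id,        // hash of `data`
      data: Vec<u8>, // up to 256 KiB of device data
  }

  struct Cluster {
      id: Id,             // Merkle root over `block_ids`
      block_ids: Vec<Id>, // 256 leaves, one per Block
  }

  struct State {
      id: Id,               // Merkle root over `cluster_ids`
      cluster_ids: Vec<Id>, // one per Cluster of the device
  }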

This structure allows us to efficiently manage the State of large block devices, e.g.:

  • Block size: 256 KiB
  • Block Id: 32 bytes (256 bit)
  • Cluster size: 256 Blocks
  • Clusters per GiB: 16
  • Cluster Id: 32 bytes (256 bit)
  • State data size: 32 bytes * 16 = 512 bytes

Given this structure, we can keep track of the State using only 512 bytes per GiB.

Reading

Here is an example of how a Read Request would be handled:

  1. A Read Request is received with the details: read 1024 bytes at offset 536871661.
  2. First, we map the offset to the corresponding cluster: offset 536871661 is in cluster 8 (ID: ae3a3358b3459c761a3), with a relative offset of 749 (see the mapping sketch after this list).
  3. We locate the corresponding block(s): cluster offset 749 is in Block 0, with a relative offset of 749.
  4. We know that Block 0 in Cluster 8 has the ID ce966f5503e292a5138 (Hash of its content).
  5. First, we check if ce966f5503e292a5138 is in our local cache. If it is, we retrieve it from the cache and skip to step 10. If not:
  6. We look up the Chunk(s) that contain Block ce966f5503e292a5138 in our database.
  7. We find it in Chunk 01924cfb-83ce-7ff9-88f7-b2b739f77019 at offset 849067, with a length of 1828502 bytes.
  8. We send a download request to renterd for 1828502 bytes at offset 849067 for the object /chunks/01924cfb-83ce-7ff9-88f7-b2b739f77019.chunk.
  9. The data is downloaded, decompressed, verified, and stored in the local cache.
  10. 1024 bytes at offset 749 are read from Block ce966f5503e292a5138 and sent to the client.
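
In code, the mapping in steps 2 and 3 boils down to simple integer arithmetic. A minimal sketch, assuming the 256 KiB block size and 256 blocks per cluster mentioned earlier:

  // Map a device offset to (cluster index, block index, offset within block).
  const BLOCK_SIZE: u64 = 256 * 1024;                        // 262,144 bytes
  const BLOCKS_PER_CLUSTER: u64 = 256;
  const CLUSTER_SIZE: u64 = BLOCK_SIZE * BLOCKS_PER_CLUSTER; // 64 MiB

  fn map_offset(offset: u64) -> (u64, u64, u64) {
      let cluster = offset / CLUSTER_SIZE;
      let in_cluster = offset % CLUSTER_SIZE;
      (cluster, in_cluster / BLOCK_SIZE, in_cluster % BLOCK_SIZE)
  }

  fn main() {
      // The example above: offset 536871661 -> cluster 8, block 0, offset 749.
      assert_eq!(map_offset(536_871_661), (8, 0, 749));
  }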

In practice, this will require strategies to minimize download latencies as much as possible if we need to request the Block from renterd. I worked a lot on this when optimizing sia_nfs.

Writing

Writes are somewhat more involved:

  1. A write request is received. Details: write [0x3f, 0xff, 0xa1, …] at offset 536871661.
  2. We look up the Cluster and Block as before: Cluster 8 (id ae3a3358b3459c761a3), Block 0, Block id: ce966f5503e292a5138.
  3. At this point, we essentially perform a copy-on-write by copying the Block first, modifying the data at offset 749, and recalculating the Block hash (sketched in code after this list).
  4. The new Block hash is now: 8a7a08d7939550.
  5. Next, we append the new Block 8a7a08d7939550 and its data to the local WAL.
  6. We then update Cluster 8 and change Block 0’s id to 8a7a08d7939550. This triggers a recalculation of the Merkle root, thereby changing the id of the Cluster.
  7. Cluster 8’s id is now fb03f471c35ba13e, so we update it in the State, which leads to the recalculation of the State Merkle tree and changes the State Merkle root.
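
The copy-on-write in steps 3 to 5 could look roughly like the sketch below. The WAL type and the hash function are placeholders for illustration only:

  // Copy-on-write sketch for a write that lands inside a single existing block.
  type Id = [u8; 32];

  struct Wal;
  impl Wal {
      fn append(&mut self, _id: Id, _data: &[u8]) {
          // Placeholder: append the new block to the local WAL file.
      }
  }

  fn hash(data: &[u8]) -> Id {
      // Placeholder, NOT cryptographic: the real implementation would use
      // something like Blake3 over `data`.
      let mut id = [0u8; 32];
      for (i, b) in data.iter().enumerate() {
          id[i % 32] ^= *b;
      }
      id
  }

  fn write_in_block(old_data: &[u8], in_block_off: usize, payload: &[u8], wal: &mut Wal) -> Id {
      // 1. Copy the current block content (copy-on-write).
      let mut new_data = old_data.to_vec();
      // 2. Patch the affected byte range.
      new_data[in_block_off..in_block_off + payload.len()].copy_from_slice(payload);
      // 3. The hash of the new content becomes the new block id.
      let new_id = hash(&new_data);
      // 4. Persist the new block locally; it is uploaded when the WAL is flushed.
      wal.append(new_id, &new_data);
      // The caller then swaps the old block id for `new_id` in the Cluster's
      // leaves, which changes the Cluster's Merkle root and, in turn, the State root.
      new_id
  }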

In practice, sia_vbd will not recalculate the state on every single write. For efficiency reasons, it will do so after a number of writes or when it receives an explicit flush command.

This is why I mentioned in the proposal that sia_vbd block devices are structured somewhat like Git repositories. Essentially, every change creates a new, addressable state—much like a commit in a Git repository. This design works even though Sia’s object store is immutable, because we never overwrite anything. Instead, every modification creates a new state based on the previous one. This approach also gives us deduplication, snapshotting, and branching—essentially for free.

WAL Flushing

As mentioned above, new blocks created from writes are first appended to the local WAL. Once the WAL reaches a certain size, we flush the WAL and start a new one. Flushing means extracting the contained blocks, compressing them, and writing them to a new Chunk. We then upload this Chunk to renterd for permanent storage.

By flushing the WAL, we effectively perform “batch writes” instead of sending every single block to renterd, which would not be practical. However, this process introduces a risk of data loss if we lose an “unflushed” WAL due to hardware failure or other issues. This is a necessary tradeoff to make the system work efficiently.
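
A rough sketch of what a flush does, assuming hypothetical compress and upload helpers (the actual compression codec and the renterd upload call are not shown here):

  // Turn the blocks accumulated in the WAL into a single new Chunk.
  type Id = [u8; 32];

  struct ChunkEntry { block_id: Id, offset: u64, len: u64 }

  fn build_chunk(blocks: Vec<(Id, Vec<u8>)>) -> (Vec<u8>, Vec<ChunkEntry>) {
      let mut body = Vec::new();     // payload of the new Chunk object
      let mut manifest = Vec::new(); // goes into the object's user metadata
      for (block_id, data) in blocks {
          let compressed = compress(&data);
          manifest.push(ChunkEntry {
              block_id,
              offset: body.len() as u64,
              len: compressed.len() as u64,
          });
          body.extend_from_slice(&compressed);
      }
      // The caller uploads `body` as a new /chunks/<uuid>.chunk object,
      // attaches `manifest` as user metadata, and then starts a fresh WAL.
      (body, manifest)
  }

  fn compress(data: &[u8]) -> Vec<u8> {
      // Placeholder: a real implementation would use a codec such as zstd or lz4.
      data.to_vec()
  }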

Garbage Collection

You might have noticed that we always append data but never overwrite (which isn’t possible anyway) or delete it. Instead, sia_vbd periodically runs a garbage collector. Its job is to find Chunks that contain a significant number of orphaned Blocks. Orphaned Blocks are blocks that are not currently referenced in any active State.

When the garbage collector identifies such Chunks, it consolidates them by first extracting the Blocks we still need to keep. These Blocks are then written to a new Chunk and uploaded to renterd. Once this process is complete, it is safe to delete the now obsolete Chunks.

So, old data does get deleted eventually, but not immediately.
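
A simplified sketch of the consolidation logic; the names and the threshold are illustrative, and the actual rewrite, upload, and delete calls are only hinted at in comments:

  // Find Chunks whose share of still-referenced blocks has dropped below a
  // threshold and consolidate them.
  use std::collections::HashSet;

  type Id = [u8; 32];

  struct ChunkInfo {
      object_key: String, // e.g. /chunks/<uuid>.chunk
      block_ids: Vec<Id>,
  }

  fn collect_garbage(chunks: &[ChunkInfo], live: &HashSet<Id>, min_live_ratio: f64) {
      for chunk in chunks {
          if chunk.block_ids.is_empty() {
              continue;
          }
          let live_blocks: Vec<&Id> =
              chunk.block_ids.iter().filter(|id| live.contains(*id)).collect();
          let ratio = live_blocks.len() as f64 / chunk.block_ids.len() as f64;
          if ratio < min_live_ratio {
              // 1. Re-pack `live_blocks` into a new Chunk and upload it to renterd.
              // 2. Only after the upload succeeds, delete `chunk.object_key`.
          }
      }
  }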

That’s it for now

I hope this makes things clearer. Apologies for the lengthy post, but I wanted to make sure I explained everything in detail.

If you’re interested in reading more about this, there is an interesting paper on a similar system:

While it doesn’t work exactly the same way as sia_vbd, many of the core concepts are the same or at least very similar.

Hope this makes the inner workings of sia_vbd a bit clearer. Please keep asking if something is unclear, I appreciate the interest!

Ok, so you are going to enumerate blocks and save 1 block = 1 file. For example:
Block 0x320593 would be /sia_bucket/0x320593 file?

You mentioned you are going to use an NBD, so I am guessing you are going to write the NBD client (user space software) and server (this is where renterd is) and use the NBD kernel module to do the interface.

I found a very easy explanation of how NBD works here: linux NBD: Introduction to Linux Network Block Devices | by Kozanoglu Aysad | Medium

I also found this: nbdkit / nbdkit · GitLab

It seems it is an NBD server with a plugin architecture.

No, this is not how it works. I’ve explained it in detail above.

No, sia_vbd implements the server side; the nbd kernel module IS the client. It’s all been explained.

I suggest reviewing the proposal and my previous response, it should clarify everything.

Ok, so you mention you are going to replicate the way Git works. But at the end of the day, what I will see on the web interface of my renterd will be files that represent the cluster-block of data that NBD uses, such as:

/chunks/01924cfb-83ce-7ff9-88f7-b2b739f77019.chunk .

6. We look up the Chunk(s) that contain Block *ce966f5503e292a5138* in our database.
This is kind of confusing to me: why use a DB instead of a directory hierarchy? Renterd returns 404 when an object doesn’t exist, so there is less room for errors. Not to mention it will simplify your coding. Just create a function f(offset, len) = [(SIA_FILE_TO_READ, offset, len),]. If you read my siafs code (the write functions) you will see how I did it.

For example:

\sia_nbd_bucket
  \State X
    \Cluster Y
      - Block X

Or if you want to keep it simpler:
\sia_nbd_bucket
  \chunk-X-offset-y

This also has the advantage that you have the requested data together (from the wording of point number 6, I am assuming it could return more than one chunk). Renterd (by design) will already split objects into chunks and spread them across different Hostd servers, so I think there is no need to split on top of that.