Why is Sia hostd storing data on a "volume" which is actually just a huge file? Isn't that inefficient?

brainstorm · April 20, 2024, 4:12pm

My understanding of Sia storage is that it uses volumes, which are actually huge files. Is that correct? Obviously, the data in that file is structured in some way, since transfers and stored files are at least chunked. So basically each volume file constitutes a filesystem.
So my question is: why use that scheme?

Filesystems such as ZFS already abstract the hardware, so whatever “volume” is created in hostd has no relation to the actual hardware. RAID, filesystem growth, caching, error correction, compression, striping across hard drives, etc.: all of that is already handled by the filesystem.

So we have a double layer of filesystems at work. That will just make things worse performance-wise and could even lead to some pathological cases.

In the case where the underlying filesystem is known to be very inefficient, and if we know hostd’s implementation is better, then OK, sure, letting it use even raw devices might make sense. That is what some databases like Oracle do, actually. But those are systems on a very different level. I think for a Linux/BSD-based system, it would be better to leverage the OS filesystem capabilities rather than reinventing the wheel.

I admit I have not looked at the code in depth (I am completely new to Go, actually), so I do not know for sure how hostd actually stores things in each volume. But I am pretty certain that if each chunk is at least several MB and is just written as a separate file, maybe in a not-too-wide directory tree, then the native OS filesystem can do just as well, if not much better.

Also, I saw some references regarding striping volumes for better throughput. For the reasons above, I think that is a complete waste of effort that would be put to much better use advancing the core Sia tools and server. It is also almost guaranteed to degrade performance on a system that already does striping at the filesystem or even hardware level.

My opinion is overall:

Use a shallow tree of individual files, starting at whatever folder is designated as the volume.
If volumes are to be kept for whatever reason (say, a small system with a couple of drives and no advanced filesystem like ZFS), then make that optional. Likewise, make any features like striping across volumes, compression, etc. completely optional.

Again, feel free to elaborate on where and how I am wrong, all opinions welcome.

Nate · April 20, 2024, 4:32pm

If you want to participate in a technical discussion, I highly recommend joining our Discord or posting directly on the repo in GitHub. Our engineering team does not check the forum often. Its primary purpose is for grants.

To answer your question, yes, hostd creates a large flat file to store sector data. There are some benefits and tradeoffs to storing data in this manner that I won’t get too deep into here. For the general case though, we found that a single flat file was easier to manage and benchmarked better for our workloads. We want to support different “volume” types in the future to take advantage of specific filesystem advantages. For instance, individual sector files make more sense on distributed filesystems, like Ceph, but that’s later in our roadmap.

brainstorm · April 20, 2024, 6:53pm

OK, where on GitHub is more appropriate? The hostd GitHub discussion page has two posts…
Or should I open a new issue?
Just wondering where the real discussion happens.
As for Discord, it seems better suited to chatting than to technical discussion, but I guess that is a matter of taste.