How Sia Works
The Sia whitepaper has been out of date for quite a while now, having been written over a year ago. It may be a while yet before an updated whitepaper is released, but this post forms the foundation of what a new whitepaper may look like. This post is still a work-in-progress, but it provides updated information that I do not believe is available anywhere else (except in the code itself).
Sia is a network for remotely storing data. Typically called 'cloud storage', the core feature is that you can put data on the network and it will be available from anywhere in the world at a later date. Putting data onto a network means that someone else - a 'host' - is going to be storing the data, and is going to be responsible for returning the data when requested. Sia makes several key assumptions about the network:
- Hosts cannot be trusted - if they are able, they will spy, steal, and cheat. Strong mechanisms must be used to discourage and prevent malice.
- Hosts are not charitable - hosts need to be paid, especially if the data is private or is large in volume. Payment must be guaranteed.
- Hosts are unstable - a single host, and even a group of hosts, is liable to go offline, even with a history of 100% reliability.
- The network is hostile - if there is a way to be abusive, someone will discover it and cause abuse.
Sia is able to safely store data on a network that has the above properties. There are three core strategies employed by Sia to ensure the safety of data. The first is encryption, which serves to protect the privacy of the data even when the hosts are trying to view the data. All data on Sia is encrypted before it is ever sent over the network, and it is only decrypted after it has been downloaded. The hosts will never be able to view decrypted data. The second strategy is redundancy. Data is not given to one or two or three hosts, but instead a myriad of hosts. Using erasure coding techniques such as Reed-Solomon coding, a high reliability can be achieved even without a high redundancy. The final strategy is to align the incentives of the hosts by paying them only if they store the data, but also by guaranteeing that they will get paid for storing the data, even if the renter is not online to make the payment. This can be achieved using a file contract, and a file contract can be achieved using a blockchain.
A file contract is an agreement between a renter and a host. The renter agrees to pay the host for storing a file, and the host agrees to store the file for a certain period of time. The renter and the host both put money into the file contract at the beginning. The money from the renter will be payment for the host after the contract is fulfilled. The money from the host is collateral that the host will forfeit if the contract is not completed. The file contract goes onto the blockchain, which will serve as escrow. When the file contract is over, the host must provide a proof of storage to the blockchain proving that the file is still being stored. After the proof of storage is provided, the host's collateral is returned and the renter's payment is made to the host. If the proof of storage is not provided in time, the money is forfeit.
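The escrow flow described above can be sketched in a few lines. This is a hypothetical, simplified model for illustration only - real Sia file contracts are blockchain transactions with more fields, and the names below are invented for the sketch:

```python
from dataclasses import dataclass

@dataclass
class FileContract:
    """Simplified model of a Sia file contract held in blockchain escrow."""
    renter_payment: int   # coins the renter commits as payment for storage
    host_collateral: int  # coins the host commits, forfeit on failure

def settle(contract: FileContract, proof_submitted: bool) -> dict:
    """Distribute the escrowed funds when the contract period ends.

    If the host submits a valid storage proof in time, it receives its
    collateral back plus the renter's payment. If not, the funds are forfeit.
    """
    total = contract.host_collateral + contract.renter_payment
    if proof_submitted:
        return {"host": total, "forfeit": 0}
    return {"host": 0, "forfeit": total}
```

The key property is that neither party needs to trust the other: the blockchain itself enforces the payout based on whether the storage proof appears in time.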
The file contract is enough to provide strong incentives for the host to keep the file. Keeping the file provides a financial income, and losing the file results in a financial penalty. For the act of storing data, a combination of encryption and the file contract covers the first two bullet points (hosts cannot be trusted and are not charitable). There is still no guarantee that the host will not hold the data hostage; protections against this are discussed later.
The renter does not want to rely on a single host, even with all of the financial incentives and commitments in place. The inescapable truth is that a single host is always at risk of unexpected downtime or failure (even if the host is trustworthy). This risk can be minimized by storing the data on multiple hosts. If the full data is stored on 3 hosts, then all 3 hosts would need to go offline simultaneously in order for the data to be lost. It turns out, however, that we can do substantially better than 1-of-3. Reed-Solomon coding provides a way to store data such that any M of N hosts can be used to recover the data, with a redundancy of only N/M (which is theoretically perfect). Instead of 1-of-3, we can do 10-of-30 for the same redundancy of 3x. Switching to 10-of-30 gives us enormous reliability benefits - the chances that 21 drives out of 30 fail are substantially lower than the chances that 3 out of 3 drives fail. It turns out that if your hosts have 95% uptime, a 10-of-30 scheme provides a file uptime exceeding 99.999999999%. This math does assume that the hosts fail independently, but carefully selecting hosts by region should provide reasonable independence. If the hosts have 98% uptime (allowing for 30 minutes of downtime every day, or 15 hours every month), the file can hit 99.999999999% with an 18-of-30 scheme, only 1.66x redundancy. This allows for incredible cost savings, greater insulation against attackers, and provides a large pool of hosts that can be used to download files with high parallelism.
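The reliability numbers above are straightforward to check. Treating each host as independently online with some fixed probability, the availability of an M-of-N file is the binomial probability that at least M of the N hosts are up:

```python
from math import comb

def file_availability(m: int, n: int, host_uptime: float) -> float:
    """Probability that at least m of n independent hosts are online,
    i.e. that an m-of-n erasure-coded file can be recovered."""
    return sum(comb(n, k) * host_uptime**k * (1 - host_uptime)**(n - k)
               for k in range(m, n + 1))

# 1-of-3 full replication vs. 10-of-30 Reed-Solomon, both 3x redundancy:
print(file_availability(1, 3, 0.95))    # 0.999875 (only "three and a half nines")
print(file_availability(10, 30, 0.95))  # > 0.99999999999 (more than eleven nines)
# 18-of-30 at 98% host uptime, only 1.66x redundancy:
print(file_availability(18, 30, 0.98))  # > 0.99999999999
```

Note how dramatic the difference is at identical redundancy: full replication at 3x gives fewer than four nines, while 10-of-30 exceeds eleven nines under the same independence assumption.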
This redundancy also provides insulation against hosts who may try to hold data hostage. In a 10-of-30 scheme, you only need 10 hosts to recover your data. Downloads on Sia are paid, which means a host gets revenue every time you download data from them. If 1 or even 15 hosts are malicious and try to hold data hostage, they can be fully ignored and the non-malicious hosts can be used instead. This has a direct opportunity cost for the malicious hosts - they lose revenue from the downloads. Even further, renters will blacklist hosts whose download prices are consistently too high. The combined pressures of being unlikely to succeed, losing out on immediate revenue, and losing out on future revenue (in the form of future uploads + downloads) mean that hosts are unlikely to perform this attack (and even if they do, it's not a big deal - just ignore them and use the honest hosts). Highly paranoid renters can get further protections by using a 3-of-30 or even something like a 2-of-100 scheme (which has high redundancy overhead) to protect their most sensitive files. In all likelihood though, 10-of-30 is already sufficient even for the most sensitive files. In practice, we've seen files maintain perfect reliability even during buggy prototype releases where average host reliability was below 50%.
Renters continuously observe the blockchain and the network to verify the uptime and reliability of hosts. Renters have a strong preference for hosts that are reliable, fast, and low-cost. Additionally, the renter typically only ever uploads to a small percentage of the total number of hosts. This creates a heavy pressure on hosts to perform better. The exact algorithm is still being determined.
At this point we've covered our bases for 3 of the original points (untrustworthy hosts, non-charitable hosts, and unstable hosts), both for uploading data and for being able to retrieve it. The vast majority of Sia is heavily protected against malicious attackers. The Sia blockchain very closely resembles the Bitcoin blockchain, preserving the Proof-of-Work consensus mechanism, preserving the 10-minute block times, and in general copying the Bitcoin blockchain wherever possible. A few well-known bugs (such as transaction malleability) have been fixed, but otherwise the design decisions of the Sia blockchain match the Bitcoin blockchain as much as possible. A strong form of encryption is used (Twofish with 256-bit keys), and all protocols in Sia assume that the other party is going to start behaving maliciously at any moment.
There is only one significant remaining problem, which is host selection. Renters are expected to choose their own hosts, and an attacker can attempt to manipulate the renter's selection criteria in a number of ways, including by setting the price really low and by performing a Sybil attack. A Sybil attack is an attack where a single person (the attacker) pretends to be many. Online, a single person can fairly easily pretend to be 10 or even 10,000. In Sia, this means that an attacker might be able to spin up 10,000 machines each pretending to be an honest host, and then take advantage of renters.
A key part of Sia's approach to stopping Sybil attackers is proof-of-burn. Hosts burn coins by sending them to a provably unspendable address. Hosts are expected to burn a portion of their revenue (~4%) as a demonstration that they are real. Renters will select hosts that have burned coins with a probability that grows linearly with the total number of coins burned. Therefore, a host that has burned 2x as many coins will be twice as likely to be selected as another host that has all other factors the same. This provides a very important defence against Sybil attacks. An attacker that is trying to manipulate a renter will need to control all of the excess redundancy of a file before being able to commit an attack. For a file with 3x redundancy stored as 10-of-30, that means the attacker will need to control at least 2.1x of that redundancy (21 of the 30 pieces), which means the attacker will need to burn enough coins to look like 70% of the network. That entails burning roughly 2.3x as many coins as the rest of the network has burned combined. Especially as the network grows and matures, collecting that many coins is going to be prohibitive. Unlike proof-of-stake systems, it's not sufficient to just collect those coins, they actually need to be burned, which means there's no chance of recovering that investment. While not wholly infeasible, performing a Sybil attack on Sia should be more expensive than performing a 51% attack on Bitcoin. Even better, paranoid renters can protect themselves more fully by using a higher redundancy. Renters storing at 10x redundancy are safe unless the attacker looks like 91% of the network, requiring the attacker to burn roughly 10x as many coins as the rest of the network combined to be successful.
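The burn requirement is easy to sanity-check. If selection probability is linear in coins burned, an attacker who must appear as a fraction f of the total burn has to burn f/(1-f) times what the rest of the network has burned combined:

```python
def burn_multiplier(target_fraction: float) -> float:
    """Coins the attacker must burn, as a multiple of everything the
    rest of the network has burned combined, in order to control
    `target_fraction` of the burn-weighted selection probability."""
    f = target_fraction
    return f / (1 - f)

# 10-of-30 (3x redundancy): the attacker needs 21 of the 30 pieces,
# i.e. 70% of selections.
print(burn_multiplier(21 / 30))   # ~2.33x the rest of the network combined
# 10-of-100 (10x redundancy): the attacker needs 91 of the 100 pieces.
print(burn_multiplier(91 / 100))  # ~10.1x
```

Because f/(1-f) blows up as f approaches 1, modest increases in redundancy impose rapidly growing burn costs on the attacker.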
While the exact algorithm for selection has not been finalized, most of the criteria are understood:
- Hosts are given a score, and then selected randomly according to their score
- The score goes up linearly with the number of coins burned - a linear relationship is required to prevent Sybil attacks
- Hosts are heavily penalized if their uptime is below 95%, and there is no significant advantage to having an uptime greater than 99%. That is because the trust model explicitly assumes that no host is more reliable than 99% - even when the historic reliability is there, the chance of betrayal or malice is always present. 99% reliability across a 20-of-30 scheme is far more than sufficient to guarantee overall file reliability.
- Hosts are penalized exponentially for having a price that is higher than expected, but are not preferred exponentially for having a price that is lower than expected. 'Expected' is still undefined, but will likely be determined based on the real world cost of hard drives. Low prices cannot be preferred exponentially because this leaves room for Sybil attackers to get preference by setting the price too low. A host that is 2x the reasonable cost will have 1/32 the score, but a host that is 1/2 the reasonable cost will likely only have 2x the score.
- Host scores are increased linearly with the amount of collateral provided on the data, and a minimum amount of collateral is required (to insulate against things like price volatility).
- Hosts that demonstrate dishonesty are blacklisted.
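The criteria above can be combined into an illustrative scoring function. Only the shapes stated in the list are taken from the text (linear in burn and collateral, exponential price penalty with 2x price giving 1/32 the score, linear preference for low prices); the specific penalty factor for low uptime and the function signature are assumptions made for this sketch:

```python
def host_score(burn: float, uptime: float, price: float, expected_price: float,
               collateral: float, min_collateral: float,
               blacklisted: bool) -> float:
    """Illustrative host score; hosts are then selected randomly,
    weighted by score."""
    if blacklisted or collateral < min_collateral:
        return 0.0
    # Linear in both coins burned and collateral offered.
    score = burn * collateral
    if uptime < 0.95:
        score *= 0.01  # assumed heavy penalty for unreliable hosts
    # (no bonus above 99%: the model never trusts a host beyond 99% uptime)
    ratio = price / expected_price
    if ratio > 1:
        score /= 2 ** (5 * (ratio - 1))  # 2x the expected price -> 1/32 score
    else:
        score /= ratio                   # 1/2 the expected price -> only 2x score
    return score
```

A renter would then draw hosts with something like `random.choices(hosts, weights=scores, k=30)`, so that scores set probabilities rather than a strict ranking.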
The selection algorithm is not a part of any protocol, but is instead determined on a per-renter basis. This means that as our understanding of selection strategies improves, we can push out updates to renters that do not break compatibility with the rest of the network. It also means that renters with special needs (such as EU-only due to data regulations) or heightened paranoia are able to use different selection strategies without friction.
I do encourage people to ask questions.
Thank you for tirelessly working to advance decentralized technology. I think we are at a very important point in history. Technology has begun overcoming the barriers of nation states. Now people can do business globally without blind trust in, or regulation by, governments. It is a big step. I hope these decentralized technologies and organizations will help us humans live in an even more productive, healthy, collaborative, and peaceful society.
I am currently researching the decentralized cloud storage systems on the market and found Sia. I am very interested in Sia and want to figure out how to participate. What kinds of skills or resources do you need most right now to take Sia to the next level?
Another question relates to the cost of hosting services.
It seems that many decentralized cloud storage projects promise stronger security and lower cost than mainstream storage services such as Dropbox or Amazon. I am not a technical person, but I can understand that the nature of decentralized and crypto technologies provides resistance to censorship and promotes privacy.
Something I haven't understood is how this technology promotes lower cost. Where will the cost reduction come from? Sure, we can start by renting out the extra space on our computers, the way Bitcoin mining started. We are already paying for internet and a PC, so there is little extra overhead beyond the electric bill. So I understand there is some cost advantage in the short term.
You have a vision that some day Sia will even be able to host big companies like Netflix. At that point, I assume economies of scale will kick in and most of the hosting will be done by large hosts. That happened to Bitcoin. So, at that point, what is the cost advantage of Sia versus big centralized storage providers like Amazon, Google, or Dropbox?
Thanks for the details. I have two questions:
Is there any decay logic when calculating the burned coins for host selection? Otherwise, later joiners may be treated unfairly.
Is there any throughput penalty from encrypting uploads and decrypting downloads?
Hi, thanks for the questions.
What kinds of skills or resources do you need most right now to take Sia to the next level?
This is a question that really deserves its own thread, and maybe even its own board. I hope to answer this question in much greater depth in the coming week or two. At a high level, Sia needs more testing and lower resource consumption, but most importantly I think it just needs more testing and review. The best place to start is to look at the docstrings for all the packages and find something that looks interesting to you. Then start reading the code, and make comments/posts/messages on things you find interesting. Ask questions! If a docstring isn't helpful enough, bug us until we write a better one. If some code looks funky and you know a better way to write it, make a pull request!
The one caveat is consensus-critical code (mostly meaning the encoding package, and the consensus package). Those really can't be touched because it would introduce consensus risk.
Something I haven't understood is how this technology promotes lower cost.
Sia creates an open marketplace for cloud storage, one where the only rules that matter are how efficient and how good your storage capabilities are. If you can offer better storage at a lower price, people will choose you and not your competitors. You will have more income, and your competitors will go out of business. This competition is always active, and always brutal, meaning innovation is very strongly supported and prices will constantly be driven into the ground.
Additionally, Sia's early life will be composed largely of leftover drives or recycled drives from other uses. These drives can be offered at much lower prices because there's very little cost to using them (the most expensive part of a drive is buying it, not maintaining it or powering it). In the long term, though, the cost advantage comes from the fact that people will be able to make money purely by having the best drives, without needing to build up a reputation or do any marketing.
In the beginning, I don't think there will be decay logic. But I think it would make sense to have burn start decaying after 6 or 12 months. New hosts will be required to do some early-burn (or offer storage for virtually free, or some hybrid combination), because that's how we protect against Sybil attacks.
Is there any throughput penalty from encrypting uploads and decrypting downloads?
The encryption libraries that we use are very, very fast. The encoding and encryption combined should be running faster than 100mbps, so unless your upload speeds and download speeds are above that, you won't notice any performance hits. If they are above that, you may notice that you can't saturate because you are CPU-bottlenecked. I think there is room to add parallelism (the libraries may already be parallelized?), so industry-grade needs can compensate by using servers with more cores.
The encoding and encryption combined should be running faster than 100mbps, so unless your upload speeds and download speeds are above that, you won't notice any performance hits.
Already some home-level connections reach 1Gbps speeds. I doubt those speeds are attained in practice in most circumstances, but with something like Sia where you are typically downloading from multiple peers at the same time, they could theoretically be... so perhaps this limitation could quickly go from theoretical to practical. Just a potential issue to consider.
Thanks for this. I have a few questions.
The renter and the host both put money into the file contract at the beginning
Does this mean that in order to be a host, you need to have Siacoin first?
I was trying out Sia UI yesterday, left it running for half a day with 0 Siacoins but I think I got 3 contracts. Why is that?
How often does a host get paid?
If a host is also required to lock in collateral, and assuming a host is getting new contracts all the time, how would a host that one day decides to leave do so without losing its collateral?
How do I find out more about the contracts that I receive? I got 3, but how do I know the price and duration of the contracts I got?
Sia is still in beta, as such things on the network are pretty slow. The most important bottleneck today is that uploads are very slow, and tend to lock up a ton of coins. As such, almost nobody uploads to the network (though you can if you are willing to wait a day or so per GB, and if you are willing to overpay by about 100x).
Today, you do not need money to be a host, however that will change soon after we release the 1.0 product. Hosts will be much more successful if they allocate money to put up as collateral. It's annoying to get siacoins to join as a host, but it means that you are committed. A host that joins the network and then leaves after 2-3 weeks is actually very detrimental to file integrity, and because of hosts like these we need to have a very high redundancy today. In the future, it may take a few weeks for a host to get started, as renters will need to learn to trust the host. After the host has been around for a while, then the host will earn trust (renters will be continually probing the host with uptime checks) and renters will start to use the host. We will also have measures to prevent early hosts from becoming entrenched.
Today, 3 contracts per day sounds about right.
A host will usually get paid after 12 weeks. Once you've been running for 12 weeks, you should start to see steady income.
If a host leaves suddenly, the host will lose all of their collateral. This is by design, as we do not want hosts to leave suddenly. If the host wants to leave gracefully, the host should stop accepting new file contracts and then wait about 12 weeks. After 12 weeks, the host will have no more open file contracts, and will be able to leave without losing any collateral.
Right now, there's not much you can do to find out more about the contracts that you have. We'll be adding more information as Sia continues to evolve, but in the near term we are most heavily focused on improving the security and the upload speeds. We should have major upgrades ready in the next 2-3 weeks.
@Taek Thanks for your reply.
Looking forward to the next upgrades.
Right now, for a new user who just got into Sia, like me, the conflicting information out there can make things rather confusing to grasp. I do understand that it's unavoidable until things start to stabilize.
All the best! Right now, as a developer who's interested in making Sia applications, do you think I should wait a few more releases or is it good to start developing today?
@in-cred-u-lous has made a few applications for Sia already, as have a few other users. He can give you an idea of what it's like as a third party developer. Parts of Sia are unstable, but we're planning on releasing the stable version on June 7th 2016, that's really just around the corner.
I would encourage you to get started on an application today, because then you can give us feedback before we lock down the api. Depending on how heavily you use the storage features, the application may not work very well on the current version but we plan on having something much better in the next 2-3 weeks.
We do not plan on changing the API very dramatically between now and the June 7th release. We may break one or two calls, but it should happen in a way that is relatively easy for you to fix. After June 7th, we are committing to full backwards compatibility for our API, which means that developers will have the green light to release the stuff they are working on, and can have confidence that users on future versions of Sia will still be able to use their applications.
What sorts of applications did you have in mind?
I just discovered Sia, the decentralized cloud storage...
This is the future! Perhaps it will not be able to replace centralized cloud storage, but it will be a great alternative!