Questions



  • The encryption (twofish) is independent of a 51% attack, right? So if a 51% attack was successful, one still wouldn't get access to any of the data? What could one do with such an attack?

    The blog mentions 10-of-30 redundancy scheme vs. 1-of-3 redundancy scheme, and reliability. How is reliability measured, and what does 10 of 30 vs. 1 of 3 mean? Since hosts have to prove that they have the data, presumably the requests for proof occur at random times?

    And I saw reference to hosts choosing which clients they want, and vice-versa, was this for an earlier version? From the screenshots it seems that the selections are now all automatic, and the software takes care of the reliability of the hosts, and so on?


  • admins

    Hi @bmlan, welcome to the Sia forum! You are asking some good questions.

    The encryption (twofish) is independent of a 51% attack, right? So if a 51% attack was successful, one still wouldn't get access to any of the data? What could one do with such an attack?

    A 51% attack refers to an attack on the consensus mechanism. The data is stored separately, so the data would not be affected by 51% attacks. Sia secures data payments + collateral through contracts, and those contracts depend on the consensus. So someone with a 51% attack could censor contracts, censor storage proofs, or even undo existing storage proofs. They could not however submit storage proofs for data that they did not have. This is a pretty significant attack, but at the very least the data encryption would hold.

    The blog mentions 10-of-30 redundancy scheme vs. 1-of-3 redundancy scheme, and reliability. How is reliability measured, and what does 10 of 30 vs. 1 of 3 mean? Since hosts have to prove that they have the data, presumably the requests for proof occur at random times?

    1-of-3 means the data is on 3 machines, and you need at least 1 of those machines to recover the data. If we assume that each copy is 90% reliable, the probability of failure is (1-0.90)^3, or about a 1-in-1,000 chance of losing the data. 10-of-30 means that the data is redundantly spread across 30 machines, and any 10 of those 30 machines can recover the data. The probability that you will have 21 or more failures and be unable to recover the data can be computed with the sum below:

    http://www.wolframalpha.com/input/?i=p%3D0.9;+sum+((30+choose+i)(p^i)((1-p)^(30-i)),+i%3D0+to+10

    Instead of a 1-in-1,000 chance of losing the data, you've got roughly a 1-in-100,000,000,000,000 chance of losing the data. This of course assumes that each host is independently at least 90% reliable. At this point in network maturity, we aren't actually seeing that much reliability, which is why the redundancy is so high. But the math nonetheless demonstrates the point: spreading the same 3x redundancy across more hosts is a very powerful way to improve reliability.
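
    If you want to check the numbers yourself, here is a quick sketch. It uses the same illustrative assumption as above (each host independently 90% reliable), which is not a measured figure:

```python
# Quick check of the figures above, assuming each host is independently
# 90% reliable. A 10-of-30 file is lost only if fewer than 10 of the 30
# hosts still hold their piece.
from math import comb

p = 0.9  # assumed per-host reliability

# 1-of-3: lost only if all 3 full copies fail.
print((1 - p) ** 3)  # 0.001 -> about 1 in 1,000

# 10-of-30: lost if 9 or fewer of the 30 pieces survive.
loss = sum(comb(30, i) * p**i * (1 - p)**(30 - i) for i in range(10))
print(loss)  # ~6e-15 -> very roughly 1 in 100 trillion
```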

    And I saw reference to hosts choosing which clients they want, and vice-versa, was this for an earlier version? From the screenshots it seems that the selections are now all automatic, and the software takes care of the reliability of the hosts, and so on?

    The selections have actually always been automatic, because we wish to make the user's life easier. We plan on adding in some optional tuning at some point, but we haven't gotten that far yet. Sia is still early. The tuning that we'll be adding includes:

    • favoring faster hosts over cheaper hosts
    • blacklisting certain hosts or regions
    • specifying that redundancy be spread over at least N continents
    • specifying that at least 1 professional datacenter be used to safeguard data

    And perhaps other similar options. These are probably 3-6 months away from shipping, though.



  • Thanks. This is a very interesting idea.

    Regarding your example of x of n, if I am storing 1 file and 30 copies of it exist, then any of the 30 would be able to recover it? Are you saying that the 1 file is divided into pieces and spread out over 30 computers? And the algorithm is designed such that any 10 of the 30 would have access to all of the pieces of the file (if you were aiming for 10 of 30)? So still only 1 computer could be used to recover the data.

    Using the 1 of 3 example with 90% reliability and a 1 in 1000 chance of failure, it seems that this is an overestimate because it is not looking at the time dimension. If a particular host is hosting a file for 24 hours, how many times is it queried during that period? If it is queried sufficiently frequently that this 24-hour period could be treated as a unit, does your reliability formula mean that the file (or fragment) is intact 90% of the time? In which case, it matters when the 10% downtime occurred, and when it occurred for the other computers that are hosting it? I suppose that if a file was deleted or a host computer went offline, another copy would be made automatically to maintain the number?

    In my opinion, optional tuning should be restricted as much as possible. What would be the rationale for blacklisting certain hosts or regions? If it is for unreliability, the algorithm could figure that out. Redundancy over n continents could be automatically required. Your metrics could determine how "professional" one's datacenter is, and based on loss statistics, the necessity of having one professional datacenter could be determined and implemented. I can see that some users might want to prioritize speed, so that could be an optional factor.


  • admins

    Regarding your example of x of n, if I am storing 1 file and 30 copies of it exist, then any of the 30 would be able to recover it? Are you saying that the 1 file is divided into pieces and spread out over 30 computers? And the algorithm is designed such that any 10 of the 30 would have access to all of the pieces of the file (if you were aiming for 10 of 30)? So still only 1 computer could be used to recover the data.

    If you are storing 1 file and thirty full copies exist, you are using a 1-of-30 scheme. We use a scheme called Reed-Solomon coding. In a 10-of-30 scheme, the file is split into 30 pieces, any 10 of which are enough to recover the original file, and the total overhead is only 3x. If it sounds like magic, that's because it pretty much is. You can read more about it on Wikipedia, but basically it's super cool, and it's not even new technology - this stuff was invented in the 1960s. It's used in Wi-Fi, on CDs, and for satellite communications iirc. It's a very old, well-trusted, and highly-regarded algorithm.
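
    If you want to see the idea in miniature, here is a toy sketch of the same principle. This is not Sia's actual code (Sia's implementation is an optimized Reed-Solomon coder working on real data pieces); it just shows why any 10 of 30 shares are enough: the 10 data symbols define a polynomial, the 30 shares are evaluations of it, and any 10 evaluations pin the polynomial back down.

```python
# Toy Reed-Solomon-style erasure code over the prime field GF(257).
# Conceptual sketch only; real implementations work over GF(2^8) on
# whole data pieces, but the recovery principle is the same.

P = 257  # small prime field; each symbol is a single byte value (0-255)

def interpolate_at(points, x):
    """Evaluate at x the unique degree < len(points) polynomial that
    passes through the given (xi, yi) points, working mod P."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P  # modular inverse; Python 3.8+
    return total

def encode(data, n):
    """Turn k data symbols into n shares; share i is the polynomial's value at i."""
    k = len(data)
    base = list(enumerate(data))  # shares 0..k-1 carry the data itself
    parity = [(i, interpolate_at(base, i)) for i in range(k, n)]
    return base + parity

def decode(shares, k):
    """Recover the k original symbols from any k surviving (index, value) shares."""
    subset = shares[:k]
    return [interpolate_at(subset, i) for i in range(k)]

# 10-of-30: 10 data symbols, 30 shares, 3x overhead.
data = [ord(c) for c in "hello sia!"]  # 10 symbols
shares = encode(data, 30)
survivors = shares[17:27]              # pretend only these 10 hosts are still online
assert decode(survivors, 10) == data   # any 10 shares reconstruct the file
```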

    it seems that this is an overestimate because it is not looking at the time dimension.

    You would want to include the time/usage when you decide what counts as reliable. So, if you are saying that a disk is 90% reliable, that basically means there is only a 10% chance that the disk fails between one checkup-plus-recovery and the next. The same exact hardware will then get a different reliability rating depending on how frequently you check it. If you check it every day and it only takes 1 day to repair a broken disk, the disk might be rated at 98% reliability. But if you only check once a year, that same exact disk might have something closer to 80% reliability.

    On Sia we require users to do reliability checks, etc. every 6 weeks. We try to get the hosts to hit between 95% and 98% reliability over that period.
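
    As a rough illustration of how the check interval changes the rating (the per-day figure below is made up and repair time is ignored; this is not Sia's actual model):

```python
# Rough illustration only: the same disk, with a fixed chance of dying on
# any given day, gets a very different reliability rating depending on how
# long the window between checks is.
daily_failure = 0.0006  # hypothetical: ~0.06% chance the disk dies on a given day

def window_reliability(days):
    """Probability the disk survives an entire check window of `days` days."""
    return (1 - daily_failure) ** days

print(window_reliability(1))    # ~0.999 -> checked (and repaired) daily
print(window_reliability(42))   # ~0.975 -> checked every 6 weeks
print(window_reliability(365))  # ~0.80  -> checked once a year
```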

    In my opinion, optional tuning should be restricted as much as possible. What would be the rationale for blacklisting certain hosts or regions?

    Regulation and data law can unfortunately get in the way. For example, some EU laws require that your data be stored exclusively on EU servers. Also, some people just have a fundamental bias against certain regions or countries.

    But regional stuff also comes into play massively once we start doing content distribution. For example, a Netflix movie may be more popular in India than in other parts of the world. If this is true for a particular file, it definitely makes sense that Netflix would have most of the redundancy in India for speed reasons, even if it's actually being uploaded from the US. Sia's ability to let the user control where in the world a file ends up will eventually be a huge asset.



  • Let's say I back up my HD using Sia. If the reliability of the hosts is checked every 6 weeks (and all hosts are checked on that 6-week cycle), what assurance is there that my data will be intact and recoverable in 2 weeks, 3 weeks, or 4 weeks? Is the idea that since you are using a 10 of 30 scheme, and the times the hosts are checked are random (if they weren't random, then presumably the system could be gamed somehow), at any given time there will be many copies of the data?

    How complicated would it be to use Sia for this purpose (keeping a copy of my current HD files, updating at some interval, similar to how current commercial backup software works)?



  • Regarding using Sia for backup:

    There is an ongoing project for a Sia Windows drive; maybe this is something for you.
    http://forum.sia.tech/topic/861/windows-storage-driver

    Other than that, if you want to use Sia for backup:

    1. Make sure to keep copies of your Sia renter folder and your seed; without them you cannot retrieve your files.
    2. Use a backup program that creates immutable archives, i.e. backup files that are created but never changed afterwards. Duplicity works like this, for example, and Windows' built-in backup possibly does too.
    3. Create a simple script that uploads all the files in the backup archive folder to Sia; just try to upload every file, and Sia will ignore any that are already uploaded (see the sketch below).
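
    A minimal sketch of step 3, assuming `siac` is on your PATH and that your version supports `siac renter upload <source> <siapath>` (check `siac renter --help`); the backup folder path is just a placeholder:

```python
# Try to upload every file in the local backup-archive folder to Sia.
# Assumes the `siac` CLI is installed and a renter is set up; files that
# are already uploaded are simply rejected, so errors are not treated as fatal.
import subprocess
from pathlib import Path

BACKUP_DIR = Path("/backups/duplicity-archives")  # hypothetical archive folder

for f in sorted(BACKUP_DIR.rglob("*")):
    if f.is_file():
        siapath = f.relative_to(BACKUP_DIR).as_posix()  # destination path on Sia
        subprocess.run(["siac", "renter", "upload", str(f), siapath], check=False)
```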


  • Is the cost based on the total data storage, or on how much I upload? If I wanted to mirror my HD, I could then calculate cost as total initial storage + incremental changes.


  • admins

    I don't fully understand the question, but basically you are only charged for what you are using. If you are doing incremental backups, then you will only be increasing the amount of storage by a tiny bit each day.

