Thoughts on data permanence



  • With the BitTorrent protocol, at least one peer needs to hold each chunk of the data so that 100% of the chunks are available.

    We also see that with torrents, sometimes a torrent can enter into an incomplete state, where one or more chunks are offline - in some cases never to return.

    My question to the SIA community is: is there ANY chance of data impermanence using SIA (this probably applies to STORJ and similar services too)?

    Earlier this morning I thought to myself that perhaps SIA could keep a centralized 'backup' of everything, in case there ever was a loss from what I can't help calling 'last node syndrome' (there is probably an existing term for this anomaly).

    I do realize SIA team / developers have probably put great emphasis on reaching for zero data loss.

    One COULD argue, if SIA keeps a centralized backup, aren't we back to centralized?

    I think the answer might be, decentralized for access venue, and centralized for stop loss measures to guarantee 100% data integrity.

    Whereas OneDrive, GDrive, and some proprietary services are centralized (no doubt some wrap Azure, AWS, etc. into a proprietary service/brand), I think consumers can still reach more affordable and equally reliable services through a primarily decentralized, secondarily centralized solution.

    Thoughts anyone?
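
    The torrent condition described at the top of this post can be sketched in a few lines. This is only an illustration, not BitTorrent client code: the function name `missing_chunks` and the peer names are made up for the example, and each peer is modeled simply as a set of chunk indexes it holds.

```python
def missing_chunks(total_chunks, peers):
    """Return the sorted chunk indexes that no online peer holds."""
    available = set()
    for chunks in peers.values():
        available.update(chunks)
    return sorted(set(range(total_chunks)) - available)


# Hypothetical swarm: no single peer has the whole file,
# yet together they cover every chunk.
swarm = {
    "peer_a": {0, 1},
    "peer_b": {1, 2},
    "peer_c": {3},
}
print(missing_chunks(4, swarm))  # empty list: the file is fully recoverable

# peer_c is the only holder of chunk 3. If it goes offline for good,
# the torrent is stuck in the incomplete state ('last node syndrome').
del swarm["peer_c"]
print(missing_chunks(4, swarm))  # chunk 3 is now missing
```

    The point of the sketch is that completeness depends on the union of what the peers hold, not on any one peer having everything.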



  • Why do you think that a single, centralized point of failure would act as a better protection against data loss than simply fine-tuning the distributed redundancy until you achieve as low a probability of data loss as desired?
    0% chance of data loss is impossible in any case. But Sia already makes it so that, if you're willing to spend, you can make the chance of data loss arbitrarily low.
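
    A rough way to see why "arbitrarily low, but never 0%" holds: assume (and this is an assumption, not how Sia actually prices or measures anything) that each stored copy is lost independently with the same probability. The function names and the numbers below are illustrative only.

```python
def loss_probability(p_fail, copies):
    """Chance that every one of `copies` independent copies is lost."""
    return p_fail ** copies


def copies_needed(p_fail, target):
    """Smallest number of independent copies with loss probability <= target."""
    if not (0 < p_fail < 1) or target <= 0:
        raise ValueError("need 0 < p_fail < 1 and target > 0")
    copies = 1
    while loss_probability(p_fail, copies) > target:
        copies += 1
    return copies


# Assumed numbers: each copy has a 20% chance of disappearing.
for target in (1e-2, 1e-4, 1e-6):
    print(target, copies_needed(0.2, target))
```

    Every extra copy multiplies the loss probability by `p_fail`, so it shrinks quickly as you pay for more redundancy but only reaches zero in the limit.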



  • I did see that one can pay more for more probability of permanence.

    True, even a local drive can fail. Remote centralized cloud storage can fail too, but I would find it very peculiar if Amazon or Azure ever lost a bit. Those are centralized up front, and distributed in a VERY controlled model behind the scenes, with various mirrors set up.

    True, one could pay more for permanence.

    I can not for the life of me though think why it would be a bad idea for Sia to maintain at least one copy of 'everything'.

    Sure SIA could lose that - but it was never even really planned to be needed. A single backup would give people SOME sense of 'ok- if I don't get my data back because I didn't pay enough, at least there is a backup SIA has'.

    That's all. I just think it would be a wise idea to have at least one pass at a master backup of everything and anything ever put out to SIA drive shares.

    I really have had it with this anti-centralized thinking.

    The two models complement each other.

    Here- try this hat on - the Universe is a CENTRALIZED MODEL.



  • A galaxy is a centralized model at the core and decentralized at the perimeter.

    The solar system that allows for life - the SUN is a VERY NICE CENTRALIZED MODEL.

    Notice it's at the CENTER of things.

    The ENTIRE PLANET is a REAL NICE CENTRALIZED SOLUTION - all that MASS - CENTRALIZED.

    And lastly - the hard drive ITSELF is QUITE THE CENTRALIZED spinning SHINY disk, isn't it?

    So - enough with this Jim Jones Drink the Decentralized Koolaid.

    Anyway- I do see your point, just saying, I think the healthy model would embrace both centralized AND decentralized, lest we have that all eggs in one data distribution and storage model basket -

    Way I see it, corporate solutions provide a centralized onramp TO storage as a utility, and without question they use an atomic decentralized backend for redundancy - as in, they mirror 100% of everything. SIA and other blockchain models are decentralized up front, and, well, there really IS no 100% at any one place. I'm JUST saying that concerns me. If SIA doesn't want to do it, maybe some NEW business will say "While they can't guarantee you your data, WE CAN" - but now we're just reselling ice cream cones in the desert, where clearly the flavor really doesn't matter.

    Anyway - thank you LjL for responding - very excited about trying to sort all this out... Any new input is good input.


  • Hmm. Couple of thoughts:

    1. The decentralized model for storing data is absolutely superior to the centralized model for storing data. With Amazon or Microsoft, you've got a single-authority point of failure. While that seems pretty great, any number of traumatic things (like war!) can happen which might affect their ability to keep your data. A decentralized system has substantially increased resistance to these types of failures. You can emulate a 'single central authority' just by putting 1 copy of the data on an existing central authority such as AWS or Azure, but especially as Sia reaches maturity that will be both very expensive and unlikely to increase reliability more than just increasing the amount of redundancy you use.

       Sia uses a customizable erasure coding approach to storing data. The default erasure code is Reed-Solomon coding, and the default redundancy on today's network is 6x. In the long term, hosts will need to put up collateral on the data that they store. If you require a host to put up 10x collateral on data (you store $1 of data, and the host puts up $10 of collateral, which the host will lose if the host loses the data), you've basically made a decentralized SLA with the host.

       You can also improve reliability by using many hosts (even if you keep the redundancy the same): 30 hosts, 100 hosts (the max is 255). This will also really improve reliability. Finally, you can have a renter that does auto-repair on a file, which means that if hosts go offline, you can upload the file to new hosts. As long as you can check in every few weeks to do some repairs, the probability of the data sticking around permanently is very high.

    2. To put all data on a centralized server would be both very expensive for us, and would be a centralized failsafe. Thing is, if host stability in the world is low enough that your decentralized files have become corrupt, there's probably a significant world event that's affecting the central server as well, especially because it is big. I really don't think it's something that makes sense. I realize switching from a central server paradigm to a decentralized paradigm is a big shift, but decentralization offers many advantages in terms of stability, uptime, and overall file survival.

    3. Uh, the universe has a lot of things that have a center, yes. But when you want reliability, decentralization tends to be the correct move. Biological ecosystems tend to be decentralized. Life exists in many forms, and while each affects the other there wasn't a central controlling force (until humans, anyway) that kept all the gears turning. You've got life in the cold areas, life in the hot areas, wet areas, dry areas, etc. An ant colony relies on a queen, but if you kill the queen the ant colony can find another. If you destroy the home, the ant colony can make another. And there are many ant colonies. You might be able to destroy one ant colony, but even the best human effort is unlikely to be able to exterminate all ants without also destroying the rest of the planet. You can think of the data like a bunch of ant colonies. Data gets split up into 'stripes', and then each stripe gets encoded into a bunch of 'sectors'. To recover a stripe, you only need a fraction of its sectors. You can think of the sectors as the ants, and the hosts as the ant colonies. If you destroy an ant, the host is largely unaffected. If you destroy an ant colony, the data is unaffected because the data can be recovered so long as some percentage (which, if you choose 10x redundancy, is only 10%!) of the ant colonies survive.

    That's probably among the more confusing analogies I've made. But if you analyze biological systems, I think you'll find that they are largely decentralized and this decentralization makes them much more robust. Sia, in some senses, was inspired by biological robustness and tries to achieve reliability through similar strategies.
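
    To put rough numbers on point 1 (same redundancy, more hosts), here is a small sketch. It assumes each host stays available independently with the same probability, which real hosts will not do exactly, and the k-of-n splits used (1-of-6, 5-of-30, 20-of-120) are just illustrative ways of getting 6x redundancy, not Sia's actual parameters. The function name is made up for the example.

```python
from math import comb


def erasure_loss_probability(n_hosts, k_needed, p_host_up):
    """Chance that fewer than k_needed of n_hosts pieces survive.

    Each piece sits on its own host, and each host is assumed to stay
    up independently with probability p_host_up.
    """
    p_down = 1 - p_host_up
    return sum(
        comb(n_hosts, i) * p_host_up**i * p_down ** (n_hosts - i)
        for i in range(k_needed)
    )


# All three layouts store 6x the original data, spread over more hosts.
for n_hosts, k_needed in [(6, 1), (30, 5), (120, 20)]:
    loss = erasure_loss_probability(n_hosts, k_needed, p_host_up=0.8)
    print(f"{k_needed}-of-{n_hosts}: loss probability {loss:.3e}")
```

    Under these assumptions the loss probability drops sharply as the same 6x of data is spread over more hosts. Correlated failures (the 'significant world event' in point 2) would break the independence assumption, so treat the output as an intuition aid rather than a guarantee.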



  • @Tim-Miltz said:

    I can not for the life of me though think why it would be a bad idea for Sia to maintain at least one copy of 'everything'.

    I can think of some:

    • it would likely be expensive and complicated for NebulousLabs, which is a startup
    • it would provide a central place where an attacker could appropriate all data if they somehow beat the encryption
    • it would force the redundancy to be at least 2x when it could otherwise end up being much less, meaning that everything must be uploaded at least twice
    • it wouldn't possibly be superior to a client forcing data to be uploaded to at least one particular trusted host (or set of hosts), which can be done trivially by the renter without many changes
    • it would undermine the credibility of Sia as something championing the concept that distributed is better than centralized (it would feel like they don't even really trust their own idea)


  • Good points.

    I do think a completely decentralized solution is capable of providing comparable probability for permanence.

    Sounds like there is a cost I left out and that is in the details of what it would take to create some centralized - stop gap approach where at least some master copy is preserved. Indeed, I imagine the total size of all data stored via SIA (and others) quickly grows to unimaginable sizes, no doubt into exabytes past petabytes.

    That said, I understand better now.

    And on my analogy to Universe, Sun, Planet heh... Not sure if it makes sense to make that analogy. If anything the one underlying force of gravity there I suppose does pull or draw matter to one singular point, same as strong nuclear force too I suppose (of the 4 forces physics is always trying to unify). While it may 'appear' to be centralized, remove gravity and strong nuclear from the equation and there is no centralization. On the other hand I realize from this (not related to SIA) heh - seems two of the four forces in nature promote some sense of centrality.

    heh anyway, I am tempted to update my take on some centralized backup and say, maybe I'm missing the beauty here that you can reach the same cost efficiency with a purely decentralized data storage and delivery model... And that one CAN reach 'as good as' data permanence with pure decentralized solutions ? Indeed, why not go for it.

    Torrents have always awakened in me that reality, that there is a great value in that you CAN get 100% of a file even if not one single person sharing has 100% of the file. For fun sake, I'll comment and say over the years I started to wonder if this is HOW cultures emerge, how human culture (Physicist David Bohm points out culture is merely 'shared meaning') evolves and maintains permanence. No one person knows all the words in a language etc, yet collectively - through human networking social protocols "Hi, my name is X, nice to meet you, ... (hours later an exchange has taken place and one participant has now inherited 3 new words etc)".

    Indeed - SIA follows the very same functional decentralized model of information systems that clearly work, even allowing for me to type this very sentence.

    Hate to say it - while I'm not religious ? Seems to me religions offer up a centralized model for an ideology - ha- instead of one massive 'backup' - usually through some 'book' heh. And we ALL know the problems that emerge once we do adhere to a centralized model such as this... If that MASTER source is ever corrupted? it's game over for all participants trying to get the original media out of it - because it has become corrupted.

    OK I'm sold - LjL made a good point earlier, that even a centralized system, even WITH backups - that ANY system always has SOME risk in failing to deliver 100% on data permanence. The question then becomes... How to - as LjL said - how to tweak a decentralized approach to allow for varying levels of increased probability (Pay more). In fact, maybe decentralized IS more scalable in that way, because with Azure, AWS, the user can not control or contribute to decisions on just how many mirrors those services do and will make of their 'everything centralized and mirrored' approach. In other words, with SIA or any decentralized approach, the user can say - I REALLY REALLY want to make sure this data has higher probability of persisting 100%, and I'm willing to pay for that. Well no, I suppose one could pay more with Azure or AWS but I'm thinking MS and Amazon probably just say here - we have already made the decision on just how many copies we will maintain, expensive as they are, it's all embedded into the price...

    Ha - that's it perhaps. That price is high. And if you ever do go to increase storage space at AWS or Azure, you will see just how high that is... I've often looked at those prices and thought, uh, this is nice, it's remote and secure, but I'll just go RAID locally.

    Hmm - speaking of RAID, I do wonder how RAID concepts might be applied to decentralized storage models. I mean, heh, in a way, you ARE getting RAID n with decentralized, as in you ARE getting in some cases many many many copies through many many many participants.

    Something tells me decentralized CAN offer AS GOOD AS data permanence as the 'all in one building, plus mirrored in 5 other buildings' approach as with the 'cloud'. Funny: I'm 47, but at age 8 I grew up using a mainframe and it was 'time share' - after 5 years sorting out the buzzword of 'cloud' I started to think, hey - we're just back to square one, remote data processing center you rent time on.

    Decentralized though is truly a paradigm shift.

    Perhaps anyone who ever ran SETI @ home - a distributed processing model that just worked well for certain kinds of problems? They likely said 'hmm - what if... people could have OTHER programs split up for distributed processing'... My take on Ethereum is ... that IS Ethereum to me, an ability for distributed processing to be standardized.

    So I now feel more comfortable SIA can provide AS GOOD AS data permanence as a centralized solution. Further, the WAY SIA works may very well be how any system of shared meaning (culture) works in the real world.

    Tempted to ask myself what 'incentives' exist in real world cultures heh, to preserve the language etc. I suppose purpose in utility at the end of the day is incentive enough.

    Myself ? I am looking at SIA for a way to increase probability of permanence for family photos etc, my own projects, just to have that one extra failsafe measure. Alas, I might be one of the only people at an exchange buying SIA not to trade, but to use. heh

    Off subject I suppose there is MAJOR ADDED BONUS of decentralized in that, MANY to ONE data requests have pretty much automatic load balancing in place. Say a company needs to persist an index, or catalog or something. They host it on some centralized site. Then they run their super bowl ad - 9 million people slam the site at half time. No one can get anything, centralized fails. But had they persisted it with SIA, and if SIA offers rules for data access (permissions 'WHO can read this' etc, which would mean permissions lists would have to be in the blockchain too I suppose - more metadata)? when 9 million people go to GET that file or whatever ? SIA model would be automatically pulling from many many many hosts, automatically balancing the load.

    Ok I digress, too much coffee.

    Thanks for insights. I've advanced my POV here, realized I didn't understand some things that I understand better now.
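
    On the RAID question raised earlier in this post: the simplest RAID-style idea, a single XOR parity block (as in RAID 5), can be sketched as below. To be clear, this is a stand-in chosen for brevity, not the Reed-Solomon coding Sia is said to use, and it only survives the loss of one piece. All function names are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data, k):
    """Split data into k equal blocks (zero padded) plus one parity block."""
    block_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(block_len * k, b"\0")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(k)]
    return blocks + [xor_blocks(blocks)]


def recover(pieces, original_len):
    """Rebuild the data from pieces where at most one entry is None (lost)."""
    missing = [i for i, piece in enumerate(pieces) if piece is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost piece")
    pieces = list(pieces)
    if missing:
        survivors = [piece for piece in pieces if piece is not None]
        # XOR of all surviving pieces equals the one that was lost.
        pieces[missing[0]] = xor_blocks(survivors)
    return b"".join(pieces[:-1])[:original_len]


photo = b"family photo bytes"
pieces = encode(photo, 3)   # 3 data pieces + 1 parity piece, one per host
pieces[1] = None            # one host disappears
print(recover(pieces, len(photo)) == photo)
```

    Here no host holds a full copy, yet any single host can vanish and the data comes back, at a storage cost of 4/3x rather than 2x. Reed-Solomon generalizes the same idea to tolerate many lost pieces.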



  • @LjL Heh- thanks for looking past my irrational rant - for most readers - first occurrence of Universe was probably a red flag.

    I see your points LjL.

    Dare I say it, though: your comment above, which I've certainly seen made before, is that centralized models invite the attacker?

    This is true when there is no protocol allowing for decentralized solutions.

    Alas, take a nation state... I dare say- SOMEHOW (no names, means Arbusto in Spanish?) people have sold the citizens of the United States that the PROPER model is the Azure/AWS model - centralized.

    Meanwhile we all know there are ALWAYS going to be people who will exploit that model.

    So the solution politically moving forward is PROBABLY a decentralized political model where protocol alone keeps the Oikos Nomos (Greek, eco-nomos or 'economy', meaning 'order of the house').

    INDEED

    15 years from now: "And you see, kids? We now realize that the benefit of moving to a decentralized global society is that there is no central target, thus nuclear weapons have no purpose. And we realize the problem wasn't terrorism (anationalists attacking) or simply war (nationalists attacking). The problem that EVER LET nuclear warheads show UP on the plate of humanity wasn't the technology, or even the people who would use such a hideous weapon of terror - the PROBLEM, kids... was the centralized political model... Next class we are going to talk about the pitfalls of a centralized sovereign based fiat currency and the hyper inflation that we all witnessed in 2018 that led to the great die off of 3 billion people." (bell rings) "Ok class, that's it for today then, and remember, make sure you submit your class assignments to SIA by midnight for extra credit."

    Oh my - from that absurdity above - I just realized..

    What a wonderful solution, using SIA for education/class assignments.

    "Look Mr. Baskers (IS there a Mr. Baskers out there in this world ? IS THERE ? ) - I turned in my paper, it's in SIA - you can see the time and date it was submitted"

    Ugh- really... I need 17 hours of sleep.

    Thanks LjL for tolerating my irrational defensiveness earlier. You kept the focus on the issues well.



  • @LjL

    I think I am heading over to Augur and setting up a speculative survey to see if more people than not think Mr. Baskers sounds like a cat's name.



  • @Taek

    Was just thinking on my porch and I thought:

    I need to send a Thank you for an elegant reply chock full of insight and knowledge.

    In OTHER words, thanks for taking the time for a lengthy reply.

