Incorporating trusted components in otherwise decentralized apps (decentralized web crawling)

Because Skynet can serve files, it can also serve self-contained browser-based apps written in JavaScript. With the registry, these apps can maintain and update state, using the same underlying decentralized storage model. While there is a lot that can be done with Skynet, it cannot do everything.
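
To make the registry's role concrete, here is a minimal sketch (TypeScript) of what an app-level state entry might look like. The type and field names are my own, loosely modeled on how Skynet registry entries work (a data key, a payload such as a skylink, and a revision number); this is not the exact skynet-js API.

```ts
// Hypothetical shape of a Skynet-style registry entry; illustrative
// names, not the exact skynet-js API.
interface RegistryEntry {
  dataKey: string;  // identifies which piece of app state this is
  data: string;     // typically a skylink pointing at the stored content
  revision: bigint; // monotonically increasing; the highest revision wins
}

// Updating app state amounts to uploading new content, then publishing
// a new entry under the same data key with a higher revision number.
function nextEntry(prev: RegistryEntry, newSkylink: string): RegistryEntry {
  return { dataKey: prev.dataKey, data: newSkylink, revision: prev.revision + 1n };
}
```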

A specific example that comes to mind: I want to implement decentralized web crawling. In particular, I want to create a tool for quickly creating mirrors of web pages and storing them on Skynet. I have done the research, and browser-based web crawling is not really a thing. While I might be able to achieve something if I, say, transpiled a Node crawler into browser JavaScript, I think it would be a better use of time to use a standard web crawler that runs on a server.
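
To make the division of labor concrete, here is a sketch of what the app-to-server interface could look like. The endpoint path and field names are invented for illustration; nothing here is an existing API.

```ts
// Hypothetical request/response shapes for a server-side crawl endpoint.
interface CrawlRequest {
  url: string;     // page to mirror
  depth?: number;  // optional link-following depth
}

interface CrawlResponse {
  skylink: string;     // where the memento landed on Skynet
  contentHash: string; // hash of the memento, hex-encoded
  signature: string;   // server's signature over the memento content (see below)
}

// The browser app would call something like:
async function requestCrawl(server: string, req: CrawlRequest): Promise<CrawlResponse> {
  const res = await fetch(`${server}/crawl`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`crawl failed: ${res.status}`);
  return (await res.json()) as CrawlResponse;
}
```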

But this introduces an element of trust. Even if the app and the mementos (web archives) are stored on Skynet, with state tracked through the registry, you would still need to depend on a specific server with a specific piece of software. Here are my recommendations for mitigating this:

  • Server-based software components should be open source. This should go without saying. On top of this, it should be as easy as possible for others to deploy their own instance of the server software. Not only should it be feasible for others to take over if your own server goes down; it should also be possible for people to actively use alternatives.

  • Within the app's settings, a default server may be specified, but the option should exist to point the app at an alternative server through a URL. Ideally there would also be an option to persist these server preferences (including the de-selection of the default) through something like a SkyID account; see the settings sketch after this list.

  • Then there is the generated data. Normally with a centralized archive provider you can be sure that their software is serving their archive because… it’s their server. But in this decentralized storage environment, mementos, or alleged “mementos,” can come from rogue servers or from nowhere at all. When a memento is prepared, its content should be hashed and that hash should be signed by the server. The client can then recompute the hash of the contents and verify that the signature over it comes from a server-owned key (a verification sketch follows this list). Whether the source is trustworthy is up to the end user, but you should at least be able to trust that the memento (or other computed/retrieved data object) comes from where it says it did.
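
On the server-choice point, here is a minimal sketch of the settings logic, assuming browser localStorage for local persistence (a SkyID-backed profile could fill the same role so the preference follows the user across devices). The default URL and all names are placeholders.

```ts
// Hypothetical settings model for choosing a compute server.
interface CrawlerSettings {
  serverUrl: string;   // currently selected server
  useDefault: boolean; // false once the user de-selects the default
}

const DEFAULT_SERVER = "https://crawler.example.com"; // placeholder URL

function effectiveServer(s: CrawlerSettings): string {
  return s.useDefault ? DEFAULT_SERVER : s.serverUrl;
}

// Local persistence; a SkyID account could serve the same purpose,
// including remembering the de-selection of the default server.
function saveSettings(s: CrawlerSettings): void {
  localStorage.setItem("crawlerSettings", JSON.stringify(s));
}

function loadSettings(): CrawlerSettings {
  const raw = localStorage.getItem("crawlerSettings");
  return raw
    ? (JSON.parse(raw) as CrawlerSettings)
    : { serverUrl: DEFAULT_SERVER, useDefault: true };
}
```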
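And for attestation, a sketch of the client-side verification using the browser's Web Crypto API. I have picked ECDSA P-256 over SHA-256 purely because Web Crypto supports it everywhere; the real signature scheme, envelope format, and key-distribution mechanism would be defined by the server software.

```ts
// Import the server's published public key (e.g. a JWK fetched from a
// well-known URL on the server).
async function importServerKey(jwk: JsonWebKey): Promise<CryptoKey> {
  return crypto.subtle.importKey(
    "jwk",
    jwk,
    { name: "ECDSA", namedCurve: "P-256" },
    false,
    ["verify"],
  );
}

// Verify that a memento was signed by that key. ECDSA with
// { hash: "SHA-256" } hashes the content internally, so a successful
// verification covers both the content hash and the signature.
async function verifyMemento(
  mementoBytes: ArrayBuffer, // the archived content as downloaded
  signature: ArrayBuffer,    // the server's signature
  serverKey: CryptoKey,
): Promise<boolean> {
  return crypto.subtle.verify(
    { name: "ECDSA", hash: "SHA-256" },
    serverKey,
    signature,
    mementoBytes,
  );
}
```

A passing check only proves which server produced the memento, not that the memento faithfully reflects the original page; as noted above, that trust judgment stays with the user.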

To distill this into three basic principles: interoperable independent compute servers, freedom of server choice, and server attestation to outputs. Are there any dimensions I am missing?

I understand that, generally speaking, the ideal is to have the user do as much computation as possible, and emulation in JavaScript and things like WASM keep expanding what is possible with browser-based apps. However, I suspect there are situations where delegating to a central (or central-ish) compute server may be inevitable, for instance with large amounts of data or in particular scenarios like web crawling. And in general I think we should be flexible. Skynet should not constrain developers, but free them, and Skynet will be at its best with a diverse ecosystem of use cases. I am interested in hearing other thoughts.