Project Name: Discovery Hub for SIA Storage
Organisation: Fabstir
Primary Contact: Jules Lai (Founder and CTO of Fabstir)
Summary:
Allows content/data on Sia to be searched and discovered by asking a decentralised, open-source AI LLM that interprets the meaning of a user's prompt to give more accurate search results. This is an opt-in service: the content creator decides what to index, and users preserve their anonymity and privacy in all searches. Intermediaries, megacorps and other spying agencies will be unable to web-crawl this data.
Objectives:
- The purpose of this project is to establish a secure and scalable solution for storing Elasticsearch snapshots on SIA decentralised storage, empowering Web3 projects to integrate open-source AI LLM models while upholding the principles of decentralisation and democratisation.
- To lead by example by deploying a live discovery hub where NFT owners can choose to register their digital assets in this decentralised registry, to be found in search results returned by an open-source LLM model in response to user prompt queries.
- To make available an API, so that any dapp or app can plug in, register assets and utilise the Hub programmatically.
- To build upon what has been developed over the last year for Fabstir Media Player (a Sia grant project), which enables the tokenisation and consumption of digital assets; and to establish a live platform that utilises Sia for decentralised storage and an EVM-compatible blockchain for smart contracts, where anyone can use the player to tokenise their content onto Sia and upload details to the registry for discovery by the public.
- To ensure that fees are as low as possible yet sufficient to maintain the Hub's ongoing existence.
The Problem:
The problem is that we live in a world where the content/data that you have created is monetised and profited from by intermediaries and megacorps for their own gain, and you as the creator get either nothing or the last piece of the pie from whatever is left over. Blockchain technology has solved the property ownership problem via NFTs. An NFT stands for "non-fungible token" and is like a unique digital certificate of ownership for a specific digital asset. Hashing, encryption, and CIDs enable that content and data to be accessible, but how do we find the NFT and its contents?
Using Google search just gets us back to square one (Web 2.0) where megacorps control you and hold your data hostage to be sold to advertisers and others.
The Solution:
The proposed solution involves the development of a decentralised search engine that uses an open-source AI LLM model to truly revolutionise content and data discovery, providing users exactly what they are seeking: a system that does not rank search results by who pays the most, but one that serves the wishes of communities and humankind. The index will be stored on Sia's peer-to-peer storage for encryption, privacy, accessibility and redundancy/reliability, utilising S5's CDN and its content links (CIDs) as the bridge to Web3.
Possible Use Cases:
- Artists can tokenise their artworks, music, or videos as NFTs, ensuring they retain ownership and get fair compensation for their work. The decentralised search engine would allow fans to discover and directly support artists without intermediaries taking a significant cut.
- Educators and institutions can tokenise their courses or educational materials. Students and learners worldwide can search for and access content directly from the creators, ensuring authenticity and potentially reducing costs.
- Researchers can tokenise their papers and findings, making them readily available for discovery without the barriers often imposed by traditional academic journals and walled gardens. This could accelerate the sharing of knowledge and collaboration.
- For discovering a wide range of tokenised digital assets, from ebooks to software tools. Creators can list their NFTs, and users can easily find unique and specific digital goods.
- Journalists can tokenise their articles and reports, providing a direct revenue stream and an uncensored platform for news. Readers can find and support trustworthy and independent news sources.
- Organisations can tokenise and upload historical documents, artworks, and cultural artifacts, making them easily discoverable and accessible for educational purposes, research, or general interest.
- Individuals can tokenise and control their personal data and identity documents. This could be used for secure, private verification purposes without relying on central authorities.
- Filmmakers and producers can tokenise movies, series, and other entertainment content, providing a direct link between creators and their audience, and offering a new model for distribution and profit-sharing.
- Tokenising real estate properties or virtual land in digital worlds, making them searchable and tradable on a global scale without the need for traditional, often cumbersome, real estate processes.
- Tokenising and securely storing individual health records, making them easily accessible to authorised users and ensuring patient privacy and data security.
The Tech:
This proposal seeks grant funding for the development of a new project by Fabstir to build open-source code to store Elasticsearch snapshots on Sia - Decentralised data storage. Elasticsearch is a highly scalable and distributed search and analytics engine. It is commonly used in the context of AI to store, search, and analyse large volumes of data, including textual data, which is relevant to many AI applications.
An Elasticsearch snapshot is a backup of a running Elasticsearch cluster or a subset of its indices stored in a repository. Snapshots include the AI data plus scripts and queries. Snapshots can be used to recover data after deletion or a hardware failure, or to replicate data across nodes or transfer data between clusters or applications. Snapshots only copy the data that has changed since the last snapshot, saving storage space and network bandwidth.
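As a sketch of how the pieces above could fit together, the snippet below builds the payloads for registering a snapshot repository and creating a snapshot via Elasticsearch's snapshot APIs. It assumes Sia would be exposed through an S3-compatible gateway (for example, renterd's S3 API) so Elasticsearch's standard "s3" repository type can point at it; the endpoint, bucket, and index names are illustrative placeholders, not a finished design.

```python
# Assumption: Sia storage is reachable via an S3-compatible gateway, so the
# built-in "s3" snapshot repository type can be used. Names are placeholders.

def repository_settings(bucket: str, endpoint: str) -> dict:
    """Payload for PUT /_snapshot/<repo_name> (register a repository)."""
    return {
        "type": "s3",
        "settings": {
            "bucket": bucket,
            "endpoint": endpoint,       # the hypothetical Sia S3 gateway
            "path_style_access": True,  # gateways typically need path-style URLs
        },
    }

def snapshot_request(indices: list[str]) -> dict:
    """Payload for PUT /_snapshot/<repo_name>/<snapshot_name>."""
    return {
        "indices": ",".join(indices),
        "include_global_state": True,  # keeps ingest pipelines (scripts) too
    }

repo = repository_settings("es-snapshots", "http://localhost:9985")
snap = snapshot_request(["nft-registry", "user-queries"])
```

Because snapshots are incremental, repeated snapshot requests against the same repository upload only changed segments, which keeps Sia bandwidth and storage costs down.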
We will also guarantee privacy when it comes to your usage history and custom data. For example, OpenAI (an oxymoronic name, as their code is closed-source), creators of ChatGPT, state in their data policy that submitted data may be retained and reviewed for up to a month. That should automatically rule out enterprise usage, and usage by anyone who cares about privacy and the possibility of their data being used against them.
This project aims to empower Web3 projects to integrate AI and search features while upholding the principles of decentralisation and democratisation.
Storing Elasticsearch Snapshots on SIA
An AI system that uses Elasticsearch can consist of several components, such as:
- Data: The raw or processed data that is used for training or inference of AI models. Data can be stored in Elasticsearch indices or data streams, which can be included in snapshots.
- Models: The trained or imported AI models that are used for inference or prediction tasks. Models can be stored in Elasticsearch feature states, which cannot be included in snapshots, but can be backed up and restored using the Features API.
- Scripts: The scripts that are used to define the logic or parameters of AI tasks, such as natural language processing, vector search, etc. Scripts can be stored in Elasticsearch ingest pipelines, which are part of the cluster state and can be included in snapshots.
- Queries: The queries that are used to interact with the AI system, such as asking questions, searching for documents, finding anomalies, etc. Queries can be stored in Elasticsearch indices or data streams, which can be included in snapshots.
Therefore, snapshots can store the AI data, scripts, and queries. To back up and restore the entire AI system, we will use a combination of snapshots and feature states; the latter hold the AI models.
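A minimal sketch of the combined backup described above: a single create-snapshot request that captures index data, cluster state, and the feature states holding trained models. The create-snapshot API accepts a "feature_states" list; the "machine_learning" feature-state name used below is an assumption to be confirmed against the deployed Elasticsearch version.

```python
# Assumption: the "machine_learning" feature state is where trained models
# live in the target Elasticsearch version; index names are illustrative.

def full_backup_request(indices: list[str], features: list[str]) -> dict:
    """Payload for PUT /_snapshot/<repo>/<snapshot>?wait_for_completion=true."""
    return {
        "indices": ",".join(indices),  # data and stored queries
        "include_global_state": True,  # cluster state, incl. ingest pipelines
        "feature_states": features,    # feature states backing the AI models
    }

backup = full_backup_request(["nft-registry"], ["machine_learning"])
```

Restoring this one snapshot would then recover data, scripts, queries, and models together, rather than stitching them back from separate backups.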
Benefits of Using Elasticsearch for LLM Models
Elasticsearch was chosen for this project owing to its robust capabilities especially suited for Large Language Models (LLMs). Its primary advantages include:
- Scalability and Performance: Elasticsearch's horizontal scalability efficiently manages large datasets common in AI models. Its real-time processing enhances performance.
- Robust Full-Text Search: Elasticsearch excels in extensive textual data search, supporting complex queries and providing rapid results, an essential feature for LLMs.
- Distributed and Highly Available: Its distributed nature enhances system speed and reliability, ensuring high availability and fault tolerance—critical when dealing with significant AI model data.
- Flexible Data Handling: As a document-oriented and schema-less system, Elasticsearch aligns well with the evolving JSON-based structures often used in AI, offering flexibility in data management.
- Integration with AI Tools: Easy integration with popular AI and machine learning tools simplifies the deployment and scaling of LLMs.
In essence, Elasticsearch's versatility and robust capabilities make it an excellent choice for handling model data within AI projects. It is also open-source and free to use. From my experience so far, the Elastic team have been responsive to my queries. There is a commercial version with extra features, but this project does not use those features for now.
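To illustrate the full-text search capability described above, here is the kind of query body the Hub could issue once the LLM has interpreted a user's prompt. The field names ("title", "description", "tags") and boost values are illustrative assumptions; a production mapping would be defined alongside the registry schema.

```python
# Assumption: the registry index has "title", "description" and "tags" text
# fields. The boost on "title" and AUTO fuzziness are illustrative choices.

def discovery_query(interpreted_prompt: str, size: int = 10) -> dict:
    """Body for POST /<index>/_search."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": interpreted_prompt,
                "fields": ["title^2", "description", "tags"],  # boost titles
                "fuzziness": "AUTO",  # tolerate typos in user prompts
            }
        },
    }

body = discovery_query("ambient electronic album about oceans")
```

Relevance here comes from Elasticsearch's scoring rather than from paid placement, which is the ranking property the Solution section calls for.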
Bios:
Jules Lai (CTO of Fabstir): Jules Lai, based in London, is the Chief Technology Officer (CTO) of Fabstir. With a strong academic background in computing and mathematics, Jules holds both a degree and an MSc in these fields. Previously, he worked as a senior software developer, designing and implementing financial modelling software used by esteemed organisations such as Lloyd's of London and the Bank of England. Beyond his professional endeavours, Jules has made a significant impact on the UK filmmaking and film business communities for over a decade. As CTO of Fabstir, Jules has applied his skills in JavaScript/TypeScript, Solidity, React, Rust and DevOps to various Web3 projects. He has previously received a grant from the SIA Foundation and is actively working within the SIA open-source community, having developed code for video streaming using S5, a Metamask plugin, encryption enhancements to s5client-js, a new platform for video transcoding, and examples of cloud deployments for the codebase developed; going beyond the original project proposal.
Abdur Rub S. (AI/Data Science & DevOps): Abdur Rub S. is a proficient data scientist and associate solution architect known for his expertise in artificial intelligence (AI) and data science. With diversified experience in development and leadership roles, his focus lies in various areas, including Conversational AI, Machine Learning, Natural Language Understanding, Data Engineering, Big Data, Cloud Computing, and DevOps. Abdur possesses comprehensive skills and knowledge, demonstrating expertise in Python, AI chatbots, Django, GPT-3, ChatGPT, Natural Language Processing, Machine Learning, and Cloud Architecture. His passion for AI and data science has driven him to contribute to cutting-edge projects, combining his skills with a solid foundation in cloud computing and DevOps. Abdur Rub S. is committed to delivering innovative AI solutions, leveraging his experience and expertise to drive successful outcomes.
These bios provide an overview of Jules Lai’s background as a software developer, as a Web3 full stack developer, and his expertise in computing and mathematics as well as his continued contribution to the SIA community. In the case of Abdur Rub S., his bio highlights his diverse experience in data science, AI, and DevOps, showcasing his proficiency in various domains and technologies.
Budget:
We request a grant of $90,000 to support the project’s development over a period of nine months. The budget allocation is as follows:
Jules Lai will serve as part-time project manager as well as a developer.
- Jules Lai (CTO of Fabstir, London): $40,000
- Abdur Rub S. (AI/Data Science & DevOps, Elasticsearch, Pakistan): $10,000
- 3rd developer with Java and Elasticsearch skills: $25,000
Miscellaneous expenses breakdown:
- Hardware, such as a laptop: $2,500
- Running fees for decentralised cloud computing (GPU environment): $5,500
- Miscellaneous costs (e.g. accounting, legal, other business fees): $1,000
- Smart contract audit: $4,000
- Contingency: unplanned software upgrades or licensing costs, faulty hardware replacements, additional cloud, networking or compute costs, inflation, and discovered costs for security measures and testing: $2,000
Proposed Timeline (9 months):
Q1
- Project initiation, requirement gathering, and planning
- A working test model from Huggingface that can be trained efficiently with custom data
- Development of code to enable SIA storage as an Elasticsearch snapshot repository with feature states
Q2
- Front-end browser GUI dapp for description uploading and customisation, plus integration of Fabstir Media Player
- Deployment and replication testing of the Elasticsearch snapshot repository on SIA storage
- Integration of SIA storage and Elasticsearch, dapp testing, and bug fixing (initial deployment)
Q3
- User interface and API enhancements and performance optimisation
- Deployment of live search platform for public beta testing
- Final testing, documentation, and project delivery
Open-Source Statement:
The project will be developed using the MIT license, which ensures the open-source nature of the software and encourages collaboration and community involvement. By adopting an open-source approach, we aim to foster transparency, innovation, and wider adoption within the SIA community.
Technology:
The project will utilise the following technologies:
- Elasticsearch: for indexing and search capabilities
- SIA decentralised storage: for secure and scalable storage of Elasticsearch snapshots
- S5 and s5client-js: for CDN and API access to SIA
- Open-source LLM models (e.g., from Huggingface): for the AI model pipeline and customisation
- Web3 technologies: for developing the user interface and interacting with decentralised storage
- Fabstir Media Player: for digital asset tokenisation, media transcoding, consumption and registration to the Sia Hub registry for AI vectorisation and search
- Programming languages used will be Python, Java, JavaScript/TypeScript and Rust.
- The code developed will handle all the metadata JSON objects, serialised binary data, etc. Tests will also be performed with Elastic Cloud on Kubernetes (ECK), the official operator for managing Elastic Stack applications on Kubernetes, to ensure that replication and syncing work for battle-ready deployments.
- Note that at some point Elasticsearch will introduce a more efficient stateless architecture, and we plan to adapt to that.
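As a sketch of the metadata JSON handling mentioned above, the snippet below shows the shape of a registry document the dapp could submit when a creator opts in to indexing an NFT. All field names, and the idea of keying documents by chain ID, contract address and token ID, are illustrative assumptions rather than a finalised schema.

```python
# Assumption: registry documents are keyed by chain:contract:token and hold
# the text the LLM pipeline indexes plus an S5 CID link to the asset on Sia.

def registry_document(chain_id: int, contract: str, token_id: int,
                      title: str, description: str, cid: str) -> tuple[str, dict]:
    """Returns (document id, body) for PUT /nft-registry/_doc/<id>."""
    doc_id = f"{chain_id}:{contract}:{token_id}"
    body = {
        "title": title,
        "description": description,  # text the LLM pipeline vectorises
        "cid": cid,                   # S5 content link to the asset on Sia
        "opt_in": True,               # only creator-approved assets are indexed
    }
    return doc_id, body

doc_id, body = registry_document(1, "0xAbC...", 42, "Ocean Suite",
                                 "Ambient album about oceans", "z2aExampleCid")
```

Keying by chain, contract and token keeps registration idempotent: re-registering the same NFT simply overwrites its existing document instead of creating a duplicate.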
Risks:
Technical Challenges: Potential complexities in integrating SIA storage with Elasticsearch and ensuring seamless functionality. Performance and reliability will be a strong focus. We do have a working prototype AI search engine that uses an open-source LLM model and an Elasticsearch index, though it is currently stored centrally in a MySQL database.
Timeline: Unforeseen obstacles that may delay project milestones and deliverables.
Adoption: Encouraging Web3 projects to embrace decentralised storage and open-source LLM models, and promoting the benefits of a Web3 approach to decentralised search.
Future Development:
- Beyond the proposed project timeline, further development and enhancement of the platform, including expanding support for additional open-source LLM model use cases as they improve.
- Integration with NFT open market platforms to popularise decentralised AI search.
- Expansion to non-EVM blockchains.
- Governance to empower the search network's future based on community involvement.
We believe that this project aligns with the Web3 ethos and offers a unique opportunity to shape the future of AI through decentralised storage and content search.
Thank you for considering our proposal. We look forward to the possibility of receiving grant funding to support this innovative project.
Sincerely,
Jules Lai
CTO, Fabstir