Project Name: Discovery Hub for SIA Storage
Organisation: Fabstir
Primary Contact: Jules Lai (Founder and CTO of Fabstir)
Summary:
Allows content/data on Sia to be searched and discovered by asking a decentralised, open-source AI LLM that interprets the meaning of a user's prompt to give more accurate search results. This is an opt-in service: the content creator decides what to index, and users preserve their anonymity and privacy in all searches. Intermediaries, megacorps and other spying agencies will be unable to web-crawl this data.
Objectives:
- The purpose of this project is to establish a secure and scalable solution for storing Elasticsearch snapshots on SIA decentralised storage, empowering Web3 projects to integrate open-source AI LLM models while upholding the principles of decentralisation and democratisation.
- To lead by example by deploying a live discovery hub where NFT owners can choose to register their digital assets in this decentralised registry, to be found in search results returned by an open-source LLM model in response to user prompt queries.
- To make available an API, so that any dapp or app can plug in, register assets and utilise the Hub programmatically.
- To build upon what has been developed over the last year for Fabstir Media Player (a Sia grant project), which enables the tokenisation and consumption of digital assets; and to establish a live platform that utilises Sia for decentralised storage and an EVM-compatible blockchain for smart contracts, where anyone can use the player to tokenise their content onto Sia and upload details to the registry for discovery by the public.
- To ensure that fees are as low as possible yet sufficient to maintain the Hub's ongoing existence.
The Problem:
The problem is that we live in a world where the content/data that you have created is monetised and profited from by intermediaries and megacorps for their own gain, and you as the creator get either nothing or the last piece of the pie from whatever is left over. Blockchain technology has solved the property ownership problem via NFTs. An NFT stands for "non-fungible token" and is like a unique digital certificate of ownership for a specific digital asset. Hashing, encryption, and CIDs enable that content and data to be accessible, but how do we find the NFT and its contents?
Using Google search just gets us back to square one (Web 2.0) where megacorps control you and hold your data hostage to be sold to advertisers and others.
The Solution:
The proposed solution involves the development of a decentralised search engine that uses an open-source AI LLM model to truly revolutionise content and data discovery, providing users exactly what they are seeking: a system that does not rank search results by who pays the most, but one that serves the wishes of communities and humankind. The index will be stored on Sia's peer-to-peer storage for encryption, privacy, accessibility and redundancy/reliability, utilising S5's CDN and its content links (CIDs) as the bridge to Web3.
Possible Use Cases:
- Artists can tokenise their artworks, music, or videos as NFTs, ensuring they retain ownership and get fair compensation for their work. The decentralised search engine would allow fans to discover and directly support artists without intermediaries taking a significant cut.
- Educators and institutions can tokenise their courses or educational materials. Students and learners worldwide can search for and access content directly from the creators, ensuring authenticity and potentially reducing costs.
- Researchers can tokenise their papers and findings, making them readily available for discovery without the barriers often imposed by traditional academic journals and walled gardens. This could accelerate the sharing of knowledge and collaboration.
- For discovering a wide range of tokenised digital assets, from ebooks to software tools. Creators can list their NFTs, and users can easily find unique and specific digital goods.
- Journalists can tokenise their articles and reports, providing a direct revenue stream and an uncensored platform for news. Readers can find and support trustworthy and independent news sources.
- Organisations can tokenise and upload historical documents, artworks, and cultural artifacts, making them easily discoverable and accessible for educational purposes, research, or general interest.
- Individuals can tokenise and control their personal data and identity documents. This could be used for secure, private verification purposes without relying on central authorities.
- Filmmakers and producers can tokenise movies, series, and other entertainment content, providing a direct link between creators and their audience, and offering a new model for distribution and profit-sharing.
- Tokenising real estate properties or virtual land in digital worlds, making them searchable and tradable on a global scale without the need for traditional, often cumbersome, real estate processes.
- Tokenising and securely storing individual health records, making them easily accessible to authorised users and ensuring patient privacy and data security.
The Tech:
This proposal seeks grant funding for the development of a new project by Fabstir to build open-source code to store Elasticsearch snapshots on Sia - Decentralised data storage. Elasticsearch is a highly scalable and distributed search and analytics engine. It is commonly used in the context of AI to store, search, and analyse large volumes of data, including textual data, which is relevant to many AI applications.
An Elasticsearch snapshot is a backup of a running Elasticsearch cluster or a subset of its indices stored in a repository. Snapshots include the AI data plus scripts and queries. Snapshots can be used to recover data after deletion or a hardware failure, or to replicate data across nodes or transfer data between clusters or applications. Snapshots only copy the data that has changed since the last snapshot, saving storage space and network bandwidth.
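As a sketch of how the pieces above could fit together, the snippet below builds the payloads for registering a snapshot repository and creating a snapshot via Elasticsearch's snapshot APIs. It assumes Sia would be exposed through an S3-compatible gateway (for example, renterd's S3 API) so Elasticsearch's standard "s3" repository type can point at it; the endpoint, bucket, and index names are illustrative placeholders, not a finished design.

```python
# Assumption: Sia storage is reachable via an S3-compatible gateway, so the
# built-in "s3" snapshot repository type can be used. Names are placeholders.

def repository_settings(bucket: str, endpoint: str) -> dict:
    """Payload for PUT /_snapshot/<repo_name> (register a repository)."""
    return {
        "type": "s3",
        "settings": {
            "bucket": bucket,
            "endpoint": endpoint,       # the hypothetical Sia S3 gateway
            "path_style_access": True,  # gateways typically need path-style URLs
        },
    }

def snapshot_request(indices: list[str]) -> dict:
    """Payload for PUT /_snapshot/<repo_name>/<snapshot_name>."""
    return {
        "indices": ",".join(indices),
        "include_global_state": True,  # keeps ingest pipelines (scripts) too
    }

repo = repository_settings("es-snapshots", "http://localhost:9985")
snap = snapshot_request(["nft-registry", "user-queries"])
```

Because snapshots are incremental, repeated snapshot requests against the same repository upload only changed segments, which keeps Sia bandwidth and storage costs down.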
We will also guarantee privacy when it comes to your usage history and custom data. For example, OpenAI (an oxymoronic name, as their code is closed-source), creators of ChatGPT, state in their data policy that submitted data may be retained and reviewed for up to a month. That should automatically rule out enterprise usage, and usage by anyone who cares about privacy and the possibility of their data being used against them.
This project aims to empower Web3 projects to integrate AI and search features while upholding the principles of decentralisation and democratisation.
Storing Elasticsearch Snapshots on SIA
An AI system that uses Elasticsearch can consist of several components, such as:
- Data: The raw or processed data that is used for training or inference of AI models. Data can be stored in Elasticsearch indices or data streams, which can be included in snapshots.
- Models: The trained or imported AI models that are used for inference or prediction tasks. Models can be stored in Elasticsearch feature states, which cannot be included in snapshots, but can be backed up and restored using the Features API.
- Scripts: The scripts that are used to define the logic or parameters of AI tasks, such as natural language processing, vector search, etc. Scripts can be stored in Elasticsearch ingest pipelines, which are part of the cluster state and can be included in snapshots.
- Queries: The queries that are used to interact with the AI system, such as asking questions, searching for documents, finding anomalies, etc. Queries can be stored in Elasticsearch indices or data streams, which can be included in snapshots.
Therefore, snapshots can store the AI data, scripts, and queries. To back up and restore the entire AI system, we will use a combination of snapshots and feature states; the latter hold the AI models.
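A minimal sketch of the combined backup described above: a single create-snapshot request that captures index data, cluster state, and the feature states holding trained models. The create-snapshot API accepts a "feature_states" list; the "machine_learning" feature-state name used below is an assumption to be confirmed against the deployed Elasticsearch version.

```python
# Assumption: the "machine_learning" feature state is where trained models
# live in the target Elasticsearch version; index names are illustrative.

def full_backup_request(indices: list[str], features: list[str]) -> dict:
    """Payload for PUT /_snapshot/<repo>/<snapshot>?wait_for_completion=true."""
    return {
        "indices": ",".join(indices),  # data and stored queries
        "include_global_state": True,  # cluster state, incl. ingest pipelines
        "feature_states": features,    # feature states backing the AI models
    }

backup = full_backup_request(["nft-registry"], ["machine_learning"])
```

Restoring this one snapshot would then recover data, scripts, queries, and models together, rather than stitching them back from separate backups.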
Benefits of Using Elasticsearch for LLM Models
Elasticsearch was chosen for this project owing to its robust capabilities especially suited for Large Language Models (LLMs). Its primary advantages include:
- Scalability and Performance: Elasticsearch's horizontal scalability efficiently manages large datasets common in AI models. Its real-time processing enhances performance.
- Robust Full-Text Search: Elasticsearch excels in extensive textual data search, supporting complex queries and providing rapid results, an essential feature for LLMs.
- Distributed and Highly Available: Its distributed nature enhances system speed and reliability, ensuring high availability and fault tolerance—critical when dealing with significant AI model data.
- Flexible Data Handling: As a document-oriented and schema-less system, Elasticsearch aligns well with the evolving JSON-based structures often used in AI, offering flexibility in data management.
- Integration with AI Tools: Easy integration with popular AI and machine learning tools simplifies the deployment and scaling of LLMs.
In essence, Elasticsearch's versatility and robust capabilities make it an excellent choice for handling model data within AI projects. It is also open-source and free to use. From my experience so far, the Elastic team have been responsive to my queries. There is a commercial version with extra features, but this project does not use those features for now.
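To illustrate the full-text search capability described above, here is the kind of query body the Hub could issue once the LLM has interpreted a user's prompt. The field names ("title", "description", "tags") and boost values are illustrative assumptions; a production mapping would be defined alongside the registry schema.

```python
# Assumption: the registry index has "title", "description" and "tags" text
# fields. The boost on "title" and AUTO fuzziness are illustrative choices.

def discovery_query(interpreted_prompt: str, size: int = 10) -> dict:
    """Body for POST /<index>/_search."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": interpreted_prompt,
                "fields": ["title^2", "description", "tags"],  # boost titles
                "fuzziness": "AUTO",  # tolerate typos in user prompts
            }
        },
    }

body = discovery_query("ambient electronic album about oceans")
```

Relevance here comes from Elasticsearch's scoring rather than from paid placement, which is the ranking property the Solution section calls for.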
Bios:
Jules Lai (CTO of Fabstir): Jules Lai, based in London, is the Chief Technology Officer (CTO) of Fabstir. With a strong academic background in computing and mathematics, Jules holds both a degree and an MSc in these fields. Previously, he worked as a senior software developer, designing and implementing financial modelling software used by esteemed organisations such as Lloyd's of London and the Bank of England. Beyond his professional endeavours, Jules has made a significant impact on the UK filmmaking and film business communities for over a decade. As CTO of Fabstir, Jules has applied his skills in JavaScript/TypeScript, Solidity, React, Rust and DevOps to various Web3 projects. He has previously received a grant from the SIA Foundation and is actively working within the SIA open-source community, having developed code for video streaming using S5, a Metamask plugin, encryption enhancements to s5client-js, a new platform for video transcoding, and examples of cloud deployments for the codebase developed; going beyond the original project proposal.
Abdur Rub S. (AI/Data Science & DevOps): Abdur Rub S. is a proficient data scientist and associate solution architect known for his expertise in artificial intelligence (AI) and data science. With diversified experience in development and leadership roles, his focus lies in various areas, including Conversational AI, Machine Learning, Natural Language Understanding, Data Engineering, Big Data, Cloud Computing, and DevOps. Abdur possesses comprehensive skills and knowledge, demonstrating expertise in Python, AI chatbots, Django, GPT-3, ChatGPT, Natural Language Processing, Machine Learning, and Cloud Architecture. His passion for AI and data science has driven him to contribute to cutting-edge projects, combining his skills with a solid foundation in cloud computing and DevOps. Abdur Rub S. is committed to delivering innovative AI solutions, leveraging his experience and expertise to drive successful outcomes.
These bios provide an overview of Jules Lai’s background as a software developer, as a Web3 full stack developer, and his expertise in computing and mathematics as well as his continued contribution to the SIA community. In the case of Abdur Rub S., his bio highlights his diverse experience in data science, AI, and DevOps, showcasing his proficiency in various domains and technologies.
Budget:
We request a grant of $90,000 to support the project’s development over a period of nine months. The budget allocation is as follows:
Jules Lai will serve as part-time project manager as well as a developer.
- Jules Lai (CTO of Fabstir, London): $40,000
- Abdur Rub S. (AI/Data Science & DevOps, Elasticsearch, Pakistan): $10,000
- 3rd developer with Java and Elasticsearch skills: $25,000
Miscellaneous expenses breakdown:
- Hardware, such as a laptop: $2,500
- Running fees for decentralised cloud computing (GPU environment): $5,500
- Miscellaneous costs (e.g. accounting, legal, other business fees): $1,000
- Smart contract audit: $4,000
- Contingency: unplanned software upgrades or licensing costs, faulty hardware replacements, additional cloud, networking or compute costs, inflation, and discovered costs for security measures and testing: $2,000
Proposed Timeline (9 months):
Q1
- Project initiation, requirement gathering, and planning
- A working test model from Huggingface that can be trained efficiently with custom data
- Development of code to enable SIA storage as an Elasticsearch snapshot repository with feature states
Q2
- Front-end browser GUI dapp for description uploading and customisation, plus integration of Fabstir Media Player
- Deployment and replication testing of the Elasticsearch snapshot repository on SIA storage
- Integration of SIA storage and Elasticsearch, dapp testing, and bug fixing (initial deployment)
Q3
- User interface and API enhancements and performance optimisation
- Deployment of live search platform for public beta testing
- Final testing, documentation, and project delivery
Open-Source Statement:
The project will be developed using the MIT license, which ensures the open-source nature of the software and encourages collaboration and community involvement. By adopting an open-source approach, we aim to foster transparency, innovation, and wider adoption within the SIA community.
Technology:
The project will utilise the following technologies:
- Elasticsearch: for indexing and search capabilities
- SIA decentralised storage: for secure and scalable storage of Elasticsearch snapshots
- S5 and s5client-js: for CDN and API access to SIA
- Open-source LLM models (e.g., from Huggingface): for the AI model pipeline and customisation
- Web3 technologies: for developing the user interface and interacting with decentralised storage
- Fabstir Media Player: for digital asset tokenisation, media transcoding, consumption and registration to the Sia Hub registry for AI vectorisation and search
- Programming languages used will be Python, Java, JavaScript/TypeScript and Rust.
- The code developed will handle all the metadata JSON objects, serialised binary data, etc. Tests will also be performed with Elastic Cloud on Kubernetes (ECK), the official operator for managing Elastic Stack applications on Kubernetes, to ensure that replication and syncing work for battle-ready deployments.
- Note that at some point Elasticsearch will introduce a more efficient stateless architecture, and we plan to adapt to that.
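As a sketch of the metadata JSON handling mentioned above, the snippet below shows the shape of a registry document the dapp could submit when a creator opts in to indexing an NFT. All field names, and the idea of keying documents by chain ID, contract address and token ID, are illustrative assumptions rather than a finalised schema.

```python
# Assumption: registry documents are keyed by chain:contract:token and hold
# the text the LLM pipeline indexes plus an S5 CID link to the asset on Sia.

def registry_document(chain_id: int, contract: str, token_id: int,
                      title: str, description: str, cid: str) -> tuple[str, dict]:
    """Returns (document id, body) for PUT /nft-registry/_doc/<id>."""
    doc_id = f"{chain_id}:{contract}:{token_id}"
    body = {
        "title": title,
        "description": description,  # text the LLM pipeline vectorises
        "cid": cid,                   # S5 content link to the asset on Sia
        "opt_in": True,               # only creator-approved assets are indexed
    }
    return doc_id, body

doc_id, body = registry_document(1, "0xAbC...", 42, "Ocean Suite",
                                 "Ambient album about oceans", "z2aExampleCid")
```

Keying by chain, contract and token keeps registration idempotent: re-registering the same NFT simply overwrites its existing document instead of creating a duplicate.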
Risks:
Technical Challenges: Potential complexities in integrating SIA storage with Elasticsearch and ensuring seamless functionality. Performance and reliability will be a strong focus. We do have a working prototype AI search engine that uses an open-source LLM model and an Elasticsearch index, though it is currently stored centrally in a MySQL database.
Timeline: Unforeseen obstacles that may delay project milestones and deliverables.
Adoption: Encouraging Web3 projects to embrace decentralised storage and open-source LLM models, and promoting the benefits of a Web3 approach to decentralised search.
Future Development:
- Beyond the proposed project timeline, further development and enhancement of the platform, including expanding support for additional open-source LLM model use cases as they improve.
- Integration with NFT open market platforms to popularise decentralised AI search.
- Expansion to non-EVM blockchains.
- Governance to empower the search network's future based on community involvement.
We believe that this project aligns with the Web3 ethos and offers a unique opportunity to shape the future of AI through decentralised storage and content search.
Thank you for considering our proposal. We look forward to the possibility of receiving grant funding to support this innovative project.
Sincerely,
Jules Lai
CTO, Fabstir