Grant Proposal: Decentralising AI Storage of Elasticsearch Snapshots

Project Name: Decentralising AI Storage of Elasticsearch Snapshots

Organisation: Fabstir

Primary Contact: Jules Lai (Founder and CTO of Fabstir)

Summary:

This proposal seeks grant funding for a new Fabstir project, developed as part of Fabstir AI, focused on storing Elasticsearch snapshots on Sia, a decentralised data storage network. Elasticsearch is a highly scalable, distributed search and analytics engine. It is commonly used in AI contexts to store, search, and analyse large volumes of data, including the textual data relevant to many AI applications.

An Elasticsearch snapshot is a backup of a running Elasticsearch cluster, or a subset of its indices, stored in a repository. Snapshots include the AI data plus scripts and queries. They can be used to recover data after accidental deletion or hardware failure, to replicate data across nodes, or to transfer data between clusters or applications. Snapshots are incremental: each one copies only the data that has changed since the last snapshot, saving storage space and network bandwidth.
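As a concrete sketch, this workflow uses Elasticsearch's standard snapshot REST API. The repository name and mount path below are illustrative placeholders, and the built-in `fs` repository type is used here only under the assumption that the Sia storage is exposed as a mounted file system; the plugin this project develops would register its own repository type instead.

```console
# Register a snapshot repository (illustrative name and path)
PUT _snapshot/sia_backups
{
  "type": "fs",
  "settings": { "location": "/mnt/sia" }
}

# Take a snapshot; subsequent snapshots of the same indices
# are automatically incremental
PUT _snapshot/sia_backups/snapshot_1?wait_for_completion=true
```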

The objective is to create an alternative ecosystem for storing custom AI model data, via Elasticsearch snapshots, that would typically be stored on AWS, Azure, or Google Cloud. This shifts storage away from centralised cloud providers towards open-source LLM models and decentralised storage on the SIA Network.

We will also guarantee privacy when it comes to your usage history and custom data. For example, OpenAI (an ironic name, given that their code is closed-source), creators of ChatGPT, reserve a window of a month, according to their disclaimer, in which they may sift through your data. That alone should rule out enterprise usage, and use by anyone who cares about privacy and the possibility of their data being used against them.

This project aims to empower Web3 projects to integrate AI features while upholding the principles of decentralisation and democratisation.

Purpose:

The purpose of this project is to establish a secure and scalable solution for storing Elasticsearch snapshots on SIA decentralised storage. By leveraging open-source LLM models, such as those from Huggingface, and ensuring data encryption on decentralised storage, we intend to provide a viable alternative to centralised corporations, allowing for a true democratic approach to AI development and its impact.

The Problem:

The AI industry is predominantly controlled by large corporations such as Microsoft and Google. The decision-making power lies in the hands of the company heads and boards, primarily driven by shareholder interests. This centralised control poses challenges for creating a democratic ecosystem where the people can actively participate in shaping the future of AI. Our project aims to address this problem by utilising open-source models and decentralised storage, providing an alternative platform where decisions about AI can be driven by the community.

The Solution:

The proposed solution involves developing software that enables SIA storage to serve as a repository for Elasticsearch snapshots. To develop the required code, plugins and interfaces, we will hire three programmers: Jules Lai and Abdur Rub S., who possess the necessary expertise in Web3 and AI respectively, plus a third developer.

Additionally, we will create a user-friendly front-end browser GUI dapp that allows users to upload documents and other materials to an open-source LLM model for customisation, specifically targeting the Falcon AI model (a large language model with 40 billion parameters trained on one trillion tokens). We will ensure that the Elasticsearch index and relevant data can be securely stored as snapshots on SIA decentralised storage.

Data Storage Requirements on SIA Network:

The explosive growth of AI models and the increasing demand for AI data present a significant opportunity for the SIA decentralised storage network ecosystem. Fabstir AI, through this project, aims to store users’ training and usage data on SIA decentralised storage, capitalising on this boom.

By leveraging the SIA network, Fabstir AI addresses several key challenges and provides unique advantages for AI data storage. Firstly, the SIA network’s decentralised nature ensures data sovereignty and empowers users to have greater control over their data. With the increasing concerns over data privacy and security, this approach aligns perfectly with the needs of Web3 AI developers and users.

Secondly, the scalability and cost efficiency of the SIA network make it an ideal choice for storing the massive amounts of data required for training and operating AI models. As the demand for AI data continues to grow exponentially, the SIA network’s ability to seamlessly scale storage capacity provides a cost-effective solution for Fabstir AI, other companies and individuals.

Moreover, by leveraging SIA decentralised storage, Fabstir AI contributes to the vision of a true Web3 ecosystem, where AI technologies can integrate with decentralised applications (Dapps) while maintaining the principles of decentralisation and data ownership. By offering an alternative to centralised cloud providers, Fabstir AI enables Web3 projects to embrace AI features and rightfully call themselves Dapps, ensuring a democratic and transparent approach to AI development.

Furthermore, the collaboration between Fabstir AI and the SIA Foundation opens avenues for fostering innovation and community involvement within the SIA ecosystem. The project showcases the power of the SIA network in handling the substantial data storage demands of AI models, attracting AI developers, researchers, and enthusiasts to explore the capabilities of decentralised storage and contribute to the growth of the SIA ecosystem.

In conclusion, the synergy between the booming demand for AI data and the capabilities of the SIA decentralised storage network presents a unique opportunity for Fabstir AI’s project. By storing users’ training and usage data on SIA, Fabstir AI not only addresses the challenges faced by the AI industry but also contributes to the Web3 ethos of a decentralised and democratic future. This collaboration between Fabstir AI and the SIA Foundation is poised to drive innovation, foster community participation, and shape the future of AI storage.

Together, we can build a more inclusive and democratic ecosystem where the fate of AI and its impact is decided by the people.

Storing Elasticsearch Snapshots on SIA

An AI system that uses Elasticsearch can consist of several components, such as:

  • Data: The raw or processed data that is used for training or inference of AI models. Data can be stored in Elasticsearch indices or data streams, which can be included in snapshots.
  • Models: The trained or imported AI models that are used for inference or prediction tasks. Models can be stored in Elasticsearch feature states, which can be included in snapshots alongside the cluster state and restored selectively; the Features API lists the feature states available on a cluster.
  • Scripts: The scripts that are used to define the logic or parameters of AI tasks, such as natural language processing, vector search, etc. Scripts can be stored in Elasticsearch ingest pipelines, which are part of the cluster state and can be included in snapshots.
  • Queries: The queries that are used to interact with the AI system, such as asking questions, searching for documents, finding anomalies, etc. Queries can be stored in Elasticsearch indices or data streams, which can be included in snapshots.

Therefore, snapshots can store the AI model data, scripts, and queries. To back up and restore the entire AI system, we will use a combination of snapshots and feature states. The latter stores the AI models.
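As an illustrative sketch of this combination (the repository, snapshot name and index pattern below are placeholders), a single snapshot request can capture both the data indices and the machine-learning feature state, and a restore request reverses the operation:

```console
# Snapshot selected indices plus the ML feature state
PUT _snapshot/sia_backups/ai_system_snap
{
  "indices": "ai-data-*",
  "feature_states": ["machine_learning"]
}

# Restore the same indices and feature state from the snapshot
POST _snapshot/sia_backups/ai_system_snap/_restore
{
  "indices": "ai-data-*",
  "feature_states": ["machine_learning"]
}
```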

Benefits of Using Elasticsearch for LLM Models

Elasticsearch was chosen for this project owing to its robust capabilities especially suited for Large Language Models (LLMs). Its primary advantages include:

  1. Scalability and Performance: Elasticsearch’s horizontal scalability efficiently manages large datasets common in AI models. Its real-time processing enhances performance.

  2. Robust Full-Text Search: Elasticsearch excels in extensive textual data search, supporting complex queries and providing rapid results, an essential feature for LLMs.

  3. Distributed and Highly Available: Its distributed nature enhances system speed and reliability, ensuring high availability and fault tolerance—critical when dealing with significant AI model data.

  4. Flexible Data Handling: As a document-oriented and schema-less system, Elasticsearch aligns perfectly with the evolving JSON-based structures often used in AI, offering flexibility in data management.

  5. Integration with AI Tools: Easy integration with popular AI and machine learning tools simplifies the deployment and scaling of LLMs.

In essence, Elasticsearch’s versatility and robust capabilities make it an excellent choice for handling model data within AI projects. It is also open-source and free to use, and, in my experience so far, the Elastic team have been responsive to queries. A commercial version with extra features exists, but this project does not require those features for now.

Elasticsearch beyond AI

Elasticsearch has a wide range of uses beyond AI. Key uses include full-text search, providing fast results within vast text data; log and event data analysis, useful for finding errors and monitoring performance; and real-time analytics, offering quick complex queries for data-driven decision-making. Its scalability suits big data applications, with the capacity to distribute search across shards on multiple nodes. Elasticsearch supports e-commerce search, delivering speedy, relevant results, and application search, powering web application search functions. Its toolset, including Beats and Kibana, facilitates IT infrastructure performance monitoring. Elasticsearch can handle complex geospatial queries and function as a flexible, document-oriented NoSQL database. With this project, all these use cases can gain access to decentralised SIA storage.

Bios

Jules Lai (CTO of Fabstir): Jules Lai, based in London, is the Chief Technology Officer (CTO) of Fabstir. With a strong academic background in computing and mathematics, Jules holds both a degree and an MSc in these fields. Previously, he worked as a senior software developer, designing and implementing financial modelling software used by esteemed organisations such as Lloyd’s of London and the Bank of England. Beyond his professional endeavours, Jules has made a significant impact on the UK filmmaking and film business communities for over a decade. As CTO of Fabstir, Jules has applied his skills in JavaScript/TypeScript, Solidity, React, Rust and DevOps to various Web3 projects. He has previously received a grant from the SIA Foundation and is actively working within the SIA open-source community, having developed code for video streaming using S5, a MetaMask plugin and encryption enhancements to s5client-js, as well as a new platform for video transcoding and examples of cloud deployments for the codebase developed; this goes beyond the original project proposal.

Abdur Rub S. (AI/Data Science & DevOps): Abdur Rub S. is a proficient data scientist and associate solution architect known for his expertise in artificial intelligence (AI) and data science. With diversified experience in development and leadership roles, his focus lies in various areas, including Conversational AI, Machine Learning, Natural Language Understanding, Data Engineering, Big Data, Cloud Computing, and DevOps. Abdur possesses comprehensive skills and knowledge, demonstrating expertise in Python, AI chatbots, Django, GPT-3, ChatGPT, Natural Language Processing, Machine Learning, and Cloud Architecture. His passion for AI and data science has driven him to contribute to cutting-edge projects, combining his skills with a solid foundation in cloud computing and DevOps. Abdur Rub S. is committed to delivering innovative AI solutions, leveraging his experience and expertise to drive successful outcomes.

These bios provide an overview of Jules Lai’s background as a software developer, as a Web3 full stack developer, and his expertise in computing and mathematics as well as his continued contribution to the SIA community. In the case of Abdur Rub S., his bio highlights his diverse experience in data science, AI, and DevOps, showcasing his proficiency in various domains and technologies.

Budget:

We request a grant of $70,000 to support the project’s development over a period of six months. The budget allocation is as follows:

Jules Lai to be the project manager as well as a developer.

  • Jules Lai (CTO of Fabstir, London): $30,000
  • Abdur Rub S. (AI/Data Science & DevOps, Pakistan): $10,000
  • 3rd Developer with Java and DevOps skills: $20,000

Miscellaneous Expenses ($10,000) breakdown:

  • Marketing costs - $1500
  • Workshops, docs and tutorials on the usage of AI with SIA storage and Elasticsearch - $1500
  • Web3 + AI conference attendance - $1000
  • Hardware, such as laptop - $2500
  • Running fees for decentralised cloud computing (e.g., for cloud test environments, GPU costs) - $2500
  • Miscellaneous costs (e.g. accounting, legal, unknown unknowns) - $1000

Proposed Timeline (6 months):

Month 1: Project initiation, requirement gathering, and planning
Month 2a: A working test model from Huggingface that can be trained efficiently with custom data.
Month 2b: Development of code to enable SIA storage as an Elasticsearch snapshot repository with feature states.
Month 3a: Development of front-end browser GUI dapp for document uploading and customisation
Month 3b: Deployment and replication testing of Elasticsearch snapshot repository on SIA storage
Month 4: Integration of SIA storage and Elasticsearch, dapp testing, and bug fixing (initial deployment)
Month 5: User interface enhancements and performance optimisation
Month 6: Final testing, documentation, and project delivery

Open-Source Statement:

The project will be developed using the MIT license, which ensures the open-source nature of the software and encourages collaboration and community involvement. By adopting an open-source approach, we aim to foster transparency, innovation, and wider adoption within the SIA community.

Technology:

The project will utilise the following technologies:

  • Elasticsearch: for indexing and search capabilities
  • SIA decentralised storage: for secure and scalable storage of Elasticsearch snapshots
  • S5 and s5client-js for CDN and API access to SIA
  • Open-source LLM models (e.g., from Huggingface): for AI model customisation
  • Web3 technologies: for developing the user interface and interacting with decentralised storage
  • Programming languages used will be Python, Java, JavaScript/TypeScript and Rust.
  • The code developed will handle all the metadata JSON objects, serialised binary data, etc. Tests will be performed with Elastic Cloud on Kubernetes (ECK), the official operator for managing Elastic Stack applications on Kubernetes, to ensure that replication and syncing work for battle-ready deployments.
  • Note that at some point Elasticsearch will introduce a more efficient stateless architecture and we plan to adapt to that.
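For the ECK-based replication and syncing tests mentioned above, a minimal test-cluster manifest would look roughly like the following; the cluster name, version and node count are placeholders following ECK's documented quickstart pattern, not settings fixed by this proposal:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: snapshot-test          # placeholder cluster name
spec:
  version: 8.9.0               # placeholder Elastic Stack version
  nodeSets:
  - name: default
    count: 3                   # multiple nodes, so replication can be exercised
    config:
      node.store.allow_mmap: false   # common setting for test environments
```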

Risks:

Technical Challenges: Potential complexities in integrating SIA storage with Elasticsearch and ensuring seamless functionality.
Timeline: Unforeseen obstacles that may delay project milestones and deliverables.
Adoption: Encouraging Web3 projects to embrace decentralised storage and open-source LLM models, and promoting the benefits of a democratic approach to AI.

Future Development:

Beyond the proposed project timeline, Fabstir AI envisions further development and enhancement of the platform, including:

  • Expanding support for additional open-source LLM models
  • Offering premium subscription services for businesses
  • Integrating with other decentralised storage and compute providers, to foster a community-driven ecosystem around AI and decentralised technologies

We believe that this project aligns with the Web3 ethos and offers a unique opportunity to shape the future of AI through decentralised decision-making.

Thank you for considering our proposal. We look forward to the possibility of receiving grant funding to support this innovative project.

Sincerely,

Jules Lai
CTO, Fabstir

Hello Jules,

Thanks so much for your new proposal Jules!

The committee would like to see the following before funding another project:

  • the completion of the first Fabstir grant
  • more open-source code showing the technical progress of Fabstir

The committee also mentioned possibly refining the proposal to focus on the integration of Sia as a storage option for Elasticsearch, and not the GUI dapp part of the project related to the LLM. Could you comment on the importance of that part of the project?

Regards,
Kino on behalf of the Sia Foundation and Grants Committee

Hi Kinomora,

Thanks for your consideration.

I’m off the beaten path working on encryption and memory optimisation for s5/s5client-js with redsolver and parajbs-dev, so will be mostly open-source code from there this month. It’s a prerequisite for what I need working, in order to handle larger video files. This would obviously be beneficial to the community so progress made in other ways.

I am currently developing with an AI developer that I hired, an AI chat search engine for the video project, a customer service module and some other use cases. These would be trained with custom data. So was thinking of generalising it by adding a GUI to open-source so that anyone in the SIA community can use it for their own needs. This won’t be required if the preference is to focus on integration with SIA for Elasticsearch, hence can take this out of the proposal.

‘Focusing on the integration of Sia as a storage option for Elasticsearch’, is to be done anyway within next 3 months. In this case, everything can be halved, budget and time. I will have to bootstrap this (that I was really hoping not to do) and hire a Java dev to work with me.

I can’t put the AI work back six months, but I understand your concerns. I guess the timing is just off for this one?

Hello @juleslai,

We really appreciate the info Jules, but the committee has decided to reject this proposal.

Even though the timeline of your current work would dictate that the Elasticsearch grant start soon, the committee still prefers to see a completed grant proposal - or significant open-source progress on a first proposal - before granting a second proposal approval to the same team.

If the timing works out once the Fabstir grant comes to a close, please re-submit this proposal with a focus purely on the Sia integration, we’d be happy to see it.

Thanks,
Kino on behalf of the Sia Foundation and Grants Committee

Hi Kino,

Many thanks for the Committee’s consideration.

I am currently bootstrapping the work anyway as it’s necessary for my work.

Perhaps for open-source, it can be looked upon retrospectively at some point.

Best,
Jules