This proposal seeks grant funding for the development of a new project by Fabstir, as part of Fabstir AI, focused on storing Elasticsearch snapshots on Sia decentralised data storage. Elasticsearch is a highly scalable, distributed search and analytics engine. It is commonly used in AI contexts to store, search, and analyse large volumes of data, including textual data, which is relevant to many AI applications.
An Elasticsearch snapshot is a backup of a running Elasticsearch cluster, or a subset of its indices, stored in a repository. Snapshots include the AI data plus scripts and queries. Snapshots can be used to recover data after deletion or a hardware failure, to replicate data across nodes, or to transfer data between clusters or applications. Snapshots are incremental: each one copies only the data that has changed since the last snapshot, saving storage space and network bandwidth.
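To make this concrete, the sketch below shows the request bodies for the two standard Elasticsearch snapshot API calls: registering a repository and taking a snapshot. The repository type "sia" and its settings are assumptions, standing in for the custom SIA-backed repository plugin this project proposes; the bucket name, endpoint and index pattern are illustrative.

```python
import json

# Body for PUT _snapshot/sia_repo -- registers a snapshot repository.
# The "sia" repository type is hypothetical: it represents the plugin
# this project will build.
register_repo = {
    "type": "sia",
    "settings": {
        "bucket": "es-snapshots",             # assumed S5/SIA bucket name
        "endpoint": "http://localhost:5050",  # assumed local S5 node
        "compress": True,                     # compress metadata files
    },
}

# Body for PUT _snapshot/sia_repo/snapshot_1 -- takes a snapshot of
# selected indices. Snapshots are incremental: Elasticsearch copies
# only files not already present in the repository.
create_snapshot = {
    "indices": "ai-data-*",         # illustrative index pattern
    "include_global_state": False,
}

print(json.dumps(register_repo, indent=2))
print(json.dumps(create_snapshot, indent=2))
```

In practice these bodies would be sent to the cluster over HTTPS (e.g. with an Elasticsearch client library); they are shown here only to illustrate the repository abstraction the plugin must implement.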
The objective is to create an alternative ecosystem for storing custom AI model data, via Elasticsearch snapshots, that would typically be stored on AWS, Azure, or Google Cloud, thereby shifting away from centralised cloud providers to embrace open-source LLM models and decentralised storage on the SIA network.
We will also guarantee privacy when it comes to your usage history and custom data. For example, OpenAI (an ironic name, given that their code is closed-source), creators of ChatGPT, state in their disclaimer that they retain a window of a month in which to sift through your data. That alone should rule out enterprise usage, and deter anyone who cares about privacy and the possibility of their data being used against them.
This project aims to empower Web3 projects to integrate AI features while upholding the principles of decentralisation and democratisation.
The purpose of this project is to establish a secure and scalable solution for storing Elasticsearch snapshots on SIA decentralised storage. By leveraging open-source LLM models, such as those from Huggingface, and ensuring data encryption on decentralised storage, we intend to provide a viable alternative to centralised corporations, allowing for a true democratic approach to AI development and its impact.
The AI industry is predominantly controlled by large corporations such as Microsoft and Google. The decision-making power lies in the hands of the company heads and boards, primarily driven by shareholder interests. This centralised control poses challenges for creating a democratic ecosystem where the people can actively participate in shaping the future of AI. Our project aims to address this problem by utilising open-source models and decentralised storage, providing an alternative platform where decisions about AI can be driven by the community.
The proposed solution involves the development of software that enables SIA storage to serve as a repository for Elasticsearch snapshots. To develop the required code, plugins and interfaces, we will engage three programmers: Jules Lai and Abdur Rub S., who possess the necessary expertise in Web3 and AI respectively, plus a third developer.
Additionally, we will create a user-friendly front-end browser GUI dapp that allows users to upload documents and other materials to an open-source LLM model for customisation, specifically targeting the Falcon AI model (a large language model with 40 billion parameters trained on one trillion tokens). We will ensure that the Elasticsearch index and relevant data can be securely stored as snapshots on SIA decentralised storage.
The explosive growth of AI models and the increasing demand for AI data present a significant opportunity for the SIA decentralised storage network ecosystem. Fabstir AI, through this project, aims to store users’ training and usage data on SIA decentralised storage, capitalising on this boom.
By leveraging the SIA network, Fabstir AI addresses several key challenges and provides unique advantages for AI data storage. Firstly, the SIA network’s decentralised nature ensures data sovereignty and empowers users to have greater control over their data. With the increasing concerns over data privacy and security, this approach aligns perfectly with the needs of Web3 AI developers and users.
Secondly, the scalability and cost efficiency of the SIA network make it an ideal choice for storing the massive amounts of data required for training and operating AI models. As the demand for AI data continues to grow exponentially, the SIA network’s ability to seamlessly scale storage capacity provides a cost-effective solution for Fabstir AI, other companies and individuals.
Moreover, by leveraging SIA decentralised storage, Fabstir AI contributes to the vision of a true Web3 ecosystem, where AI technologies can integrate with decentralised applications (Dapps) while maintaining the principles of decentralisation and data ownership. By offering an alternative to centralised cloud providers, Fabstir AI enables Web3 projects to embrace AI features and rightfully call themselves Dapps, ensuring a democratic and transparent approach to AI development.
Furthermore, the collaboration between Fabstir AI and the SIA Foundation opens avenues for fostering innovation and community involvement within the SIA ecosystem. The project showcases the power of the SIA network in handling the substantial data storage demands of AI models, attracting AI developers, researchers, and enthusiasts to explore the capabilities of decentralised storage and contribute to the growth of the SIA ecosystem.
In conclusion, the synergy between the booming demand for AI data and the capabilities of the SIA decentralised storage network presents a unique opportunity for Fabstir AI’s project. By storing users’ training and usage data on SIA, Fabstir AI not only addresses the challenges faced by the AI industry but also contributes to the Web3 ethos of a decentralised and democratic future. This collaboration between Fabstir AI and the SIA Foundation is poised to drive innovation, foster community participation, and shape the future of AI storage.
Together, we can build a more inclusive and democratic ecosystem where the fate of AI and its impact is decided by the people.
An AI system that uses Elasticsearch can consist of several components, such as:
- Data: The raw or processed data that is used for training or inference of AI models. Data can be stored in Elasticsearch indices or data streams, which can be included in snapshots.
- Models: The trained or imported AI models that are used for inference or prediction tasks. Models live in system indices tracked as Elasticsearch feature states, which can be listed with the Features API and included in snapshots via the snapshot API’s feature_states option.
- Scripts: The scripts that are used to define the logic or parameters of AI tasks, such as natural language processing, vector search, etc. Scripts can be stored in Elasticsearch ingest pipelines, which are part of the cluster state and can be included in snapshots.
- Queries: The queries that are used to interact with the AI system, such as asking questions, searching for documents, finding anomalies, etc. Queries can be stored in Elasticsearch indices or data streams, which can be included in snapshots.
Therefore, snapshots can store the AI model data, scripts, and queries. To back up and restore the entire AI system, we will use a combination of snapshots and feature states. The latter stores the AI models.
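A hedged sketch of that combined backup: a single create-snapshot request capturing the data indices, the cluster state (which holds the ingest-pipeline scripts), and the machine_learning feature state whose system indices hold the models. The index patterns below are illustrative, not part of any existing deployment.

```python
import json

# Sketch of PUT _snapshot/<repo>/<snapshot> covering the whole AI system.
full_backup = {
    "indices": "ai-data-*,ai-queries-*",     # data and stored queries (illustrative)
    "feature_states": ["machine_learning"],  # system indices holding the models
    "include_global_state": True,            # cluster state, incl. ingest pipelines
}
print(json.dumps(full_backup, indent=2))
```

Restoring the same snapshot brings back all three pieces together, which is what lets one snapshot act as a backup of the entire AI system rather than of the raw data alone.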
Elasticsearch was chosen for this project owing to its robust capabilities especially suited for Large Language Models (LLMs). Its primary advantages include:
- Scalability and Performance: Elasticsearch’s horizontal scalability efficiently manages large datasets common in AI models. Its real-time processing enhances performance.
- Robust Full-Text Search: Elasticsearch excels in extensive textual data search, supporting complex queries and providing rapid results, an essential feature for LLMs.
- Distributed and Highly Available: Its distributed nature enhances system speed and reliability, ensuring high availability and fault tolerance, critical when dealing with significant AI model data.
- Flexible Data Handling: As a document-oriented and schema-less system, Elasticsearch aligns perfectly with the evolving JSON-based structures often used in AI, offering flexibility in data management.
- Integration with AI Tools: Easy integration with popular AI and machine learning tools simplifies the deployment and scaling of LLMs.
In essence, Elasticsearch’s versatility and robust capabilities make it an excellent choice for handling model data within AI projects. It is also open-source and free to use. In our experience so far, the Elastic team have been responsive to our queries. There is a commercial version with extra features, but this project does not use those features for now.
Elasticsearch has a wide range of uses beyond AI. Key uses include full-text search, providing fast results within vast text data; log and event data analysis, useful for finding errors and monitoring performance; and real-time analytics, offering quick complex queries for data-driven decision-making. Its scalability suits big data applications, with the capacity to distribute search across shards on multiple nodes. Elasticsearch supports e-commerce search, delivering speedy, relevant results, and application search, powering web application search functions. Its toolset, including Beats and Kibana, facilitates IT infrastructure performance monitoring. Elasticsearch can handle complex geospatial queries and function as a flexible, document-oriented NoSQL database. With this project, all these use cases can gain access to decentralised SIA storage.
Abdur Rub S. (AI/Data Science & DevOps): Abdur Rub S. is a proficient data scientist and associate solution architect known for his expertise in artificial intelligence (AI) and data science. With diversified experience in development and leadership roles, his focus lies in various areas, including Conversational AI, Machine Learning, Natural Language Understanding, Data Engineering, Big Data, Cloud Computing, and DevOps. Abdur possesses comprehensive skills and knowledge, demonstrating expertise in Python, AI chatbots, Django, GPT-3, ChatGPT, Natural Language Processing, Machine Learning, and Cloud Architecture. His passion for AI and data science has driven him to contribute to cutting-edge projects, combining his skills with a solid foundation in cloud computing and DevOps. Abdur Rub S. is committed to delivering innovative AI solutions, leveraging his experience and expertise to drive successful outcomes.
These bios provide an overview of Jules Lai’s background as a software developer, as a Web3 full stack developer, and his expertise in computing and mathematics as well as his continued contribution to the SIA community. In the case of Abdur Rub S., his bio highlights his diverse experience in data science, AI, and DevOps, showcasing his proficiency in various domains and technologies.
We request a grant of $70,000 to support the project’s development over a period of six months. The budget allocation is as follows:
Jules Lai will serve as the project manager as well as a developer.
- Jules Lai (CTO of Fabstir, London): $30,000
- Abdur Rub S. (AI/Data Science & DevOps, Pakistan): $10,000
- 3rd Developer with Java and DevOps skills: $20,000
- Marketing costs: $1,500
- Workshops, docs and tutorials on the usage of AI with SIA storage and Elasticsearch: $1,500
- Web3 + AI conference attendance: $1,000
- Hardware, such as a laptop: $2,500
- Running fees for decentralised cloud computing (e.g., cloud test environments, GPU costs): $2,500
- Miscellaneous costs (e.g., accounting, legal, unknown unknowns): $1,000
Month 1: Project initiation, requirement gathering, and planning
Month 2a: A working test model from Huggingface that can be trained efficiently with custom data.
Month 2b: Development of code to enable SIA storage as an Elasticsearch snapshot repository with feature states.
Month 3a: Development of front-end browser GUI dapp for document uploading and customisation
Month 3b: Deployment and replication testing of Elasticsearch snapshot repository on SIA storage
Month 4: Integration of SIA storage and Elasticsearch, dapp testing, and bug fixing (initial deployment)
Month 5: User interface enhancements and performance optimisation
Month 6: Final testing, documentation, and project delivery
The project will be developed using the MIT license, which ensures the open-source nature of the software and encourages collaboration and community involvement. By adopting an open-source approach, we aim to foster transparency, innovation, and wider adoption within the SIA community.
The project will utilise the following technologies:
- Elasticsearch: for indexing and search capabilities
- SIA decentralised storage: for secure and scalable storage of Elasticsearch snapshots
- S5 and s5client-js: for CDN and API access to SIA
- Open-source LLM models (e.g., from Huggingface): for AI model customisation
- Web3 technologies: for developing the user interface and interacting with decentralised storage
- The code developed will handle all the metadata JSON objects, serialised binary data, etc. Tests will also be performed with Elastic Cloud on Kubernetes (ECK), the official operator for managing Elastic Stack applications on Kubernetes, to ensure that replication and syncing work for battle-ready deployments.
- Note that Elasticsearch plans to introduce a more efficient stateless architecture at some point, and we plan to adapt to that.
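To give a feel for the metadata handling mentioned above, the sketch below maps each snapshot file (a metadata JSON object or a serialised binary segment) to a content identifier, roughly as the repository code would before handing the bytes to an S5 node. The sha-256 naming scheme here is a stand-in for S5's real CID format, and the upload step is deliberately omitted.

```python
import hashlib
import json

def store_blob(blob_map: dict, name: str, data: bytes) -> str:
    """Record a snapshot file under a content identifier.

    The CID scheme (sha-256 hex) is illustrative only; a real
    implementation would use S5's own content addressing and
    upload the bytes to an S5 node at this point.
    """
    cid = "sha256:" + hashlib.sha256(data).hexdigest()
    blob_map[name] = cid  # name -> CID mapping kept as repository metadata
    return cid

blob_map = {}
# A snapshot repository holds both JSON metadata files...
store_blob(blob_map, "index-0", json.dumps({"snapshots": []}).encode())
# ...and opaque serialised binary segment data.
store_blob(blob_map, "snap-1.dat", b"\x00\x01binary segment data")
print(blob_map)
```

Keeping this name-to-CID map consistent across nodes is exactly what the ECK-based replication and syncing tests described above are meant to exercise.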
- Technical Challenges: Potential complexities in integrating SIA storage with Elasticsearch and ensuring seamless functionality.
- Timeline: Unforeseen obstacles that may delay project milestones and deliverables.
- Adoption: Encouraging Web3 projects to embrace decentralised storage and open-source LLM models, and promoting the benefits of a democratic approach to AI.
Beyond the proposed project timeline, Fabstir AI envisions further development and enhancement of the platform. This includes expanding support for additional open-source LLM models, offering premium subscription services for businesses, and integrating with other decentralised storage and compute providers to foster a community-driven ecosystem around AI and decentralised technologies.
We believe that this project aligns with the Web3 ethos and offers a unique opportunity to shape the future of AI through decentralised decision-making.
Thank you for considering our proposal. We look forward to the possibility of receiving grant funding to support this innovative project.