Standard Grant: Chi

Project Name: Chi, A Community-Powered Platform for Multilingual Audio Data Collection

Name of the organization or individual submitting the proposal: Princess Innocent

Describe your project

Overview

Chi is a lightweight, privacy-conscious platform designed to collect, organize, and store audio recordings of indigenous languages. The platform enables native speakers to record translations of English words, phrases, and sentences using their own voices, forming a foundational dataset for its future AI translation model.

The Global South is home to over 5,000 spoken languages, yet most speech AI systems ignore or underrepresent them because accessible, labeled, and inclusive audio datasets are lacking. Without data, these languages risk digital extinction.

Endangered languages are dying at an accelerating rate because of globalization, mass migration, cultural replacement, and linguicide. Approximately 454 known languages have become extinct in recent times, and over 3,000 spoken languages (43% of the total) are considered endangered.

Existing efforts (like Mozilla’s Common Voice) barely scratch the surface of Asian and African language diversity and often rely on written text, which excludes non-literate speakers.

Alongside providing users with a list of all spoken languages, Chi solves this by:

  • Empowering native speakers to record spoken translations of AI-generated prompts in their own languages.
  • Storing that data securely on decentralized storage, giving researchers, developers, and communities access to ethically sourced language data.

Chi web currently has:

  • 100+ users contributing
  • 400+ recordings
  • Over 90 languages from 3 continents recorded and counting.

These figures update automatically and can be viewed at the bottom of the home screen. A link to the proof of concept is provided below.


Who Benefits From Your Project?

  • Linguists and Researchers
  • Accessibility, Representation And Preservation of indigenous languages
  • AI And NLP Developers
  • Educational Institutions

Open Access To Entire Library

Our entire recordings library will be made publicly accessible and structured for easy integration into ML pipelines, both for outside developers and for our own model training.

We will maintain a database to organize all metadata.

Each record will store:

  • Language

  • Language code (Glottolog)

  • Prompt (the word/phrase/sentence they were asked to record)

  • CID (pointer to the recording)

  • Recording_text (the transcribed word/phrase/sentence in the target language; optional, if available). Indexed for fast queries.
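As a rough illustration of the record described above, here is a minimal Python sketch. The class name, field names, and the placeholder CID are assumptions made for the example; the proposal does not specify the actual database schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json


@dataclass
class RecordingRecord:
    """One metadata row per recording (illustrative names, not the real schema)."""
    language: str                          # e.g. "Babanki"
    language_code: str                     # Glottolog code, e.g. "baba1266"
    prompt: str                            # word/phrase/sentence the speaker was asked to record
    cid: str                               # pointer to the stored audio file
    recording_text: Optional[str] = None   # optional transcription in the target language


record = RecordingRecord(
    language="Babanki",
    language_code="baba1266",
    prompt="Good morning",
    cid="<cid-of-audio-file>",             # placeholder, not a real CID
)

# Serialize the record the way an API might return it.
print(json.dumps(asdict(record), indent=2))
```

Keeping recording_text optional mirrors the note that a transcription is only stored when one is available.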

We will develop a simple REST API that:

  • Exposes data directly via REST endpoints.
  • Accepts HTTP requests from researchers and developers.

Query parameters:

  • language (e.g., “Babanki”)
  • language_code (e.g., “baba1266”)
  • type (word/phrase/sentence)
  • date_range (for collection period)

The API returns JSON objects with metadata + CID.

  • Researchers and devs query exactly what they need (e.g., one to ten in Dhanki, all efifa phrases, all Marathi sentences), fetch metadata, and resolve audio files via S5 CID.

  • Directly fetch metadata + download links (via CID) without browsing a UI.

  • Always up-to-date, since API pulls from live database.
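The query flow above could look roughly like the following Python sketch. The base URL, the function names, and the shape of the JSON response are assumptions for illustration only; the actual endpoint paths and response format are not defined in this proposal.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real API location is not specified in the proposal.
BASE_URL = "https://api.example.org/recordings"


def build_query_url(language=None, language_code=None, item_type=None, date_range=None):
    """Build a request URL from the query parameters listed in the proposal."""
    params = {
        "language": language,
        "language_code": language_code,
        "type": item_type,          # word / phrase / sentence
        "date_range": date_range,   # format left open; treated as an opaque string here
    }
    # Drop parameters the caller did not set.
    params = {key: value for key, value in params.items() if value is not None}
    if not params:
        return BASE_URL
    return f"{BASE_URL}?{urlencode(params)}"


def extract_cids(response_items):
    """Collect CIDs from a decoded JSON response (assumed to be a list of objects)."""
    return [item["cid"] for item in response_items if "cid" in item]


url = build_query_url(language_code="baba1266", item_type="phrase")

# Stand-in for a decoded JSON response; the CID is a placeholder.
sample_response = [
    {"language": "Babanki", "language_code": "baba1266",
     "prompt": "Good morning", "cid": "<cid-1>"},
]
cids = extract_cids(sample_response)
```

Each CID returned this way would then be resolved to an audio file through S5, which is outside the scope of this sketch.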


How Does The Project Serve The Foundation’s Mission Of User-owned Data?

1. Decentralized Preservation of Cultural Knowledge

Indigenous languages are disappearing faster than they can be documented. By using Sia:

  • We store cultural data securely and immutably.
  • We consolidate the dataset required for model training.
  • The project demonstrates how Web3 tools can protect heritage, not just finance, and we hope the Foundation sees its value and potential.

2. Data Ownership for Indigenous Contributors

Chi is designed so that native speakers contribute voice recordings with full knowledge and consent — and their contributions are stored on Sia’s decentralized network.

This ensures:

  • Transparency: Contributors can verify and access the content they help create.
  • Autonomy: No corporation, government, or institution can lock or alter the cultural data once it’s on Sia.

3. Model for Future Decentralized Datasets

Chi Voice will serve as a replicable framework for other regions and cultures to follow.

By showing how Sia can power large-scale, ethically sourced voice datasets, we:

  • Encourage developers and researchers to use Sia for decentralized data hosting
  • Create momentum for a new standard of AI dataset sovereignty

Are you a resident of any jurisdiction on that list? No
Will your payment bank account be located in any jurisdiction on that list? No


Grant Specifics

Amount of money requested and justification with a reasonable breakdown of expenses:

Use of Funds: $35,000 requested

Item | Amount (USD)
Developer fees (full-stack platform development) | $32,000
OpenAI (GPT-5) API fees (estimated 25 tokens per output, $75/m) | $1,000
Storage fees (1 year) | $1,000
App platform fees | $125
Web + server hosting (1 year) | $375
Logo design | $500
Total | $35,000

What are the goals of this standard grant? Please provide a general timeline for completion.

Our goals are:

  • Develop a mobile application for the Chi platform.
  • Improve the existing web app and its UI/UX, and add new features.
  • Integrate Sia storage.
  • Develop a simple public API for developer access to the recordings library.

Month 1-2: Mobile App Development (Cross-Platform)

  • Define mobile-specific features and UI changes.
  • Implement UI design.

Month 2:

  • Integrate audio input/output for mobile.
  • Implement push notifications.
  • Add offline mode & language caching.
  • Connect with back-end.

Month 3: Sia Storage Integration

  • Integrate Sia storage.
  • Develop a simple REST API to act as a public-facing portal for library access.
  • Move and store existing and future audio files via S5.

Month 4: Testing and Fixes

  • Optimize for performance and battery use
  • Internal QA and bug fixes.
  • Beta release to test group.

Month 5: App Release

  • App Store & Play Store listing setup.
  • Official app launch

Potential risks that will affect the outcome of the project:

  • Low Participation in Rare Languages
    Some indigenous languages may have few active speakers, limiting dataset diversity.

  • Poor Audio Quality
    Background noise or unclear recordings may affect usability of submissions.

  • Incorrect Language Labeling
    Users may misidentify dialects, leading to inaccurate metadata.

  • Internet Access Constraints
    Contributors in rural areas may face challenges uploading recordings due to weak connectivity.

  • Legal Risk
    Collecting voice recordings raises data privacy and consent obligations.


Mitigations

  • Partner with local communities, NGOs and language groups to drive targeted outreach.
  • Provide in-app audio quality checks and guides for optimal recording.
  • Use verification by native speakers and cross-check against multiple submissions in the same language.
  • Enable offline recording with later upload when connectivity improves.
  • Obtain explicit user consent, provide clear terms of use and comply with data protection laws.
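The offline-recording mitigation can be pictured as a local queue that holds recordings until an upload succeeds. The Python sketch below is only an illustration under assumed names (PendingUploads, the JSON queue file, and the upload callback are all hypothetical); the mobile app itself may be built in a different language and store pending items differently.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


class PendingUploads:
    """File-backed list of recordings waiting for connectivity (hypothetical design)."""

    def __init__(self, queue_file: Path):
        self.queue_file = queue_file

    def _load(self) -> list:
        if self.queue_file.exists():
            return json.loads(self.queue_file.read_text())
        return []

    def _save(self, items: list) -> None:
        self.queue_file.write_text(json.dumps(items))

    def enqueue(self, audio_path: str, metadata: dict) -> None:
        """Remember a recording made while offline."""
        items = self._load()
        items.append({"audio_path": audio_path, "metadata": metadata})
        self._save(items)

    def flush(self, upload: Callable[[dict], bool]) -> int:
        """Try every pending item; keep the ones whose upload returns False."""
        remaining, sent = [], 0
        for item in self._load():
            if upload(item):
                sent += 1
            else:
                remaining.append(item)
        self._save(remaining)
        return sent


queue = PendingUploads(Path(tempfile.mkdtemp()) / "pending.json")
queue.enqueue("rec_001.wav", {"language_code": "baba1266"})

sent_offline = queue.flush(lambda item: False)  # simulated: no connectivity
sent_online = queue.flush(lambda item: True)    # simulated: connectivity restored
```

Failed items stay in the queue file, so a recording made without connectivity is retried on the next flush rather than lost.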

Development Information

Will all of your project’s code be open-source? Yes
Leave a link where code will be accessible for review:
https://github.com/Chi-voice/voice-seed-vault

Do you agree to submit monthly progress reports?
Yes — we will submit reports on our progress here on the forum.

Have you developed a proof of concept for this idea already?
Yes, it can be accessed at https://chivr.tech/


Contact info

Email: [email protected]

Hello @Princess - thank you for the above proposal.

It looks like you’ve addressed the feedback from the Committee on your earlier proposals except for the recordings access point we were discussing in your previous application thread. Please make a point of specifying in the above post when and how you plan to allow for user access to the full recordings library (and not just their own). This part seems key to your core purpose of this project, by enabling users the ability to [store]:

that data securely and decentralized, giving researchers, developers, and communities access to ethically sourced language data.

And I recognize you acknowledge the need for explicit user consent for this portion in the Mitigations section but the implementation is just not clear.

With this edit completed by September 10, the proposal will be reviewed at the next Grants Committee Meeting on September 16.

Note: I’ve edited your proposal to reflect a new terminology change we’re implementing where we are asking everyone to refer to their developer work as “dev/developer fees” instead of “salary.”

Hello @mecsbecs thanks for helping with the edit.
I have implemented the required update to the proposal.

Thanks for your proposal to The Sia Foundation Grants Program.

After review, the committee has decided to reject your proposal citing the following reasons:

  • The proposal does not demonstrate how Sia is an appropriate fit for this project given Sia’s technical reality, namely, files cannot be stored automatically in perpetuity on Sia as the project requires.

We’ll be moving this to the Rejected section of the forum. Thanks again for your proposal, and you’re always welcome to submit new requests if you feel you can address the committee’s concerns.

Thanks @mecsbecs
I have received the committee’s feedback but I do have some statements to make…

  • I believe the proposal demonstrates how Sia fits the project. Unless I am mistaken, the committee is referring to the storage fees required to keep the data on the network after my grant expires. If that is the case, I would like to be told so that I can address it in my proposal.

  • I had hinted that we have a business model that would cover storage fees in the long run, but I did not include its details in the grant. I went through most of the successfully completed grants here and did not find that they elaborated on their business models, so I did the same.

  • Counting from my previously rejected proposal, the Chi grant has been on the docket for about a month. Given that time and the previous rejections, I was hoping the committee would bring every point that needed elaboration to my attention, as I had addressed the questions that were asked.

  • Lastly, we are very aware of the value of the data we are collecting and will collect, and we hoped the Foundation would see that too, as we believe we are aligned with its mission. If not, we should be told outright.

Thanks again.

Hey @mecsbecs

If this proposal has been rejected, could you please move it to the Rejected section? I am only asking to avoid confusion.

Thanks!