Standard Grant: Chi-Voice

Project Name: Chi, A Community-Powered Platform for Multilingual Audio Data Collection

Name of the organization or individual submitting the proposal: Princess Innocent

Describe your project

Overview

Chi is a lightweight, privacy-conscious platform designed to collect, organize, and store audio recordings of indigenous languages. The platform enables native speakers to record translations of English words, phrases, and sentences using their own voices, forming a foundational dataset for its future AI translation model.

The global south is home to over 5000 spoken languages, yet most speech AI systems ignore or underrepresent them — due to a lack of accessible, labeled, and inclusive audio datasets. Without data, these languages risk digital extinction.

Endangered languages are currently dying at an accelerated rate because of globalization, mass migration, cultural replacement, and linguicide. Approximately 454 known languages have become extinct in recent times, and over 3000 spoken languages (43% of the total) are considered endangered.

Existing efforts (like Mozilla’s Common Voice) barely scratch the surface of Asian and African language diversity and often rely on written text, which excludes non-literate speakers.

While providing users with a list of all spoken languages, Chi solves this by:

  • Empowering native speakers to record spoken translations of AI-generated prompts in their own languages.
  • Storing that data securely on decentralized storage, giving researchers, developers, and communities access to ethically sourced language data.

Chi web currently has:

  • 100+ users contributing
  • 400+ recordings
  • Over 90 languages from 3 continents recorded and counting.

These figures update automatically and can be viewed at the bottom of the home screen. A link to the proof of concept is provided below.


Who Benefits From Your Project?

  • Linguists and Researchers
  • Accessibility, Representation And Preservation of indigenous languages
  • AI And NLP Developers
  • Educational Institutions

Open Access To Library

Our recordings library will be openly accessible and structured for easy integration into ML pipelines, both for outside developers and for our own planned model training.

We will maintain a database to handle all metadata organization.

Each record will store:

  • Language

  • Language code (glottolog)

  • Prompt (the word/phrase/sentence they were asked to record)

  • CID (pointer to the recording)

  • Recording_text (the transcribed word/phrase/sentence in the target language; optional, if available). Indexed for fast queries.
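As a rough illustration of the record described above, here is a minimal Python sketch. The class name `RecordingRecord`, the sample prompt, and the CID value are assumptions made up for this example; only the five field names come from the list above.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json


@dataclass(frozen=True)
class RecordingRecord:
    """One metadata row; field names follow the proposal's list."""
    language: str
    language_code: str                    # Glottolog code, e.g. "baba1266"
    prompt: str                           # word/phrase/sentence the speaker was asked to record
    cid: str                              # pointer to the recording
    recording_text: Optional[str] = None  # transcription in the target language, if available


# The prompt and CID below are invented placeholders, not real data.
record = RecordingRecord(
    language="Babanki",
    language_code="baba1266",
    prompt="Good morning",
    cid="cid-placeholder",
)

# Serialize the record the way an API response might carry it.
print(json.dumps(asdict(record), indent=2))
```

Leaving `recording_text` optional mirrors the proposal: a recording can be stored and served before anyone has transcribed it.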

Develop a simple REST API:

Exposes data directly via REST endpoints.
Researchers/devs make HTTP requests to the API.

Query parameters:

language (e.g., “Babanki”)

language_code (e.g., “baba1266”)

type (word/phrase/sentence)

date_range (for collection period)

Returns JSON objects with metadata + CID

  • Researchers and devs query exactly what they need (e.g., one to ten in dhanki, all efifa phrases, all Marathi sentences), fetch metadata, and resolve audio files via S5 CID.

  • Directly fetch metadata + download links (via CID) without browsing a UI.

  • Always up-to-date, since the API pulls from the live database.
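To make the query behaviour concrete, the filtering step behind such an endpoint could look roughly like the sketch below. It is not the project's implementation: the `Recording` class, the `query_recordings` function, the `record_type` argument (standing in for the `type` parameter), and all sample rows, CIDs, and dates are assumptions for illustration. The Marathi language code is a placeholder rather than a real Glottolog code.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class Recording:
    language: str
    language_code: str
    type: str            # "word", "phrase", or "sentence"
    prompt: str
    cid: str
    collected_on: date


def query_recordings(
    rows,
    language: Optional[str] = None,
    language_code: Optional[str] = None,
    record_type: Optional[str] = None,               # maps to the `type` query parameter
    date_range: Optional[Tuple[date, date]] = None,  # inclusive (start, end)
):
    """Return JSON-ready dicts for rows matching every filter that was supplied."""
    matches = []
    for row in rows:
        if language and row.language.lower() != language.lower():
            continue
        if language_code and row.language_code != language_code:
            continue
        if record_type and row.type != record_type:
            continue
        if date_range and not (date_range[0] <= row.collected_on <= date_range[1]):
            continue
        item = asdict(row)
        item["collected_on"] = row.collected_on.isoformat()  # dates are not JSON-native
        matches.append(item)
    return matches


# Sample rows; prompts, CIDs, and dates are invented placeholders.
ROWS = [
    Recording("Babanki", "baba1266", "word", "one", "cid-placeholder-1", date(2025, 8, 1)),
    Recording("Babanki", "baba1266", "phrase", "good morning", "cid-placeholder-2", date(2025, 9, 1)),
    Recording("Marathi", "code-placeholder", "sentence", "where is the market?", "cid-placeholder-3", date(2025, 9, 15)),
]

babanki_words = query_recordings(ROWS, language_code="baba1266", record_type="word")
print(babanki_words)
```

Each returned dict carries the metadata plus the CID, so a client can resolve the audio separately without the API ever streaming the file itself.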


Freemium Model

This API will allow developers and researchers to query and retrieve labelled audio data stored on Sia (via S5).

The API will read from Chi-Voice’s metadata database, generate signed URLs/CIDs from the S5 node, and return JSON responses containing both metadata (language, prompt, glottolog code, etc.) and secure links to the audio stored on Sia.

We will adopt a freemium model:

  • Free Tier:
    All registered users receive an API key with a generous free quota (for instance 1,000 requests per month). This ensures students, independent researchers and small projects can explore the dataset at no cost.

  • Usage Tracking:
    Every API request includes the user’s API key. Our middleware tracks usage in real time, comparing each request against the user’s quota. This approach guarantees predictable server load and equitable access.

  • Paid Tiers:
    When a user exceeds the free quota, they are prompted to upgrade to a paid plan. Paid tiers unlock higher call volumes (e.g., 10,000+ calls/month) for institutional users, commercial projects and large-scale model training.

The API backend maintains:

a users table (ID, API key, plan, usage counters),

a plans table (quotas and pricing),

and middleware that enforces quotas, blocks overages, and logs usage.
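The quota check described above can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the planned backend: the names `PLANS`, `User`, `QuotaMiddleware`, and `QuotaExceeded` are hypothetical, the paid quota of 10,000 is taken from the "10,000+" example, and a real service would keep counters in the users table and reset them monthly.

```python
from dataclasses import dataclass

# Hypothetical plans table: monthly request quota per plan.
PLANS = {"free": 1_000, "paid": 10_000}


@dataclass
class User:
    api_key: str
    plan: str
    used_this_month: int = 0


class QuotaExceeded(Exception):
    """Raised when a request would go past the user's monthly quota."""


class QuotaMiddleware:
    def __init__(self, users):
        self._users = {user.api_key: user for user in users}
        self.log = []  # (api_key, outcome) pairs, standing in for usage logging

    def check(self, api_key: str) -> int:
        """Count one request and return how many remain; block overages."""
        user = self._users.get(api_key)
        if user is None:
            raise KeyError("unknown API key")
        quota = PLANS[user.plan]
        if user.used_this_month >= quota:
            self.log.append((api_key, "blocked"))
            raise QuotaExceeded(f"{user.plan} plan quota of {quota} requests reached")
        user.used_this_month += 1
        self.log.append((api_key, "allowed"))
        return quota - user.used_this_month


middleware = QuotaMiddleware([User(api_key="demo-key", plan="free", used_this_month=999)])
print(middleware.check("demo-key"))  # the 1,000th request is still allowed
try:
    middleware.check("demo-key")
except QuotaExceeded as exc:
    print("blocked:", exc)
```

Checking the counter before incrementing it means a blocked request never consumes quota, which keeps the usage numbers equal to the number of requests actually served.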

This freemium model allows Chi-Voice to remain openly accessible for grassroots researchers while creating a sustainable revenue stream from high-volume users. It also positions the project for long-term maintenance and growth.

Why Sia is the appropriate fit:

  • Sia provides a decentralized, cost-efficient archival storage layer with strong incentives for host stability. This aligns with our mission of making voice-data publicly accessible and citable.
  • Using Sia + S5 allows us to embed content identifiers (CIDs) into our metadata exporter, enabling reproducible dataset access and auditability (researchers can trace each file to the content hash).
  • Sia’s open-contract model supports long-term budgeting of storage at predictable rates (we model this in our cost plan). That enables us to commit funds ahead-of-time for contract renewals, aligning with our freemium API monetization plan.
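The auditability point above rests on a general property of content addressing: anyone holding the file can recompute its hash and compare it with the published identifier. The sketch below illustrates only that idea. It uses SHA-256 from Python's standard library as a stand-in and does not reproduce S5's actual CID format or hash function; the function names and the sample bytes are assumptions.

```python
import hashlib
import hmac


def content_hash(data: bytes) -> str:
    """Hex digest of the file contents (SHA-256 used here as a stand-in)."""
    return hashlib.sha256(data).hexdigest()


def matches_metadata(data: bytes, expected_hash: str) -> bool:
    """True if the downloaded bytes hash to the value stored in the metadata."""
    return hmac.compare_digest(content_hash(data), expected_hash)


# Placeholder bytes standing in for a downloaded audio file.
audio = b"placeholder bytes standing in for an audio file"
published = content_hash(audio)

print(matches_metadata(audio, published))                # True
print(matches_metadata(audio + b"tampered", published))  # False
```

A researcher citing the dataset could run a check like this on every file to confirm that what they trained on is what the metadata describes.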

How Does The Project Serve The Foundation’s Mission Of User-owned Data?

1. Decentralized Preservation of Cultural Knowledge

Indigenous languages are disappearing faster than they can be documented. By using Sia:

  • We store cultural data securely and immutably.
  • We consolidate the dataset required for model training.
  • The project demonstrates how Web3 tools can protect heritage, not just finance, and we hope the Foundation sees its value and potential.

2. Data Ownership for Indigenous Contributors

Chi is designed so that native speakers contribute voice recordings with full knowledge and consent — and their contributions are stored on Sia’s decentralized network.

This ensures:

  • Transparency: Contributors can verify and access the content they help create.
  • Autonomy: No corporation, government, or institution can lock or alter the cultural data once it’s on Sia.

3. Model for Future Decentralized Datasets

Chi Voice will serve as a replicable framework for other regions and cultures to follow.

By showing how Sia can power large-scale, ethically sourced voice datasets, we:

  • Encourage developers and researchers to use Sia for decentralized data hosting
  • Create momentum for a new standard of AI dataset sovereignty

Are you a resident of any jurisdiction on that list? No
Will your payment bank account be located in any jurisdiction on that list? No


Grant Specifics

Amount of money requested and justification with a reasonable breakdown of expenses:

Use of Funds: $35,000 requested

Item | Amount (USD)
Developer fees (full-stack platform development) | $32,000
OpenAI (GPT-5) API fees (estimated 25 tokens per output, $75/m) | $1,000
Storage fees (1 year) | $1,000
App platform fees | $125
Web + server hosting (1 year) | $375
Logo design | $500
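As a quick arithmetic check, the line items above can be summed against the requested amount. The snippet below is only a verification aid; the shortened item labels are paraphrases of the budget rows.

```python
# Line items from the budget above, in USD.
budget = {
    "Developer fees": 32_000,
    "OpenAI API fees": 1_000,
    "Storage fees (1 year)": 1_000,
    "App platform fees": 125,
    "Web + server hosting (1 year)": 375,
    "Logo design": 500,
}
requested = 35_000

total = sum(budget.values())
assert total == requested, f"line items sum to {total}, not {requested}"

# Show each item's share of the request.
for item, amount in budget.items():
    print(f"{item}: ${amount:,} ({amount / requested:.1%} of request)")
print(f"Total: ${total:,}")
```

The items add up to exactly $35,000, with developer fees making up the large majority of the request.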

What are the goals of this standard grant? Please provide a general timeline for completion.

Our goals are:

  • Develop a mobile application for the Chi platform.
  • Improve the existing web-app, UI/UX and add interesting features.
  • Integrate Sia storage.
  • Develop a simple public API for developer access to the recordings library.

Month 1-2: Mobile App Development (Cross-Platform)

  • Define mobile-specific features and UI changes.
  • Implement UI design.

Month 2:

  • Integrate audio input/output for mobile.
  • Implement push notifications.
  • Add offline mode & language caching.
  • Connect with back-end.

Month 3:

Sia-Storage Integration

  • Integrate Sia storage
  • Develop a simple REST API to act as a public-facing portal for library access
  • Move and store existing and future Audio files via S5

Month 4: Testing and Fixes

  • Optimize for performance and battery use
  • Internal QA and bug fixes.
  • Beta release to test group.

Month 5: App Release

  • App Store & Play Store listing setup.
  • Official app launch

Potential risks that will affect the outcome of the project:

  • Low Participation in Rare Languages
    Some indigenous languages may have few active speakers, limiting dataset diversity.

  • Poor Audio Quality
    Background noise or unclear recordings may affect usability of submissions.

  • Incorrect Language Labeling
    Users may misidentify dialects, leading to inaccurate metadata.

  • Internet Access Constraints
    Contributors in rural areas may face challenges uploading recordings due to weak connectivity.

  • Legal Risk
    Data privacy and consent requirements may expose the project to legal liability.


Mitigations

  • Partner with local communities, NGOs and language groups to drive targeted outreach.
  • Provide in-app audio quality checks and guides for optimal recording.
  • Use verification by native speakers and cross-check with multiple submissions of the same language.
  • Enable offline recording with later upload when connectivity improves.
  • Obtain explicit user consent, provide clear terms of use and comply with data protection laws.

Development Information

Will all of your project’s code be open-source? Yes
Leave a link where code will be accessible for review:
https://github.com/Chi-voice/voice-seed-vault

Do you agree to submit monthly progress reports?
Yes — we will submit reports on our progress here on the forum.

Have you developed a proof of concept for this idea already?
Yes, it can be accessed at https://chivr.tech/


Contact info

Email: [email protected]

Hello!

So I had some questions about this project!

  • Why exactly would you use Sia? If we want to store a language database for current and future scientists, wouldn’t a more permanent solution be better, like Arweave or some other perma-storage chain?

  • You have mentioned mainly full stack development in your budget, which as far as I know refers to website development. But then you mention mobile development in your timeline. Am I misunderstanding?

  • How do you plan to pay the speakers? It is no secret that people would be more willing to download your app and provide data, if you pay them. It will also mitigate the legal risk, since it would form a contract to use their voice.

  • What exactly do you intend to use GPT-5 for? And wouldn’t it be better to use a local LLM, so as to protect the data?

  • How exactly do you intend to store the user tables and all that? Since to make it truly decentralized, you need to store it in a Sia-powered database, and is there a product similar to it, or do you intend to make one?

Kudos!

Something like this does not exist and is not realistic to create right now. There was one db server that got a backend created that used object storage natively (Neon), but that is an edge case in a way. There is also slatedb.io but that is more of a KV DB.

It is best to focus on decentralizing storage right now and not trying to decentralize everything. Been down that road and the tech isn’t ready as a whole.

Yes exactly, but if the intended metadata would be stored on a centralized provider, then the whole purpose of decentralized storage would be lost. It would be similar to losing the boot sector of a CD: you have the entire data with you, but you can’t access it. Similarly, if the “evil guys” can’t attack or control the data, they will just attack the metadata, defeating the entire purpose.

It can just be stored locally as well? At present the metadata for Sia in renterd is basically a middleman who sees everything. indexd changes a lot of that.

So, I think you’re looking at some stuff wrong regarding that. Anyway, I would suggest you ask about these things in more detail, if you wish, once Kino sorts out your Discord issues.

Let’s not derail this grant request.

Kudos.

As I said, I had some questions, and this was one of them, not something that might derail a project. If it had been a point of issue, it would have been addressed. If it was not, I would learn something new, which happened to be the case. Anyway, thanks for the info!

Kudos!

If you have criticism for this grant, feel free to share, but a debate/discussion about the database thing, IMO gets off topic to the grant request too much and is best for education/exploring in discord.

my 2C.

Hello @Princess -

I’m not seeing how this version of the proposal addresses the core issue raised most recently by the Committee and by other community members here, namely the suitability of Sia for this project.

If you’re able to address this by tomorrow Thurs. Sept. 25 by 10am EST then I’ll be able to present this version of your proposal at next week’s Committee meeting.

Hello @mecsbecs
I believe I was addressing the committee’s concern…

by adding the details of our business model that would cover for subsequent storage fees that’ll keep our data on the network.

If I missed the point, please elaborate.
Thank you

So, the thing is, it’s not just HOW you will store files in perpetuity, it is a question of WHY you would use Sia to store files in perpetuity, as far as I understand.

hello again @mecsbecs
I also haven’t been able to join the Discord, as the link appears to be invalid.
Can you add me? My username is princessij.

Please try The Sia Foundation.

Hello,

Please attempt to join the Discord server again.

Regards,
Kino

Good day @mecsbecs
I was expecting a response to my earlier question but did not get one.
I have updated my grant with the details I believe answer the question raised by the committee.

Hi @Princess - the follow-up was best encapsulated by @Gravity-3d’s response here from Sep 25:

So, the thing is, it’s not just HOW you will store files in perpetuity, it is a question of WHY you would use Sia to store files in perpetuity, as far as I understand.

Which I’m still not seeing reflected in the above proposal. Also, it’s getting a bit difficult to track changes at this point, especially within the same proposal, so please outline any new additions in ‘Reply’ to your post for ease of reference.

Hello again @mecsbecs

I added this section

in response to the question of Sia being an appropriate fit.

How we intend to keep and maintain data on the network over time, is clearly stated in the proposal. Standard Grant: Chi-Voice

I believe the answer to the question you quoted is in the proposal.
Besides, there’s no mention of an intent to “just” store data in “perpetuity” anywhere in the proposal, as the question implies.

I’d appreciate it if a few very active members of the forum gave their 2 cents on whether the proposal answers the questions. @Covalent @mike76 @pcfreak30 :)

To conclude, I hope my current proposal gets on the docket for review by the committee, so I can get closure.

So. Here are my thoughts:

  • This should probably start off as a small grant per the status quo of what the committee tends to request for new people and new ideas
  • I understand you are trying to respond to the perpetuity argument, but this is only being brought up due to your earlier proposal requests using similar terminology and ensuring you understand Sia ant arweave.
  • Can you clarify if this is intended to be a sort of centralized Web2 SaaS/web service that uses Sia and content-addressed storage?
  • I would recommend you not try to use S5 right now, because what many proposals that use it are really relying on is the legacy implementation redsolver created about a year ago, which has since had at least one do-over. I would personally not get involved in using S5 until at least his Vup web is fully up to date using it and there have been some beta testers. The legacy implementation is effectively abandoned and should not be used unless someone intends to fork the ideas and create their own P2P system.
  • In the same vein, if you want to use CID-based storage rather than renterd, which could be via fsd or other means, I would recommend using IPFS, since there are now a few paths with Sia for making use of it and it has a mature ecosystem.

So, my concerns are largely around the scope of the project and the status quo for that, and what data rails you plan to use to upload to Sia.

Kudos.