Small Grant: Chi-voice pilot

Project Name

Chi Voice (Pilot): Community-Collected Multilingual Audio Dataset

Name Of The Organization Or Individual Submitting The Proposal
Princess Innocent


Describe Your Project

Overview

Chi Voice is a lightweight pilot platform for collecting and organizing short audio recordings of underrepresented and indigenous languages. Native speakers contribute spoken translations of simple English prompts (words, phrases, or sentences), creating a structured, ethically sourced audio dataset for linguistic research and early-stage speech AI development.

Many languages in the Global South lack even minimal speech datasets, not because of complexity but because of tooling barriers. Chi Voice focuses narrowly on collection, labeling, and verifiable storage of audio samples, rather than attempting to solve long-term hosting, model training, or large-scale distribution in this phase.

This proposal funds a small, well-defined pilot that demonstrates how Sia can be used as a content-addressed archival backend for reproducible language datasets, while keeping application logic intentionally simple.

https://chivr.tech/

This grant focuses on hardening and formalizing the prototype into a clean, auditable dataset pipeline.


Who Benefits

  • Linguists & language researchers – access to rare, labeled speech samples
  • AI & NLP developers – bootstrapping data for low-resource languages
  • Educational institutions – open datasets for study and preservation
  • Language communities – representation and digital preservation of spoken heritage

Technical Scope

Architecture Philosophy

Chi Voice is intentionally designed as a centralized web service (Web2-style) that uses Sia as a decentralized, content-addressed storage layer, not as a fully decentralized application.

This keeps the system simple, auditable, and aligned with Sia’s technical reality.

Storage & Data Flow

  • Audio files are uploaded to Sia using the new S5 TypeScript gateway client as a thin integration layer.

  • This library provides a stable, developer-friendly interface for content-addressed uploads and CID resolution while delegating all storage guarantees and contract management to renterd.

  • Chi Voice does not rely on S5 for peer-to-peer networking or experimental protocol features; it is used strictly as an application-layer client for interacting with Sia-backed storage.

  • The application never claims perpetual or automatic storage guarantees

    • Storage is budgeted, renewable, and explicit

Metadata Stored Per Recording

  • Language name
  • Language code (Glottolog)
  • Prompt (word / phrase / sentence)
  • CID (content-addressed pointer to audio)
  • Optional transcription (when available)
  • Timestamp
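As a rough sketch, the fields above could be captured in a typed record with a small validator. The field names follow the example API response in this proposal; `prompt_type`, `validateRecording`, and the glottocode pattern (four alphanumeric characters followed by four digits) are assumptions for illustration, not the project's published schema.

```typescript
// Hypothetical record shape; not the project's published schema.
type PromptType = "word" | "phrase" | "sentence";

interface RecordingMetadata {
  language: string;        // e.g. "Babanki"
  language_code: string;   // Glottolog code, e.g. "baba1266"
  prompt: string;          // English prompt shown to the speaker
  prompt_type: PromptType; // assumed field name
  cid: string;             // content-addressed pointer to the audio
  recording_text?: string; // optional transcription
  created_at: string;      // date string, e.g. "2025-01-12"
}

// Assumed glottocode pattern: four alphanumerics followed by four digits.
const GLOTTOCODE = /^[a-z0-9]{4}\d{4}$/;

// Returns a list of problems; an empty list means the record looks valid.
function validateRecording(r: RecordingMetadata): string[] {
  const errors: string[] = [];
  if (r.language.trim() === "") errors.push("language is empty");
  if (!GLOTTOCODE.test(r.language_code)) errors.push("language_code is not a glottocode");
  if (r.prompt.trim() === "") errors.push("prompt is empty");
  if (r.cid.trim() === "") errors.push("cid is empty");
  if (Number.isNaN(Date.parse(r.created_at))) errors.push("created_at is not a date");
  return errors;
}
```

Validating at ingestion time is one way to reduce the metadata errors listed under Risks & Mitigations before a record reaches the public dataset.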

Public REST API (Read-Only)

A simple REST API exposes metadata and storage pointers:

Example query parameters

  • language
  • language_code
  • type (word / phrase / sentence)
  • date_range

Response

{
  "language": "Babanki",
  "language_code": "baba1266",
  "prompt": "Good morning",
  "cid": "sia://...",
  "recording_text": "optional",
  "created_at": "2025-01-12"
}

Researchers resolve audio files directly via CID through standard gateways.

The API remains read-only, reducing scope, cost, and operational risk.
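The query parameters above could map onto a simple server-side filter. This is a minimal sketch under assumptions: the `RecordingRow` and `RecordingQuery` types, the `filterRecordings` name, and the `{ from, to }` encoding of `date_range` are hypothetical, and dates are assumed to be stored as `YYYY-MM-DD` strings.

```typescript
// Hypothetical row and query shapes for the read-only API.
interface RecordingRow {
  language: string;
  language_code: string;
  type: "word" | "phrase" | "sentence";
  prompt: string;
  cid: string;
  created_at: string; // assumed YYYY-MM-DD
}

interface RecordingQuery {
  language?: string;
  language_code?: string;
  type?: RecordingRow["type"];
  date_range?: { from: string; to: string }; // assumed inclusive bounds
}

// Keeps only the rows that match every parameter present in the query.
function filterRecordings(rows: RecordingRow[], q: RecordingQuery): RecordingRow[] {
  return rows.filter((r) => {
    if (q.language && r.language.toLowerCase() !== q.language.toLowerCase()) return false;
    if (q.language_code && r.language_code !== q.language_code) return false;
    if (q.type && r.type !== q.type) return false;
    if (q.date_range) {
      // YYYY-MM-DD strings sort chronologically, so string comparison is enough.
      if (r.created_at < q.date_range.from || r.created_at > q.date_range.to) return false;
    }
    return true;
  });
}
```

Because the API is read-only, a pure filter like this has no side effects and is easy to test against a fixed sample of records.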


API Access Model (Freemium)

Chi Voice provides API access using a lightweight freemium model:

Free Tier

  • API key for all registered users
  • Generous free quota (e.g. 1,000 requests/month)
  • Designed for students, linguists, and small research projects

Paid Tier

  • Higher request limits for institutional or commercial users
  • Enables sustainable maintenance without restricting access to data

Enforcement

  • API keys tracked per user
  • Simple usage counters and quota enforcement
  • No complex billing or on-chain logic in this phase
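One way to picture the "simple usage counters" is an in-memory tracker keyed by API key and calendar month. This is only an illustrative sketch: `QuotaTracker` and `tryConsume` are hypothetical names, and a real deployment would presumably persist counts in a database rather than in process memory.

```typescript
// Hypothetical per-key monthly quota counter (in-memory, for illustration only).
class QuotaTracker {
  private counts = new Map<string, number>();
  private monthlyLimit: number;

  constructor(monthlyLimit: number) {
    this.monthlyLimit = monthlyLimit;
  }

  // Counts the request and returns true if the key is still under its limit
  // for the UTC month containing `now`; returns false once the limit is hit.
  tryConsume(apiKey: string, now: Date): boolean {
    const bucket = `${apiKey}:${now.getUTCFullYear()}-${now.getUTCMonth() + 1}`;
    const used = this.counts.get(bucket) ?? 0;
    if (used >= this.monthlyLimit) return false;
    this.counts.set(bucket, used + 1);
    return true;
  }
}
```

Keying the counter by month means quotas reset without a scheduled job: a new month simply starts a new bucket.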

Why Sia Is the Right Fit (Pilot Framing)

Sia is used as a verifiable, content-addressed archival layer, not as a promise of perpetual storage.

  • Each audio file is uniquely identified by its CID
  • Researchers can verify dataset integrity independently
  • Storage costs are predictable and budgeted in advance
  • Contracts can be renewed transparently as the dataset grows

This approach mirrors how researchers already treat physical archives: explicit funding, explicit renewal, and auditability.
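The integrity check described above amounts to re-hashing a downloaded file and comparing the result with a trusted digest. The sketch below uses SHA-256 from Node's built-in `crypto` module purely as a stand-in: how an S5 CID encodes its hash is defined by the S5 client, and this example does not parse real CIDs. The helper names are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of the given bytes (stand-in for the CID's own hash).
function sha256Hex(data: Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// True when the downloaded bytes hash to the digest the researcher expects.
function matchesDigest(data: Uint8Array, expectedHex: string): boolean {
  return sha256Hex(data) === expectedHex.toLowerCase();
}
```

With a content-addressed store, the expected digest comes from the identifier itself, so a researcher does not need to trust the gateway that served the file.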


How Does The Project Serve The Foundation’s Mission Of User-owned Data?

1. Decentralized Preservation of Cultural Knowledge

Indigenous languages are disappearing faster than they can be documented. By using Sia:

  • We store cultural data securely and immutably.
  • We consolidate the datasets required for model training.
  • The project demonstrates how Web3 tools can protect heritage, not just finance, and we hope the Foundation sees its value and potential.

2. Data Ownership for Indigenous Contributors

Chi Voice is designed so that native speakers contribute voice recordings with full knowledge and consent, and their contributions are stored on Sia’s decentralized network.

This ensures:

  • Transparency: Contributors can verify and access the content they help create.
  • Autonomy: No corporation, government, or institution can lock or alter the cultural data once it’s on Sia.

3. Model for Future Decentralized Datasets

Chi Voice will serve as a replicable framework for other regions and cultures to follow.

By showing how Sia can power large-scale, ethically sourced voice datasets, we:

  • Encourage developers and researchers to use Sia for decentralized data hosting
  • Create momentum for a new standard of AI dataset sovereignty

Are you a resident of any jurisdiction on that list? No
Will your payment bank account be located in any jurisdiction on that list? No

Grant Specifics

Amount of money requested and justification with a reasonable breakdown of expenses:

Total Requested: $10,000

Item | Amount (USD)
Developer fees | $8,000
OpenAI (GPT-5) API fees (estimated 25 tokens per output, $75/m) | $1,000
Web hosting & storage fees (first 12 months only) | $1,000

What are the goals of this small grant? Please provide a general timeline for completion.

Our goals are:

  • Improve the existing web app and user interface.
  • Integrate Sia storage via the S5 TypeScript client.
  • Develop a simple public API that gives developers access to the recordings library.

Month 1

  • Finalize data schema
  • Integrate Sia storage via the S5 TypeScript client
  • Generate stable CIDs
  • Test storage integration

Month 2

  • Build public read-only REST API
  • Implement API keys + quota enforcement
  • Test API functionality

Month 3

  • Public dataset release
  • Documentation and example queries
  • Test and Release

Risks & Mitigations

  • Low participation in rare languages

    • Targeted outreach and focused prompts
  • Audio quality variance

    • Client-side recording guidance
  • Metadata errors

    • Community review and duplicate sampling
  • Connectivity issues

    • Short recordings and retry-friendly uploads
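"Retry-friendly uploads" could be approached with capped exponential backoff. The following is a sketch under assumptions: `backoffDelays` and `uploadWithRetry` are hypothetical helpers, `upload` stands in for whatever call the S5 TypeScript client actually exposes (its real API is not shown here), and the caller supplies `sleep` so the example stays independent of any runtime timer.

```typescript
// Delays (in ms) to wait between attempts: base, 2x base, 4x base, ... capped at capMs.
// maxAttempts attempts need maxAttempts - 1 waits.
function backoffDelays(maxAttempts: number, baseMs: number, capMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < maxAttempts - 1; i++) {
    delays.push(Math.min(capMs, baseMs * 2 ** i));
  }
  return delays;
}

// Calls `upload` until it succeeds or the delay schedule is exhausted.
// `upload` is a placeholder that resolves to a CID string on success.
async function uploadWithRetry(
  upload: () => Promise<string>,
  delays: number[],
  sleep: (ms: number) => Promise<void>,
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= delays.length; attempt++) {
    try {
      return await upload();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, unless this was the final one.
      if (attempt < delays.length) await sleep(delays[attempt]);
    }
  }
  throw lastError;
}
```

Short recordings keep each attempt cheap, so a contributor on an unstable connection loses little when an attempt fails and is retried.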

Development Information

Will all of your project’s code be open-source? Yes
Leave a link where code will be accessible for review:
https://github.com/Chi-voice/voice-seed-vault

Do you agree to submit monthly progress reports?
Yes — we will submit reports on our progress here on the forum.


Contact info

Email: [email protected]


Thank you for your proposal @Princess! This will be presented at next Tuesday’s Grants Committee meeting and a response will be posted here before the end of next week.

Something I would consider for evaluation is the focus on renterd here, even though it’s indirect via S5. Ironically, redsolver requesting a grant for Vup, which implicitly solves the indexd aspect for S5, means it’s somewhat moot.

But there are inner-ecosystem dependencies here that should be taken into account, and the grant should not focus on renterd if it cannot be adapted to indexd with minimal effort (and that might be a redsolver question?).

Lastly, be sure the S5 work targets the TS and Rust S5 iteration that redsolver & jules just completed, and not the legacy v0 version from 2024.

Kudos.

Your insight is always appreciated.
We’re in the right ballpark.
I’ll be working with the S5 TS client that jules developed and recently completed.