Project Name
Chi Voice (Pilot): Community-Collected Multilingual Audio Dataset
Name Of The Organization Or Individual Submitting The Proposal
Princess Innocent
Describe Your Project
Overview
Chi Voice is a lightweight pilot platform for collecting and organizing short audio recordings of underrepresented and indigenous languages. Native speakers contribute spoken translations of simple English prompts (words, phrases, or sentences), creating a structured, ethically sourced audio dataset for linguistic research and early-stage speech AI development.
Many languages in the Global South lack even minimal speech datasets, not because of complexity but because of tooling barriers. Chi Voice focuses narrowly on collection, labeling, and verifiable storage of audio samples, rather than attempting to solve long-term hosting, model training, or large-scale distribution in this phase.
This proposal funds a small, well-defined pilot that demonstrates how Sia can be used as a content-addressed archival backend for reproducible language datasets, while keeping application logic intentionally simple.
This grant focuses on hardening and formalizing the prototype into a clean, auditable dataset pipeline.
Who Benefits
- Linguists & language researchers – access to rare, labeled speech samples
- AI & NLP developers – bootstrapping data for low-resource languages
- Educational institutions – open datasets for study and preservation
- Language communities – representation and digital preservation of spoken heritage
Technical Scope
Architecture Philosophy
Chi Voice is intentionally designed as a centralized web service (Web2-style) that uses Sia as a decentralized, content-addressed storage layer, not as a fully decentralized application.
This keeps the system simple, auditable, and aligned with Sia’s technical reality.
Storage & Data Flow
- Audio files are uploaded to Sia using the new S5 TypeScript gateway client as a thin integration layer.
- This library provides a stable, developer-friendly interface for content-addressed uploads and CID resolution while delegating all storage guarantees and contract management to renterd.
- Chi Voice does not rely on S5 for peer-to-peer networking or experimental protocol features; it is used strictly as an application-layer client for interacting with Sia-backed storage.
- The application never claims perpetual or automatic storage guarantees
  - Storage is budgeted, renewable, and explicit
Metadata Stored Per Recording
- Language name
- Language code (Glottolog)
- Prompt (word / phrase / sentence)
- CID (content-addressed pointer to audio)
- Optional transcription (when available)
- Timestamp
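As a rough illustration of the schema, the fields above could be typed and checked before a record is saved. This is a minimal sketch, not the project's actual code: the field names follow the example API response in this proposal, while the `type` field, the `validateRecording` helper, and the specific checks are assumptions made for illustration. The Glottolog check relies on the published glottocode shape (four lowercase alphanumeric characters followed by four digits).

```typescript
// Illustrative sketch only. Field names mirror the example API response;
// the optional `type` field and the validation rules are assumptions.
type PromptType = "word" | "phrase" | "sentence";

interface RecordingMetadata {
  language: string;        // Language name, e.g. "Babanki"
  language_code: string;   // Glottolog code, e.g. "baba1266"
  prompt: string;          // English prompt shown to the speaker
  type?: PromptType;       // Assumed field: word / phrase / sentence
  cid: string;             // Content-addressed pointer to the audio file
  recording_text?: string; // Optional transcription
  created_at: string;      // Timestamp, e.g. "2025-01-12"
}

// Glottocodes: four lowercase alphanumeric characters, then four digits.
const GLOTTOCODE_PATTERN = /^[a-z0-9]{4}[0-9]{4}$/;

// Returns a list of human-readable problems; an empty list means the record passed.
function validateRecording(record: RecordingMetadata): string[] {
  const errors: string[] = [];
  if (record.language.trim() === "") {
    errors.push("language is required");
  }
  if (!GLOTTOCODE_PATTERN.test(record.language_code)) {
    errors.push("language_code does not look like a Glottolog code");
  }
  if (record.prompt.trim() === "") {
    errors.push("prompt is required");
  }
  if (record.cid.trim() === "") {
    errors.push("cid is required");
  }
  if (Number.isNaN(Date.parse(record.created_at))) {
    errors.push("created_at is not a parseable date");
  }
  return errors;
}
```

Running a check like this at submission time would catch malformed language codes early, which complements the community review step for metadata errors.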
Public REST API (Read-Only)
A simple REST API exposes metadata and storage pointers:
Example query parameters
- language
- language_code
- type (word / phrase / sentence)
- date_range
Response
{
"language": "Babanki",
"language_code": "baba1266",
"prompt": "Good morning",
"cid": "sia://...",
"recording_text": "optional",
"created_at": "2025-01-12"
}
Researchers resolve audio files directly via CID through standard gateways.
The API remains read-only, reducing scope, cost, and operational risk.
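To make the query parameters concrete, here is a small sketch of how a client might build a request URL. The base URL, the `/recordings` path, the `buildRecordingsUrl` helper, and the `from,to` encoding of `date_range` are all hypothetical; the proposal only names the parameters, not the final endpoint layout.

```typescript
// Illustrative sketch only: the endpoint path and date_range format are assumptions.
interface RecordingQuery {
  language?: string;
  language_code?: string;
  type?: "word" | "phrase" | "sentence";
  date_range?: { from: string; to: string }; // Assumed to be sent as "from,to"
}

function buildRecordingsUrl(baseUrl: string, query: RecordingQuery): string {
  // Collect only the parameters the caller actually supplied.
  const pairs: Array<[string, string]> = [];
  if (query.language) pairs.push(["language", query.language]);
  if (query.language_code) pairs.push(["language_code", query.language_code]);
  if (query.type) pairs.push(["type", query.type]);
  if (query.date_range) {
    pairs.push(["date_range", `${query.date_range.from},${query.date_range.to}`]);
  }

  // Percent-encode keys and values so spaces and punctuation are safe in a URL.
  const queryString = pairs
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");

  // Strip trailing slashes from the base so the path is joined cleanly.
  const endpoint = `${baseUrl.replace(/\/+$/, "")}/recordings`;
  return queryString === "" ? endpoint : `${endpoint}?${queryString}`;
}
```

For example, `buildRecordingsUrl("https://api.example.org", { language_code: "baba1266", type: "phrase" })` yields `https://api.example.org/recordings?language_code=baba1266&type=phrase`, where `api.example.org` is a placeholder host.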
API Access Model (Freemium)
Chi Voice provides API access using a lightweight freemium model:
Free Tier
- API key for all registered users
- Generous free quota (e.g. 1,000 requests/month)
- Designed for students, linguists, and small research projects
Paid Tier
- Higher request limits for institutional or commercial users
- Enables sustainable maintenance without restricting access to data
Enforcement
- API keys tracked per user
- Simple usage counters and quota enforcement
- No complex billing or on-chain logic in this phase
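The usage counters and quota enforcement described above could be as simple as a per-key, per-month counter. The sketch below assumes an in-memory store and a hypothetical `QuotaTracker` class; a real deployment would likely persist counts in the application database, and the limit shown in the usage note is just the example figure from the free tier.

```typescript
// Illustrative sketch only: in-memory per-key monthly counters (names are hypothetical).
class QuotaTracker {
  private readonly monthlyLimit: number;
  private readonly counts = new Map<string, number>();

  constructor(monthlyLimit: number) {
    this.monthlyLimit = monthlyLimit;
  }

  // Counters are bucketed by API key and UTC calendar month,
  // so usage resets implicitly when a new month begins.
  private bucketFor(apiKey: string, now: Date): string {
    return `${apiKey}:${now.getUTCFullYear()}-${now.getUTCMonth() + 1}`;
  }

  // Records one request and returns true, or returns false if the quota is exhausted.
  tryConsume(apiKey: string, now: Date): boolean {
    const bucket = this.bucketFor(apiKey, now);
    const used = this.counts.get(bucket) ?? 0;
    if (used >= this.monthlyLimit) {
      return false;
    }
    this.counts.set(bucket, used + 1);
    return true;
  }

  // Number of requests the key can still make in the month containing `now`.
  remaining(apiKey: string, now: Date): number {
    const used = this.counts.get(this.bucketFor(apiKey, now)) ?? 0;
    return Math.max(0, this.monthlyLimit - used);
  }
}
```

A free-tier tracker would then be created with `new QuotaTracker(1000)`, and the API handler would reject a request whenever `tryConsume` returns `false`.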
Why Sia Is the Right Fit (Pilot Framing)
Sia is used as a verifiable, content-addressed archival layer, not as a promise of perpetual storage.
- Each audio file is uniquely identified by its CID
- Researchers can verify dataset integrity independently
- Storage costs are predictable and budgeted in advance
- Contracts can be renewed transparently as the dataset grows
This approach mirrors how researchers already treat physical archives: explicit funding, explicit renewal, and auditability.
How Does The Project Serve The Foundation’s Mission Of User-owned Data?
1. Decentralized Preservation of Cultural Knowledge
Indigenous languages are disappearing faster than they can be documented. By using Sia:
- We store cultural data in content-addressed form, so any alteration is detectable.
- We consolidate the datasets needed for model training.
- The project demonstrates how Web3 tools can protect heritage, not just finance, and we hope the Foundation sees its value and potential.
2. Data Ownership for Indigenous Contributors
Chi Voice is designed so that native speakers contribute voice recordings with full knowledge and consent, and their contributions are stored on Sia’s decentralized network.
This ensures:
- Transparency: Contributors can verify and access the content they help create.
- Autonomy: No corporation, government, or institution can lock or alter the cultural data once it’s on Sia.
3. Model for Future Decentralized Datasets
Chi Voice will serve as a replicable framework for other regions and cultures to follow.
By showing how Sia can power large-scale, ethically sourced voice datasets, we:
- Encourage developers and researchers to use Sia for decentralized data hosting
- Create momentum for a new standard of AI dataset sovereignty
Are you a resident of any jurisdiction on that list? No
Will your payment bank account be located in any jurisdiction on that list? No
Grant Specifics
Amount of money requested and justification with a reasonable breakdown of expenses:
Total Requested: $10,000
| Item | Amount (USD) |
|---|---|
| Developer fees | $8,000 |
| OpenAI (GPT-5) API fees (estimated 25 tokens per output, $75/m) | $1,000 |
| Web hosting & storage fees (first 12 months only) | $1,000 |
What are the goals of this small grant? Please provide a general timeline for completion.
Our goals are:
- Improve the existing web app and user interface.
- Integrate Sia storage via the S5 TypeScript client.
- Develop a simple public API that gives developers access to the recordings library.
Month 1
- Finalize data schema
- Integrate Sia storage via the S5 TypeScript client
- Generate stable CIDs
- Test storage integration
Month 2
- Build public read-only REST API
- Implement API keys + quota enforcement
- Test API functionality
Month 3
- Public dataset release
- Documentation and example queries
- Test and Release
Risks & Mitigations
- Low participation in rare languages
  - Targeted outreach and focused prompts
- Audio quality variance
  - Client-side recording guidance
- Metadata errors
  - Community review and duplicate sampling
- Connectivity issues
  - Short recordings and retry-friendly uploads
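One way to make uploads retry-friendly on unreliable connections is to wrap the upload call in a retry loop with exponential backoff. The sketch below is an assumption about how this could be done, not a description of the existing code: `withRetry` and `backoffDelayMs` are hypothetical helpers, and the `operation` argument stands in for whatever function actually performs the upload.

```typescript
// Illustrative sketch only: generic retry with capped exponential backoff.

// Delay before the next attempt: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Default sleep based on setTimeout; tests can inject a no-op instead.
function defaultSleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Runs `operation` up to `maxAttempts` times, waiting between failed attempts.
// Rethrows the last error if every attempt fails.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  baseMs: number,
  maxMs: number,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  if (maxAttempts < 1) {
    throw new RangeError("maxAttempts must be at least 1");
  }
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Only wait if another attempt will follow.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs, maxMs));
      }
    }
  }
  throw lastError;
}
```

Because recordings are short, each attempt re-sends a small file, so a handful of retries with delays such as 500 ms, 1 s, and 2 s should add little overhead for contributors on slow networks.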
Development Information
Will all of your project’s code be open-source? Yes
Leave a link where code will be accessible for review:
https://github.com/Chi-voice/voice-seed-vault
Do you agree to submit monthly progress reports?
Yes — we will submit reports on our progress here on the forum.
Contact info
Email: [email protected]