Small Grant: Engram - Open Biosignal Dataset Preservation and Sharing on Sia v2

Introduction

Project Name: Engram - Open Biosignal Dataset Preservation and Sharing on Sia

Name of the organization or individual submitting the proposal: Jackie Tan, based in Singapore. Two-time Stellar Community Fund recipient (cross-border payments tooling, both delivered; link). Web3 developer since 2018; authored an L3 blockchain protocol. Serial entrepreneur in the fintech and data science space (4x founder). PhD in biology and data science. Currently an active brain-computer interface researcher building neural intent decoders for people with disabilities.

Personal site with list of 40+ applications built: apps — jackie tan yen

LinkedIn: https://www.linkedin.com/in/jackietanyen/

Experience building on Sia: Pastepin, a decentralized Pastebin (GitHub: Randomacy/Pastepin). It demonstrates client-side encrypted upload/download via the Sia Storage SDK against the hosted sia.storage indexer, and object retrieval by key. A video demo is in the repository as well.

Describe your project.

Engram is an open-source platform for preserving and sharing scientific biosignal datasets, starting with EEG data, built natively on the Sia Storage SDK and indexd.

Two experiences motivated this. First: during my research, I contacted the author of a published electroencephalogram (EEG) dataset because the full recordings were missing from the repository. He told me GitHub couldn’t store files that large, so he uploaded only the processed output. The raw data, the only part useful for replication, was never published. It still doesn’t exist anywhere. Second: Yahoo unilaterally deleted a decade of my personal emails, including correspondence with people who are no longer around. All of it is permanently gone. A platform made a business decision and my data went with it.

These aren’t edge cases. Neuroscience labs lose EEG recordings to server failures and institutional account lapses. Researchers publish processed summaries because no infrastructure exists for the raw files. Decades of irreplaceable human neural recordings disappear because there is no domain-specific, researcher-controlled archival layer for biosignal data.

Engram solves this with three pieces: an ingest pipeline built on the Sia Storage SDK uploads EEG files (EDF/BDF/GDF) via indexd; a metadata layer attaches provenance, recording parameters, and licensing at upload; and a public read-only API lets researchers discover, cite, and retrieve datasets by object key.

I will self-populate Engram with openly-licensed imagined speech EEG datasets from my own recording sessions as the first public corpus.

How does the projected outcome serve the Foundation’s mission of user-owned data? What problem does your project solve?

The Sia Foundation’s mission is user-owned data. Engram is a direct application of that mission to scientific research - a domain where the cost of platform-controlled data is uniquely high.

EEG datasets are generated by researchers but owned by institutions, stored on platforms, and subject to platform decisions. When a university’s storage contract lapses, when a cloud provider changes terms, when GitHub’s file size limit makes raw data unpublishable - the researcher loses access to data they created, and the scientific community loses the ability to replicate or build on it.

Engram returns ownership to the researcher. Every dataset uploaded via the Sia Storage SDK is stored against indexd-managed contracts and retrievable by anyone with the object key, without routing through a centralised gatekeeper, without dependence on any institution’s subscription, and most importantly, without any platform able to make a unilateral deletion decision.

Storage is budgeted and renewable, not promised as permanent. Every dataset is assigned a stable object key at upload and accompanied by a SHA-256 hash of its contents, computed client-side and stored as metadata, so retrieval can be verified against tampering or corruption.
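
As a rough illustration of that integrity check, here is a minimal TypeScript sketch. It assumes Node's built-in crypto module for brevity; in the browser the same digest would come from the Web Crypto API (crypto.subtle.digest). The names sha256Hex, IntegrityRecord, and verifyDownload are hypothetical and are not part of the Sia Storage SDK or of Engram's actual code.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 digest of a file's bytes.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Hypothetical metadata record: the object key assigned at upload plus the
// hash that was computed client-side before the upload.
interface IntegrityRecord {
  objectKey: string;
  sha256: string;
}

// Returns true only if the retrieved bytes hash to the value recorded at upload.
function verifyDownload(record: IntegrityRecord, retrieved: Uint8Array): boolean {
  return sha256Hex(retrieved) === record.sha256.toLowerCase();
}
```

A mismatch means the retrieved bytes differ from what was uploaded; it does not say whether the cause was corruption or tampering.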

Because the data involves human subjects, all subject-identifying header fields are stripped client-side before upload. They are never stored and never transmitted.

Are you a resident of any jurisdiction on that list? No

Will your payment bank account be located in any jurisdiction on that list? No

Grant Specifics

Amount of money requested and justification with a reasonable breakdown of expenses: US$9,500

2 months of developer fees for @jackietanyen (US$4,750 per month)

What is the high-level architecture overview for the grant? What security best practices are you following? Please review our Development Guide for further details.

Engram is a React application using Supabase for account management and indexd via the Sia Storage SDK for all dataset storage. A researcher signs in via ORCID OAuth, handled through Supabase Auth. On first sign-in, Engram generates a BIP-39 recovery phrase in the browser and derives a Sia App Key from it via @siafoundation/sia-storage. The App Key is encrypted client-side and stored against the researcher’s Supabase account, allowing recovery if the browser’s local storage is lost. The recovery phrase itself is shown once during onboarding and never persisted anywhere.

To upload a dataset, the researcher selects an EEG file from their local machine. Before anything leaves the browser, Engram reads the first 256 bytes of the file - the fixed-length part of the EDF/BDF header - and zeroes the plain-text fields for patient name, date of birth, and recording date. This is a binary operation that requires no server call. The anonymised file is then uploaded directly to Sia, along with packed metadata objects containing the recording parameters, montage, task, equipment, and license. No EEG data ever passes through an Engram server. The upload returns an object key that researchers can cite in published papers.
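
The header scrubbing step can be sketched as follows. The byte offsets follow the published EDF layout (8 bytes version, 80 bytes patient identification, 80 bytes recording identification, 8 bytes start date). Two things here are assumptions of this sketch rather than anything stated in the proposal: it overwrites fields with space-padded ASCII placeholders instead of NUL bytes, since EDF header fields are specified as printable ASCII, and it also overwrites the recording identification field, because EDF+ files repeat the start date there. All function and constant names are hypothetical.

```typescript
// Position of a field inside the fixed 256-byte part of an EDF/BDF header.
interface HeaderField {
  offset: number;
  length: number;
}

const FIXED_HEADER_BYTES = 256;
const PATIENT_FIELD: HeaderField = { offset: 8, length: 80 };    // name and birthdate live here
const RECORDING_FIELD: HeaderField = { offset: 88, length: 80 }; // EDF+ repeats the start date here
const START_DATE_FIELD: HeaderField = { offset: 168, length: 8 }; // dd.mm.yy

// Write `text` into a fixed-width ASCII field, padding the rest with spaces.
function writeAsciiField(buf: Uint8Array, field: HeaderField, text: string): void {
  if (text.length > field.length) throw new Error("Placeholder is longer than the field");
  for (let i = 0; i < field.length; i++) {
    buf[field.offset + i] = i < text.length ? text.charCodeAt(i) : 0x20;
  }
}

// Read a fixed-width ASCII field back as a string (including padding).
function readAsciiField(buf: Uint8Array, field: HeaderField): string {
  let out = "";
  for (let i = 0; i < field.length; i++) {
    out += String.fromCharCode(buf[field.offset + i]);
  }
  return out;
}

// Returns a copy of the file with identifying header fields overwritten.
// The input is left untouched and the signal data is copied unchanged.
function scrubEdfHeader(file: Uint8Array): Uint8Array {
  if (file.length < FIXED_HEADER_BYTES) {
    throw new Error("File is shorter than the 256-byte fixed EDF/BDF header");
  }
  const scrubbed = file.slice();
  writeAsciiField(scrubbed, PATIENT_FIELD, "X X X X");             // assumed placeholder
  writeAsciiField(scrubbed, RECORDING_FIELD, "Startdate X X X X"); // assumed placeholder
  writeAsciiField(scrubbed, START_DATE_FIELD, "01.01.85");         // assumed placeholder date
  return scrubbed;
}
```

In the browser, the input bytes would typically come from `new Uint8Array(await file.arrayBuffer())` on the selected File. This sketch does not touch the per-signal headers that follow the first 256 bytes, nor any annotation channels, which would need separate handling if they carry identifying text.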

The public discovery interface requires no account: anyone can search datasets by modality, task, or equipment and retrieve files by object key. A researcher can retract a dataset via the Engram dashboard at any time; Engram removes it from the discovery index and deactivates the object key immediately. Underlying Sia storage contracts expire on their natural schedule.
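
The search and retraction behaviour described above can be sketched with an in-memory index. This is only an illustration: the real index would presumably live in Supabase, and the record shape, field names, and function names below are hypothetical.

```typescript
// Hypothetical discovery-index record; not Engram's actual schema.
interface DatasetRecord {
  objectKey: string;
  modality: string;
  task: string;
  equipment: string;
  license: string;
  retracted: boolean;
}

// Every filter is optional; omitted filters match all records.
interface DatasetQuery {
  modality?: string;
  task?: string;
  equipment?: string;
  license?: string;
}

function fieldMatches(value: string, wanted: string | undefined): boolean {
  return wanted === undefined || value.toLowerCase() === wanted.toLowerCase();
}

// Public search: retracted datasets never appear in results.
function searchDatasets(index: readonly DatasetRecord[], query: DatasetQuery): DatasetRecord[] {
  return index.filter(
    (r) =>
      !r.retracted &&
      fieldMatches(r.modality, query.modality) &&
      fieldMatches(r.task, query.task) &&
      fieldMatches(r.equipment, query.equipment) &&
      fieldMatches(r.license, query.license)
  );
}

// Retraction marks the record instead of deleting it, returning a new index.
function retractDataset(index: readonly DatasetRecord[], objectKey: string): DatasetRecord[] {
  return index.map((r) => (r.objectKey === objectKey ? { ...r, retracted: true } : r));
}

// Retrieval by object key: a retracted key no longer resolves.
function resolveObjectKey(index: readonly DatasetRecord[], objectKey: string): DatasetRecord | undefined {
  return index.find((r) => r.objectKey === objectKey && !r.retracted);
}
```

Marking a record as retracted rather than deleting it keeps an audit trail and matches the framing of retraction as access revocation: the stored bytes remain on Sia until the contracts expire, but the key stops resolving through Engram.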

Security: raw EEG data never touches Supabase, and the App Key is stored there only in encrypted form. Supabase holds only the encrypted App Key, ORCID identity, and dataset metadata (object keys, recording parameters, file hashes). Header anonymisation and SHA-256 hashing run client-side before any data is transmitted to Sia. Supabase Row Level Security ensures researchers can only modify their own dataset records.

What are the goals of this small grant? Please provide a general timeline for completion.

Month 1

  • Supabase project setup: ORCID OAuth via Supabase Auth, dataset metadata schema, Row Level Security policies

  • ORCID sign-in and first-visit onboarding flow

  • BIP-39 recovery phrase generation in the browser, App Key derivation via @siafoundation/sia-storage, App Key persisted to browser secure storage

  • Scrubbing of EDF/BDF headers to anonymise the data: client-side binary zeroing of patient name, date of birth, and recording date fields before upload

  • Dataset upload pipeline: anonymised file uploaded directly to Sia, packed metadata objects (recording parameters, montage, task, equipment, license) uploaded alongside

  • Object key returned and displayed to researcher with a copyable citation reference

  • Researcher dashboard showing their own uploaded datasets with object keys and pin state

tl;dr - a researcher should be able to sign in to Engram and upload their dataset

Month 2

  • Dataset retraction: unpin via SDK, immediate removal from discovery index, object key deactivated

  • Public discovery interface: search by modality, task, equipment, and license; retrieve by object key with no account required

  • Self-population of Engram with @jackietanyen’s EEG data as the first openly-licensed public dataset

  • Documentation and README with build and self-hosting instructions

  • MIT license, public open-source release

tl;dr - a regular user should be able to find datasets on Engram and retrieve them

Who is the target user for your project?

The primary user is a neuroscience or BCI researcher who generates EEG datasets and needs a verifiable, citable, researcher-controlled home for their data that is independent of any institution’s infrastructure. This includes academic lab researchers, clinical neurophysiology teams, and independent BCI developers working with open-source hardware such as OpenBCI and Emotiv.

The secondary user is a researcher or ML engineer who needs open EEG datasets for model training, replication studies, or benchmarking and currently cannot find the raw recordings because they were never published, only the processed output.

What are your plans for this project following the grant?

First, I will gather user feedback and iterate on the platform. Second, Engram will expand beyond EEG to accept other biosignal formats such as EMG, fNIRS, and MEG as the dataset corpus grows and community demand becomes clear. The ORCID attribution layer and object key citation model generalise to any scientific dataset format without architectural changes.

Storage costs for the open corpus will be sustained through a freemium model: open and CC0-licensed datasets remain free to deposit, while researchers requiring private or embargoed storage prior to publication pay a nominal hosting fee. This keeps the open science layer free indefinitely while covering ongoing Sia contract renewals.

If the community response warrants it, a Standard Grant proposal will follow to build institutional depositor onboarding, bulk ingest tooling for existing lab archives, and a DOI minting integration so Engram datasets are directly citable in journal submissions.

Potential risks that will affect the outcome of the project:

| Risk | Impact | Mitigation |
| --- | --- | --- |
| @siafoundation/sia-storage API changes during development | Upload/download pipeline breaks mid-build | Pin SDK to exact version at project start; monitor SiaFoundation/sia-storage-js changelog weekly |
| indexd first-run sync and contract formation takes longer than expected | Month 1 milestone delayed | Use the hosted sia.storage indexer so no wallet funding, no sync wait, and no contract formation are required |
| ORCID OAuth integration complexity | Onboarding flow delayed | ORCID provides a standard OAuth 2.0 API with well-documented flows; scoped to read-only public profile only |
| EDF/BDF format variants across manufacturers | Header anonymisation misses fields in non-standard implementations | Test against files from OpenBCI, BrainProducts, Nihon Kohden, and EMOTIV; MIC has access to recordings from multiple hardware sources |
| Large file upload reliability in the browser | Uploads fail or stall for files above 1GB | Use streaming upload via SDK with resumable chunking; set explicit file size warnings in the UI |
| Dataset retraction window; Sia contracts expire on their own schedule | Researcher expects immediate physical deletion | Clearly communicated in the UI at point of retraction; framed as access revocation, not bit destruction |
| Cold-start: no datasets on launch | Discovery interface appears empty | Jackie self-populates with imagined speech EEG corpus on day one of Month 2 |
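
For the large-file mitigation, the bookkeeping behind resumable chunking can be sketched independently of the SDK. This is a generic illustration, not the Sia Storage SDK's streaming API: the chunk size, the names planChunks, remainingChunks, and shouldWarnLargeFile, and the idea of tracking completed chunk indices client-side are all assumptions.

```typescript
// A half-open byte range [start, end) within the file.
interface ChunkRange {
  index: number;
  start: number;
  end: number;
}

// Split a file of `fileSize` bytes into consecutive chunks of at most `chunkSize` bytes.
function planChunks(fileSize: number, chunkSize: number): ChunkRange[] {
  if (!Number.isInteger(fileSize) || fileSize < 0) throw new Error("fileSize must be a non-negative integer");
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) throw new Error("chunkSize must be a positive integer");
  const chunks: ChunkRange[] = [];
  for (let start = 0, index = 0; start < fileSize; start += chunkSize, index++) {
    chunks.push({ index, start, end: Math.min(start + chunkSize, fileSize) });
  }
  return chunks;
}

// After a stall or reload, only chunks not yet confirmed need to be sent again.
function remainingChunks(plan: readonly ChunkRange[], completed: ReadonlySet<number>): ChunkRange[] {
  return plan.filter((c) => !completed.has(c.index));
}

// Threshold for the UI warning; 1 GiB is used here to mirror the "above 1GB" risk.
const LARGE_FILE_WARNING_BYTES = 1024 * 1024 * 1024;

function shouldWarnLargeFile(fileSize: number): boolean {
  return fileSize > LARGE_FILE_WARNING_BYTES;
}
```

In a browser, each range would map to `file.slice(start, end)` on the selected File, so chunks can be read lazily without loading the whole recording into memory.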

Development Information

Will all of your project’s code be open-source?

Yes. All code will be released under the MIT license. Engram has no closed-source components.

Leave a link where code will be accessible for review.

Repository: https://github.com/randomacy/engram

Do you agree to submit monthly progress reports?

Yes

Contact info

Email: [email protected]

Any other preferred contact methods: Telegram - @jackietanyen, Discord - @randomacy

NB: This is a resubmission of an earlier proposal - Small Grant: Engram - Open Biosignal Dataset Preservation and Sharing on Sia

I will be addressing the feedback by @mecsbecs and the committee in the following reply

Thanks for the detailed feedback, @mecsbecs. I’ve revised the proposal to address both points directly.

On content-addressable storage: the original proposal incorrectly described Sia object keys as content-derived/cryptographic hashes. Apologies for that. The revised proposal now describes object keys as stable, application-assigned identifiers (consistent with how Sia Storage actually works), with integrity verification handled separately; Engram computes and stores its own SHA-256 hash of each file as metadata at upload time, which is checked against the retrieved file on download.

On health data regulation: Engram’s anonymisation model follows established de-identification practice in open neuroscience data sharing. Under HIPAA, de-identification can be achieved through removal of the 18 personal identifiers defined by the Safe Harbor method (link, page 97). For EEG, the relevant identifiers (patient name, date of birth, recording date) live in the EDF/BDF file header as plain text. Engram zeroes these fields client-side before any data leaves the browser, so anonymisation happens automatically at ingest rather than relying on manual review after the fact. This covers the structured identifier fields defined by EDF/BDF. It doesn’t by itself guarantee that signal data can never be re-identified through advanced techniques; that is a general limitation of open biosignal sharing rather than something specific to Engram, and it is part of why the consent acknowledgement step exists.

Consent and ethics-approval responsibility sits with the researcher, same as any open data repository. Engram’s upload flow includes an explicit acknowledgement step confirming the dataset complies with the ethics approval under which it was collected.

For datasets not suitable for fully open release, the freemium model described in “plans following the grant” leaves room for a credentialed-access tier in future - fully open/CC0 datasets via the automatic anonymisation pipeline, with gated access as an option for datasets requiring additional protection. A privacy policy covering all of this will be published before launch as part of Month 2.

Additionally, on proof of Sia experience: since the original submission, I’ve built and shipped Pastepin (GitHub: Randomacy/Pastepin), a decentralized Pastebin demonstrating client-side encrypted upload/download via the Sia Storage SDK against the hosted sia.storage indexer, with object retrieval by key. A demo video is in the repo.

Happy to answer any further questions on the revised proposal.