Small Grant: Engram - Open Biosignal Dataset Preservation and Sharing on Sia

Introduction

Project Name: Engram - Open Biosignal Dataset Preservation and Sharing on Sia

Name of the organization or individual submitting the proposal: Jackie Tan, based in Singapore. Two-time Stellar Community Fund recipient (cross-border payments tooling, both delivered (link)). Web3 developer since 2018; authored an L3 blockchain protocol. Serial entrepreneur in the fintech and data science space - 4x founder. PhD in biology and data science. Currently an active brain-computer interface researcher building neural intent decoders for people with disabilities.

Personal site with list of 40+ applications built: apps — jackie tan yen

LinkedIn: https://www.linkedin.com/in/jackietanyen/

Describe your project.

Engram is an open-source platform for preserving and sharing scientific biosignal datasets, starting with EEG data, built natively on the Sia Storage SDK and indexd.

Two experiences motivated this. First: during my research, I contacted the author of a published electroencephalogram (EEG) dataset because the full recordings were missing from the repository. He told me GitHub couldn’t store files that large, so he uploaded only the processed output. The raw data, the only part useful for replication, was never published. It still doesn’t exist anywhere. Second: one day Yahoo unilaterally deleted a decade of my personal emails, including correspondence with people who are no longer alive. All of it is permanently gone. A platform made a business decision and my data went with it.

These aren’t edge cases. Neuroscience labs lose EEG recordings to server failures and institutional account lapses. Researchers publish processed summaries because no infrastructure exists for the raw files. Decades of irreplaceable human neural recordings disappear because there is no domain-specific, researcher-controlled archival layer for biosignal data.

Engram solves this: an ingest pipeline built on the Sia Storage SDK uploads EEG files (EDF/BDF/GDF) via indexd; a metadata layer attaches provenance, recording parameters, and licensing at upload; a public read-only API lets researchers discover, cite, and retrieve datasets by object key. Storage is budgeted, renewable, and content-addressed: every object is identified by a cryptographic hash, making retrieval verifiable and tampering evident.

I will self-populate Engram with openly licensed imagined speech EEG datasets from my own recording sessions as the first public corpus.

How does the projected outcome serve the Foundation’s mission of user-owned data? What problem does your project solve?

The Sia Foundation’s mission is user-owned data. Engram is a direct application of that mission to scientific research - a domain where the cost of platform-controlled data is uniquely high.

EEG datasets are generated by researchers but owned by institutions, stored on platforms, and subject to platform decisions. When a university’s storage contract lapses, when a cloud provider changes terms, when GitHub’s file size limit makes raw data unpublishable - the researcher loses access to data they created. This also means that the scientific community loses the ability to replicate or build on it.

Engram returns ownership to the researcher. Every dataset uploaded via the Sia Storage SDK is stored against indexd-managed contracts, identified by a cryptographic content hash, and retrievable by anyone with the CID: without routing through a centralised gatekeeper, without dependence on any institution’s subscription, and, most importantly, without any platform able to make a unilateral deletion decision.

Storage is explicitly budgeted and renewable, not promised as permanent; this is the same model researchers already use for physical archives. The difference is that Engram makes the ownership cryptographic rather than contractual, and the access public rather than institutional.

Are you a resident of any jurisdiction on that list? No

Will your payment bank account be located in any jurisdiction on that list? No

Grant Specifics

Amount of money requested and justification with a reasonable breakdown of expenses: US$9500

2 months of developer fees for @jackietanyen

What is the high-level architecture overview for the grant? What security best practices are you following? Please review our Development Guide for further details.

Engram is a client-side React application with no server components. A researcher arrives at Engram and signs in with their ORCID - the persistent digital identifier used across academic publishing.

On first sign-in, Engram generates a BIP-39 recovery phrase in the browser and derives a Sia App Key from it via @siafoundation/indexd-js. The App Key is stored in the browser’s secure storage; the recovery phrase is shown once and never persisted. The researcher’s ORCID is stored alongside their App Key as their public identity - every dataset they publish is attributed to a verified researcher record, not an anonymous account.

To upload a dataset, the researcher selects an EEG file from their local machine. Before anything leaves the browser, Engram reads the first 256 bytes of the file - the fixed-length EDF/BDF header - and zeroes the plain-text fields for patient name, date of birth, and recording date. This is a binary operation that requires no server call. The anonymised file is then uploaded directly to Sia, along with packed metadata objects containing the recording parameters, montage, task, equipment, and license. No EEG data ever passes through a server. The upload returns a content-addressed object key that verifiably identifies this exact dataset. Researchers can cite it in published papers, and the reference remains verifiable indefinitely.
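
The header step described above can be sketched as follows. This is a minimal TypeScript sketch under stated assumptions, not Engram's implementation: it assumes the standard EDF layout, in which the 256-byte fixed header holds an 80-byte patient identification field at offset 8 and an 8-byte start date at offset 168. It overwrites both with space-padded ASCII placeholders rather than NUL bytes, on the assumption that downstream readers expect printable header fields. The function names and placeholder values are hypothetical. The 80-byte recording identification field at offset 88 can also carry a date in EDF+ files and is left untouched here.

```typescript
// Fixed 256-byte EDF/BDF header layout (offsets assumed from the EDF specification).
const FIXED_HEADER_BYTES = 256;
const PATIENT_FIELD = { offset: 8, length: 80 }; // name and birthdate live here in EDF+
const START_DATE_FIELD = { offset: 168, length: 8 }; // "dd.mm.yy"

// Hypothetical placeholders chosen for this sketch.
const ANON_PATIENT = "X X X X";
const ANON_START_DATE = "01.01.85";

// Overwrite a fixed-width field with ASCII text, right-padded with spaces.
function writeAsciiField(buf: Uint8Array, offset: number, length: number, text: string): void {
  for (let i = 0; i < length; i++) {
    buf[offset + i] = i < text.length ? text.charCodeAt(i) : 0x20;
  }
}

// Read a fixed-width ASCII field and strip the trailing padding.
function readAsciiField(buf: Uint8Array, offset: number, length: number): string {
  let text = "";
  buf.subarray(offset, offset + length).forEach((byte) => {
    text += String.fromCharCode(byte);
  });
  return text.replace(/\s+$/, "");
}

// Return a copy of the file with the identifying header fields replaced.
// The input buffer is not modified, and the signal data after the header is untouched.
function anonymiseEdfHeader(file: Uint8Array): Uint8Array {
  if (file.length < FIXED_HEADER_BYTES) {
    throw new Error("file is shorter than the 256-byte fixed header");
  }
  const out = new Uint8Array(file); // copies the bytes
  writeAsciiField(out, PATIENT_FIELD.offset, PATIENT_FIELD.length, ANON_PATIENT);
  writeAsciiField(out, START_DATE_FIELD.offset, START_DATE_FIELD.length, ANON_START_DATE);
  return out;
}
```

In the browser, the input could come from `new Uint8Array(await file.arrayBuffer())` on the selected `File`, so the whole step runs before any network call. Manufacturer-specific header variants would need additional handling beyond this sketch.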

The public discovery interface requires no account: anyone can search datasets by modality, task, or equipment and retrieve files by object key. A researcher can retract a dataset via the Engram dashboard at any time; Engram removes it from the discovery index and deactivates the object key immediately. Underlying Sia storage contracts expire on their natural schedule.
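
The search and retraction behaviour described above can be illustrated with a small in-memory index. This is only a sketch: the record fields, the `DiscoveryIndex` class, and the rule that a retraction must come from the depositing ORCID are assumptions made for illustration, and where the real index would live is not specified in this proposal.

```typescript
// Hypothetical metadata record; field names are illustrative, not Engram's schema.
interface DatasetRecord {
  objectKey: string;
  orcid: string;
  modality: string;
  task: string;
  equipment: string;
  license: string;
}

interface DatasetQuery {
  modality?: string;
  task?: string;
  equipment?: string;
}

// True when no filter was given or the value matches it, ignoring case.
function fieldMatches(value: string, wanted?: string): boolean {
  return wanted === undefined || value.toLowerCase() === wanted.toLowerCase();
}

class DiscoveryIndex {
  private records: DatasetRecord[] = [];

  // Publishing the same object key again replaces the earlier entry.
  publish(record: DatasetRecord): void {
    this.records = this.records.filter((r) => r.objectKey !== record.objectKey);
    this.records.push(record);
  }

  // Remove the entry if the caller's ORCID matches the depositor.
  // Returns whether anything was removed; stored bytes are not affected.
  retract(objectKey: string, orcid: string): boolean {
    const before = this.records.length;
    this.records = this.records.filter(
      (r) => !(r.objectKey === objectKey && r.orcid === orcid)
    );
    return this.records.length < before;
  }

  // Every filter that is present must match; an empty query returns everything.
  search(query: DatasetQuery): DatasetRecord[] {
    return this.records.filter(
      (r) =>
        fieldMatches(r.modality, query.modality) &&
        fieldMatches(r.task, query.task) &&
        fieldMatches(r.equipment, query.equipment)
    );
  }

  // Look up a single dataset; retracted keys resolve to undefined.
  resolve(objectKey: string): DatasetRecord | undefined {
    const hits = this.records.filter((r) => r.objectKey === objectKey);
    return hits.length > 0 ? hits[0] : undefined;
  }
}
```

Once `retract` succeeds, both `search` and `resolve` stop returning the dataset, which mirrors the "access revocation" framing: the index entry disappears immediately while the underlying storage contracts run out on their own schedule.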

Security: the App Key never leaves the browser, the recovery phrase is shown once and never stored, and header anonymisation runs locally before any data is transmitted. The indexd Admin API is bound to localhost only. Engram has no database of user credentials: identity is the ORCID, storage access is the App Key, and both live with the researcher.

What are the goals of this small grant? Please provide a general timeline for completion.

Month 1

  • ORCID sign-in and first-visit onboarding flow
  • BIP-39 recovery phrase generation in the browser, App Key derivation via @siafoundation/indexd-js, App Key persisted to browser secure storage
  • Scrubbing of EDF/BDF headers to anonymise the data: client-side binary zeroing of patient name, date of birth, and recording date fields before upload
  • Dataset upload pipeline: anonymised file uploaded directly to Sia, packed metadata objects (recording parameters, montage, task, equipment, license) uploaded alongside
  • Object key returned and displayed to researcher with a copyable citation reference
  • Researcher dashboard showing their own uploaded datasets with object keys and pin state

tl;dr - a researcher should be able to sign in to Engram and upload their dataset

Month 2

  • Dataset retraction: unpin via SDK, immediate removal from discovery index, object key deactivated
  • Public discovery interface: search by modality, task, equipment, and license; retrieve by object key with no account required
  • Self-population of Engram with @jackietanyen EEG data as the first public dataset
  • Documentation and README with build and self-hosting instructions
  • MIT license, public open-source release

tl;dr - a regular user should be able to find datasets on Engram and retrieve them

Who is the target user for your project?

The primary user is a neuroscience or BCI researcher who generates EEG datasets and needs a verifiable, citable, researcher-controlled home for their data that is independent of any institution’s infrastructure. This includes academic lab researchers, clinical neurophysiology teams, and independent BCI developers working with open-source and consumer-grade hardware such as OpenBCI and EMOTIV.

The secondary user is a researcher or ML engineer who needs open EEG datasets for model training, replication studies, or benchmarking and currently cannot find the raw recordings because only the processed output was ever published.

What are your plans for this project following the grant?

First, I will gather user feedback and iterate on the platform. Second, Engram will expand beyond EEG to accept other biosignal formats such as EMG, fNIRS, and MEG as the dataset corpus grows and community demand becomes clear. The ORCID attribution layer and content-addressed citation model generalise to any scientific dataset format without architectural changes.

Storage costs for the open corpus will be sustained through a freemium model: open and CC0-licensed datasets remain free to deposit, while researchers requiring private or embargoed storage prior to publication pay a nominal hosting fee. This keeps the open science layer free indefinitely while covering ongoing Sia contract renewals.

If the community response warrants it, a Standard Grant proposal will follow to build institutional depositor onboarding, bulk ingest tooling for existing lab archives, and a DOI minting integration so Engram datasets are directly citable in journal submissions.

Potential risks that will affect the outcome of the project:

| Risk | Impact | Mitigation |
| --- | --- | --- |
| @siafoundation/indexd-js API changes during development | Upload/download pipeline breaks mid-build | Pin SDK to exact version at project start; monitor SiaFoundation/web changelog weekly |
| indexd first-run sync and contract formation takes longer than expected | Month 1 milestone delayed | Use the hosted sia.storage indexer so no wallet funding, no sync wait, and no contract formation are required |
| ORCID OAuth integration complexity | Onboarding flow delayed | ORCID provides a standard OAuth 2.0 API with well-documented flows; scoped to read-only public profile only |
| EDF/BDF format variants across manufacturers | Header anonymisation misses fields in non-standard implementations | Test against files from OpenBCI, BrainProducts, Nihon Kohden, and EMOTIV; MIC has access to recordings from multiple hardware sources |
| Large file upload reliability in the browser | Uploads fail or stall for files above 1GB | Use streaming upload via SDK with resumable chunking; set explicit file size warnings in the UI |
| Dataset retraction window; Sia contracts expire on their own schedule | Researcher expects immediate physical deletion | Clearly communicated in the UI at point of retraction; framed as access revocation, not bit destruction |
| Cold-start: no datasets on launch | Discovery interface appears empty | Jackie self-populates with imagined speech EEG corpus on day one of Month 2 |
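
The resumable chunking and file size warning mentioned for large uploads can be sketched in isolation from the SDK. The helper names, the chunk bookkeeping, and the 1 GiB warning threshold below are assumptions for illustration; how the Sia Storage SDK itself streams or resumes uploads is not shown here.

```typescript
// A half-open byte range [start, end) within the file.
interface ChunkRange {
  index: number;
  start: number;
  end: number;
}

// Hypothetical threshold for showing a size warning in the UI (1 GiB).
const SIZE_WARNING_BYTES = 1024 * 1024 * 1024;

// Split a file of fileSize bytes into consecutive ranges of at most chunkSize bytes.
function planChunks(fileSize: number, chunkSize: number): ChunkRange[] {
  if (fileSize < 0 || Math.floor(fileSize) !== fileSize) {
    throw new Error("fileSize must be a non-negative integer");
  }
  if (chunkSize <= 0 || Math.floor(chunkSize) !== chunkSize) {
    throw new Error("chunkSize must be a positive integer");
  }
  const chunks: ChunkRange[] = [];
  for (let start = 0, index = 0; start < fileSize; start += chunkSize, index++) {
    chunks.push({ index, start, end: Math.min(start + chunkSize, fileSize) });
  }
  return chunks;
}

// After an interruption, keep only the chunks whose index was not recorded as done.
function pendingChunks(plan: ChunkRange[], completed: number[]): ChunkRange[] {
  return plan.filter((chunk) => completed.indexOf(chunk.index) === -1);
}

// Whether the UI should warn before starting the upload.
function needsSizeWarning(fileSize: number): boolean {
  return fileSize > SIZE_WARNING_BYTES;
}
```

In a browser, each range maps directly onto `file.slice(start, end)`, so a stalled upload can resume from the list returned by `pendingChunks` instead of restarting from byte zero.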

Development Information

Will all of your project’s code be open-source?

Yes. All code will be released under the MIT license. Engram has no closed-source components.

Leave a link where code will be accessible for review.

Repository: https://github.com/randomacy/engram

Do you agree to submit monthly progress reports?

Yes

Contact info

Email: [email protected]

Any other preferred contact methods: Telegram - @jackietanyen, Discord - @randomacy

Hi @jackietanyen - welcome to the Sia community! Thank you for your proposal.

Given you’re dealing with health data, what privacy measures will you be utilizing? Will the data sets be anonymized in public view? Will you be installing a privacy policy?

Note: please respond to the above by this Wednesday, June 3 at 5pm ET in order for this proposal to be considered for next week’s Committee meeting.

Hey @mecsbecs! Thank you for the warm welcome, and it’s my pleasure to contribute.

Appreciate the questions on privacy (it’s something I can talk about all day). But to summarize, four points on how we safeguard the privacy and anonymity of health data:

  1. Anonymization before upload
    Before any file leaves the browser, Engram zeroes the plain-text fields in the EDF/BDF file header such as patient name, date of birth, and recording date at the binary level. This runs client-side with no server involvement, so raw subject identifiers never transit the network at any point.
  2. What is publicly available
    The public discovery interface exposes only researcher-supplied metadata such as recording parameters (sampling rate, channel count, electrode montage, device model), experimental task, and license. No subject-level information is stored or displayed. Datasets are attributed to the uploader’s ORCID (their public researcher identity) and not to any study participant.
  3. Researcher responsibility
    While Engram handles format-level anonymization automatically, researchers are also responsible for ensuring their datasets comply with the ethics approval and consent terms under which the data was collected. The model is consistent with existing open dataset repositories such as Mendeley and PhysioNet, which operate similarly.
  4. Privacy policy
    Yes, a privacy policy will be published at the Engram domain before launch, covering what data is collected (ORCID identifier and uploaded metadata), what is not collected (no raw subject data, no credentials), and how the retraction mechanism works. As part of the UX, the researcher will be prompted to acknowledge the privacy policy and confirm that their dataset complies with the ethics approval under which it was collected.

Thank you for the questions!

Great, thank you for this thorough response @jackietanyen. This proposal will be reviewed by the Committee at its meeting next Tuesday, June 9, and the response will be posted here before the end of next week.

@mecsbecs Awesome, thank you so much. Looking forward to more questions and/or feedback. Have a great week ahead!

Thanks for your proposal to The Sia Foundation Grants Program.

After review, the Committee has decided to reject your proposal citing the following reasons:

  • This proposal relies on functionality that doesn’t exist in Sia.Storage i.e. content addressable storage.

  • The Committee is worried about the heavy regulations of the health industry and how this will impact the project.

We’ll be moving this to the Rejected section of the Forum.

Thanks again for your proposal, and you’re always welcome to submit new requests if you feel you can address the Committee’s concerns. Please bear in mind that, if you do resubmit, we will require proof of experience building on Sia, which is a new requirement as of this week. Details can be found on the Grants webpage and are reflected in the revised Small Grant proposal template.