Small Grant: ProofChain

(P.S.: This is a self-contained Sia project, since there were doubts about the cross-chain integration, and I was told there would be a meeting next week.)

Introduction

Project Name:
ProofChain: Verifiable AI Training with Sia

Name of the organization or individual submitting the proposal:
Hridyansh (Independent Researcher & Builder)

Describe your project.
ProofChain is a cryptographic provenance framework that enables AI developers to prove their models were trained on specific datasets. The system uses dataset hashing, Merkle commitments, and decentralized storage on Sia to anchor training data in a tamper‑proof, user‑owned environment. Training attestations (dataset hash, code hash, model weights hash) are then published as verifiable certificates, ensuring transparency and compliance without exposing sensitive data.

The MVP will demonstrate (a sketch of the commitment step follows this list):

  • Dataset commitment and storage on Sia.
  • Training attestation generation.
  • A public registry of provenance certificates stored on Sia.
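As a rough illustration of the dataset-commitment step, here is a minimal Python sketch. The file name and chunk size are placeholders, and the Sia upload itself is only indicated in a comment, since the actual integration would go through the Sia renter software (e.g. renterd) rather than any call shown here:

```python
import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB leaves (placeholder choice)

def merkle_root(data: bytes) -> bytes:
    """Compute a binary SHA-256 Merkle root over fixed-size chunks."""
    leaves = [hashlib.sha256(data[i:i + CHUNK_SIZE]).digest()
              for i in range(0, len(data), CHUNK_SIZE)]
    if not leaves:
        leaves = [hashlib.sha256(b"").digest()]
    while len(leaves) > 1:
        if len(leaves) % 2:          # duplicate the last leaf on odd levels
            leaves.append(leaves[-1])
        leaves = [hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
                  for i in range(0, len(leaves), 2)]
    return leaves[0]

with open("dataset.bin", "rb") as f:   # hypothetical dataset file
    root = merkle_root(f.read())

# The hex root is the public dataset commitment; the (encrypted) dataset
# would be uploaded to Sia, and the commitment recorded in the registry.
print("dataset commitment:", root.hex())
```

Committing to chunks rather than the whole file means a verifier can later spot-check individual pieces of the dataset via Merkle proofs without re-downloading everything.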

How does the projected outcome serve the Foundation’s mission of user-owned data?
This project directly advances the mission of user‑owned data by:

  • Ensuring datasets remain under the control of their owners, encrypted and sharded across Sia’s decentralized network.
  • Preventing centralized platforms from monopolizing AI provenance by anchoring proofs in a user‑owned, decentralized storage layer.
  • Empowering developers and organizations to prove compliance and transparency without ceding control of their data to third parties.

By making Sia the backbone of AI provenance, we extend its role from decentralized storage to decentralized trust infrastructure.

Are you a resident of any jurisdiction on that list?
No

Will your payment bank account be located in any jurisdiction on that list?
No

Grant Specifics

Amount of money requested: $9,500

Budget Breakdown

| Category | Amount (USD) | Description & Justification | Month |
| --- | --- | --- | --- |
| Development & Infrastructure | $5,000 | Covers developer time (coding, integration, documentation) and purchase of a dedicated server. The server will handle dataset hashing, Merkle tree generation, and computationally heavy attestations, while also serving as a long‑term asset for future Sia‑integrated products. | 1st |
| Security & Code Review | $2,000 | Lightweight audit of cryptographic routines and Sia integration to ensure correctness, reliability, and credibility of the MVP. | 2nd |
| Documentation & Demo Materials | $1,500 | Preparation of technical documentation, open‑source repository setup, and creation of demo materials (e.g., walkthrough video, usage guides) to support adoption and transparency. | 3rd |
| Infrastructure & Miscellaneous | $1,000 | Sia storage contracts, bandwidth, test datasets, and incidental project costs. | 3rd |
| **Total** | **$9,500** | | |

Goals & Timeline

General Timeline for Completion (3 months):

  • Month 1
    • Implement dataset hashing + Merkle root generation
    • Integrate Sia API for dataset commitment storage
    • Test retrieval and verification
  • Month 2
    • Build training attestation module
      • Dataset hash
      • Code hash
      • Model weights hash
    • Store attestations on Sia as provenance certificates
  • Month 3
    • Develop lightweight verification tool (CLI; a bare-bones sketch follows this timeline)
    • Prepare documentation and open-source release
    • Deliver demo materials (walkthrough video, usage guide)
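To make the Month 3 deliverable concrete, a bare-bones version of the verification CLI could look like the sketch below. The certificate format (a flat JSON file with hash fields) is an assumption for illustration, not a finalized design:

```python
#!/usr/bin/env python3
"""Minimal provenance check: recompute a file's SHA-256 and compare it
against the value recorded in a ProofChain certificate (JSON)."""
import argparse
import hashlib
import json

def sha256_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def main() -> None:
    p = argparse.ArgumentParser(description="Verify a file against a certificate field")
    p.add_argument("certificate")  # path to the certificate JSON
    p.add_argument("field")        # e.g. dataset_hash, code_hash, model_weights_hash
    p.add_argument("file")         # local artifact to check
    args = p.parse_args()

    with open(args.certificate) as f:
        cert = json.load(f)

    ok = cert.get(args.field) == sha256_file(args.file)
    print("MATCH" if ok else "MISMATCH")

if __name__ == "__main__":
    main()
```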

Potential Risks

  • Performance constraints: Large datasets may increase hashing/storage costs.
  • Adoption barrier: Enterprises may hesitate to adopt new provenance standards.
  • Technical complexity: Integrating cryptographic proofs with ML workflows requires careful design.

Mitigation: Start with lightweight commitments + Sia anchoring, then expand to advanced proofs (ZKPs, TEEs) in later phases.

Development Information

Will all of your project’s code be open-source?
Yes. All code will be released under an MIT or Apache 2.0 license.

Link where code will be accessible for review:
Gravity-3d/ProofChain: A cryptographic proof provider for AI companies

Do you agree to submit monthly progress reports?
Yes.

Cloud vs. Dedicated Server Cost Comparison

| Option | Monthly Cost | 12-Month Cost | Notes |
| --- | --- | --- | --- |
| Cloud (1× A100 GPU instance) | ~$2,500 | ~$30,000 | Cheapest “serious” GPU instance, but still ~5× more expensive than owning hardware. |
| Cloud (8× A100 GPU cluster) | ~$24,000 | ~$288,000 | Typical for large AI training; far beyond grant scope. |
| Dedicated Server | ~$100 (power) | ~$6,200 | One‑time ~$5k purchase plus electricity. Fully owned, reusable asset. |

Conclusion:
A $5,000 dedicated server saves roughly $24,000 over a year compared to the cheapest cloud GPU option (~$30,000 cloud vs. ~$6,200 owned), while aligning with Sia’s mission of user‑owned infrastructure.

Contact Info

Email: [email protected]
Other preferred contact methods: Discord: @x73d

Hi @Gravity-3d - thank you for this new proposal. Do you have any public GitHub repos for the Committee to review as proof of previous work since you’re a new applicant?

Hey!

Just a quick heads-up: this is a separate proposal, not the one for StreamWeave. The StreamWeave proposal has been edited to fit the small-grant template, but the title was uneditable. I will provide a repo link once I reach home.

Kudos!

Hi,

I took a look at your proposal and was wondering how the relationship between the training data and the model weights is actually proven. It seems to me that the producer of a model could put an arbitrary hash in the certificate (of another dataset, for example). This seems important, as you state:

Empowering developers and organizations to prove compliance and transparency without ceding control of their data to third parties.

Which seems to imply people can generate their own certificates?

Best Regards

No, actually. We will run a zero-knowledge proof algorithm over the dataset and then match its output against the hashes generated during AI training. For the prototype, I intend to use Merkle roots, since they are the simplest and therefore the easiest to experiment with. The dataset would be stored on Sia and then run through a ZKP algorithm, so it can be verified without us ever seeing it. The AI training would then be done in an enclave, such as an Intel TEE, where each training step generates its own hash, and the correct relation between the hashes of:

  • Training Dataset
  • Initial AI weight
  • Final AI weight

would prove that the AI was trained on this dataset, and a certificate would then be issued (a minimal sketch of this binding follows).
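In code terms, a minimal sketch of the binding described above might look like the following. The helper names are hypothetical, and in the real design the record would be produced and signed inside the enclave rather than by an ordinary script:

```python
import hashlib
import json

def sha256_file(path: str) -> str:
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def make_attestation(dataset, initial_weights, final_weights):
    """Bind the three hashes into one record; the enclave would sign this."""
    record = {
        "dataset_hash": sha256_file(dataset),
        "initial_weights_hash": sha256_file(initial_weights),
        "final_weights_hash": sha256_file(final_weights),
    }
    # The certificate hash commits to the full record, so no single field
    # can be swapped out after the fact.
    record["certificate_hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record
```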

Hi,

Thank you for your answer; I still have some concerns, however. It is not entirely clear to me what running a ZK algorithm on a dataset means; could you clarify this? In addition:

  • How are you planning to match the output of a ZK proof with hashes generated during AI training?

  • Have you thought about the performance considerations of running ZK proofs and TEEs?

  • TEEs, as far as I know, are only supported on CPUs and not GPUs, which feels like a major limitation for ML. What are your thoughts on this front?

Personally, for your current plan I would like to see a feasibility study before doing anything along the lines of investing in expensive hardware. I think your effort would be better directed at designing an end-to-end flow, including how the cryptographic components interact, and then providing a very bare-bones implementation. With this you can (i) show the setup works, and (ii) show that it fits the target market with regard to performance/scalability/cost through benchmarks.

I would suggest skipping the Merkle-tree-based implementation, given that it is a different approach from the end goal and may not share the same security and performance characteristics.

Let me know if you agree/disagree; I think it is an interesting idea, but its complexity should not be underestimated.

Hi!

  • Right now, I am using Python’s hashlib library to generate the hashes that act as commitments (“proofs”) for the dataset.

  • While the exact proving algorithm uses advanced mathematics, on par with encryption, I will try to explain it in simple terms (see the toy sketch after this list). Suppose I have two secret variables, x and y, with values 3 and 4 respectively; these stand in for the dataset and the model weights. Next, I publish a derived fact that is safe for anyone to know, such as “their product is 12.” This plays the role of the proof of the dataset and the weights, i.e., the Merkle root or equivalent. It alone is not enough for anyone to recover the actual dataset or weights. Next, I create a check equation, such as x + y − 7, which should evaluate to 0 if both values are correct. I then plug the known facts into the equation: if the result is 0, I know both values are correct; if not, the prover has been caught cheating. In this toy case a company could reverse-engineer the other secret (knowing y is 4, it can deduce that x must be 3), but in real schemes the numbers are so large that this is computationally infeasible. And a company that could somehow break properly parameterized ZK algorithms would make far more money selling a technique that breaks encryption anywhere on the planet.

  • While there would be performance costs, the ability to prove their AI was trained on a good dataset would be a worthwhile investment for companies, much like car manufacturers incur the cost of sending their cars to NCAP tests because the resulting safety certificate makes it easier to sway customers.

  • Yes, Intel’s TEEs are a CPU-side technology, but I used TEE only as an example of a secure environment; the assumption is that an AI company’s entire GPU network is itself secure.
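For concreteness, here is the toy example from above as a few lines of Python. This is only an illustration of the commit-and-check intuition, not an actual zero-knowledge construction (real schemes replace the plain product and sum with cryptographic commitments and circuit machinery):

```python
# Toy commit-and-check illustration (NOT a real zero-knowledge proof).
x, y = 3, 4          # secrets: stand-ins for "dataset" and "model weights"

public_fact = x * y  # published derived fact, playing the role of the commitment
assert public_fact == 12

def check(x_claim: int, y_claim: int) -> bool:
    """The check equation x + y - 7 must be 0, and the claims must
    reproduce the published fact."""
    return x_claim + y_claim - 7 == 0 and x_claim * y_claim == public_fact

print(check(3, 4))   # True  -> claims consistent with the published fact
print(check(2, 5))   # False -> product check fails: caught cheating
```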

Yes, I already did provide a basic proving approach, but the expensive hardware is required to build a properly viable MVP, i.e., to use the strongest consumer-grade ZKP and AI training libraries. The circuits are incredibly complex and thus require powerful hardware to generate.

Edit: Kudos!

Thanks for your proposal to The Sia Foundation Grants Program.

After review, the committee has decided to reject your proposal citing the following reasons:

  • The Committee does not believe in supporting hardware purchases within grant proposals, and is not convinced a cloud server should not be used instead to develop a proof of concept through a Small grant.
  • The budget needs to be revisited, especially the audit line item, which appears to be too low.
  • The Committee would also like to see more substantial proof of previous work in your GitHub.

We’ll be moving this to the Rejected section of the forum. Thanks again for your proposal, and you’re always welcome to submit new requests if you feel you can address the committee’s concerns.