AI can’t scale without trust. Trust starts with the data layer


The following article is a guest post and opinion of Johanna Rose Cabildo, Founder and CEO of Data Guardians Network (D-GN).

The Illusion of Infinite Data

AI runs on data. But that data is increasingly unreliable, unethically sourced and fraught with legal risk.

Generative AI’s growth isn’t just accelerating. It’s devouring everything in its path. OpenAI was reportedly on track to spend $7 billion in 2024 just to keep its models running, against roughly $2 billion in annualized revenue. All the while, OpenAI’s and Anthropic’s bots were wreaking havoc on websites and raising alarm bells about data usage at scale, according to a report by Business Insider.

But the problem runs deeper than costs. AI is built on data pipelines that are opaque, outdated and legally compromised. “Data decay” is real – models trained on unverified, synthetic or stale data risk becoming less accurate over time, leading to flawed decision-making.

Legal challenges like the 12 US copyright lawsuits against OpenAI and Anthropic’s legal woes with authors and media outlets highlight an emerging crisis: AI isn’t bottlenecked by compute. It’s bottlenecked by trustworthy data supply chains.

When Synthetic Isn’t Enough and Scraping Won’t Scale

Synthetic data is a band-aid. Scraping is a lawsuit waiting to happen.

Synthetic data has promise for certain use cases – but it is not without pitfalls. It struggles to replicate the nuance and depth of real-world situations. In healthcare, for example, AI models trained on synthetic datasets can underperform in edge cases, risking patient safety. And as high-profile failures like Google’s Gemini model showed, skewed data reinforces bias rather than correcting it.

Meanwhile, scraping the internet isn’t just a PR liability; it’s a structural dead end. From the New York Times to Getty Images, lawsuits are piling up, and new regulations like the EU’s AI Act mandate strict data provenance standards. Tesla’s infamous “phantom braking” issue from 2022, caused in part by poor training data, shows what happens when data sources go unchecked.

While global data volumes are set to surpass 200 zettabytes by 2025 according to Cybersecurity Ventures, much of that data is unusable or unverifiable. Its provenance and context are missing. And without those, trust – and by extension, scalability – is impossible.

It’s clear we need a new paradigm. One where data is created trustworthy by default.

Refining Data with Blockchain’s Core Capabilities

Blockchain isn’t just for tokens. It’s the missing infrastructure for AI’s data crisis.

So, where does blockchain fit into this narrative? How does it tame the data chaos and stop AI systems from feeding on billions of data points without consent?

While “tokenization” captures headlines, it’s the architecture beneath that carries real promise. Blockchain enables the three features AI desperately needs at the data layer: traceability (provenance), immutability and verifiability. Together, they address the legal exposure, ethical challenges and data quality crisis described above.

Traceability ensures every dataset has a verifiable origin. Much like IBM’s Food Trust verifies farm-to-shelf logistics, we need model-to-source verification for training data. Immutability ensures no one can manipulate the record once critical information is stored on-chain.
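What might model-to-source verification look like in practice? Here is a minimal Python sketch – all names and fields are hypothetical, not any vendor’s API. Each dataset gets a content fingerprint and an origin record, a model manifest references the exact fingerprints it was trained on, and anyone can recompute the hashes to verify the lineage.

```python
import hashlib
from dataclasses import dataclass

def content_hash(data: bytes) -> str:
    """Fingerprint the raw dataset bytes; any tampering changes the hash."""
    return hashlib.sha256(data).hexdigest()

@dataclass
class DatasetRecord:
    dataset_id: str
    origin: str       # who produced it, under what license or consent terms
    sha256: str       # content fingerprint; anchored on-chain in practice

@dataclass
class ModelManifest:
    model_id: str
    training_sets: list  # fingerprints of the datasets used in training

def verify_source(manifest: ModelManifest, record: DatasetRecord, data: bytes) -> bool:
    """Model-to-source check: the manifest references this dataset and
    the dataset bytes still match their recorded fingerprint."""
    return record.sha256 in manifest.training_sets and content_hash(data) == record.sha256

raw = b"example training corpus"
record = DatasetRecord("ds-001", "opt-in contributor pool, CC-BY", content_hash(raw))
manifest = ModelManifest("model-v1", [record.sha256])
print(verify_source(manifest, record, raw))  # True; alter one byte and it fails
```

Only the fingerprints and consent references need to live on the ledger; the data itself can stay off-chain.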

Finally, smart contracts automate payment flows and enforce consent. When a predetermined event occurs and is verified, a smart contract self-executes the steps programmed on the blockchain, without human intervention. In 2023, the Lemonade Foundation implemented a blockchain-based parametric insurance solution for 7,000 Kenyan farmers. The system used smart contracts and weather data oracles to automatically trigger payouts when predefined drought conditions were met, eliminating manual claims processing.
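The underlying pattern is simple enough to sketch. Below is an off-chain Python simulation of that control flow – not Lemonade’s actual contract, and the threshold, oracle reading and payout figures are invented – showing how a payout fires automatically once a verified oracle reading crosses a predefined trigger.

```python
from dataclasses import dataclass

DROUGHT_THRESHOLD_MM = 50.0  # hypothetical seasonal-rainfall trigger

@dataclass
class Policy:
    farmer: str
    payout: float
    paid: bool = False

def settle(policy: Policy, oracle_rainfall_mm: float) -> None:
    """Self-executing rule: if the verified oracle reading is below the
    trigger, release the payout; no claims form, no human adjuster."""
    if not policy.paid and oracle_rainfall_mm < DROUGHT_THRESHOLD_MM:
        policy.paid = True
        print(f"Payout of {policy.payout} released to {policy.farmer}")

policy = Policy(farmer="farm-042", payout=150.0)
settle(policy, oracle_rainfall_mm=32.0)  # drought condition met: automatic payout
```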

This infrastructure flips the dynamic. One option is to use gamified tools to label or create data. Each action is logged immutably. Rewards are traceable. Consent is on-chain. And AI developers receive audit-ready, structured data with clear lineage.
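To make “logged immutably” concrete, here is one possible shape for such a contribution log, sketched in Python with invented field names. Every labeling action carries a contributor, a consent reference and a reward, and each entry commits to the hash of the one before it, so any retroactive edit breaks the chain and surfaces on audit.

```python
import hashlib
import json

def entry_hash(entry: dict, prev_hash: str) -> str:
    """Hash the entry together with its predecessor's hash."""
    payload = json.dumps(entry, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

def append(log: list, entry: dict) -> None:
    """Append-only: each record commits to the previous record's hash."""
    prev = log[-1]["hash"] if log else "genesis"
    log.append({"entry": entry, "prev": prev, "hash": entry_hash(entry, prev)})

def verify(log: list) -> bool:
    """Recompute the whole chain; an edited or reordered entry breaks it."""
    prev = "genesis"
    for rec in log:
        if rec["prev"] != prev or entry_hash(rec["entry"], prev) != rec["hash"]:
            return False
        prev = rec["hash"]
    return True

log = []
append(log, {"contributor": "u-17", "action": "label", "consent_ref": "c-9", "reward": 2})
append(log, {"contributor": "u-08", "action": "label", "consent_ref": "c-4", "reward": 1})
print(verify(log))  # True; mutate any entry and verification fails
```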

Trustworthy AI Needs Trustworthy Data

You can’t audit an AI model if you can’t audit its data.

Calls for “responsible AI” fall flat when built on invisible labor and unverifiable sources. Anthropic’s lawsuits show the real financial risk of poor data hygiene. And public mistrust continues to climb, with surveys showing that users don’t trust AI models trained on personal or opaquely sourced data.

This isn’t just a legal problem anymore; it’s a performance issue. McKinsey has shown that high-integrity datasets significantly reduce hallucinations and improve accuracy across use cases. If we want AI to make critical decisions in finance, health or law, then the training foundation must be unshakeable.

If AI is the engine, data is the fuel. You don’t see people putting garbage fuel in a Ferrari.

The New Data Economy: Why It’s Needed Now

Tokenization grabs headlines, but blockchain can rewire the entire data value chain.

We’re standing at the edge of an economic and societal shift. Companies have spent billions collecting data but barely understand its origins or risks. What we need is a new kind of data economy – one built on consent, compensation and verifiability. 

Here’s what that looks like.

First is consensual collection. Opt-in models like Brave’s privacy-first ad ecosystem show that users will share data when they’re treated with respect and given transparency.

Second is equitable compensation. People should be fairly paid for contributing to AI, whether through the use of their data or their time spent annotating it. That data has inherent value to the companies that take it; extracting it without authorization or compensation is hard to defend ethically.

Finally, there is accountable AI. With full data lineage, organizations can meet compliance requirements, reduce bias and build more accurate models.

Forbes predicts data traceability will become a $10B+ industry by 2027 – and it’s not hard to see why. It’s the only way AI scales ethically.

The next AI arms race won’t be about who has the most GPUs – it’ll be about who has the cleanest data.

Who Will Build the Future?

Compute power and model size will always matter. But the real breakthroughs won’t come from bigger models. They’ll come from better foundations.

If data is, as we are told, the new oil – then we need to stop spilling it, scraping it, and burning it. We need to trace it, value it and invest in its integrity.

Clean data reduces retraining cycles, improves efficiency and even lowers environmental costs. Harvard research shows that energy waste from AI model retraining could rival the emissions of small nations. Blockchain-secured data – verifiable from the start – makes AI leaner, faster and greener.

We can build a future where AI innovators compete not just on speed and scale, but on transparency and fairness.

Blockchain lets us build AI that’s not just powerful, but genuinely ethical. The time to act is now – before another lawsuit, bias scandal or hallucination makes that choice for us.
