illuminAI

On the Importance of Origins

by Jessica Tang

A piece about why we need AI provenance, context attribution, interpretability, and AI auditing.

Collage by Jessica Tang

April 30, 2026

"Where Did This Come From?", in the Age of Synthetic Information

Where were my coffee beans grown? Who harvested them? Were they sourced ethically? To some people, coffee is "just coffee" — until you look closer. The same is true for clothing: where it's made, who made it, and under what conditions can completely change how we feel about the final product, even if the shirt looks identical on the rack. The more you dig, the more you realize origin matters, and that entire communities care deeply about it.

Now, people increasingly "consume" something else: model outputs. When generative models, such as large language models (LLMs), respond to you, you're not getting a single authored answer. You're getting a probabilistic output shaped by training data scraped from across the internet, fine-tuning based on human preferences, your prompt, and the system's underlying objectives. Two models trained on similar data can behave very differently depending on what they were optimized for: helpfulness, safety, engagement, sympathy, or policy compliance. Design decisions leave fingerprints on every answer; they're just invisible.

So: should we care where that output came from? If the model makes a decision, can we trace why? What objectives shaped its framing?

Most users will never ask. But even if they never check the "origin story" of an output, trust often comes from knowing that the option exists: that you could inspect it, audit it, or challenge it if needed. That inspectability (what researchers call AI provenance) is what separates systems that earn legitimacy from systems that merely perform it.

Why Provenance Matters

When AI shapes decisions in healthcare, hiring, or finance, opacity becomes a structural problem. The argument for provenance runs along four lines:

Trust depends on legibility. People accept systems they can, in principle, understand, not those they can't interrogate.

Accountability requires traceability. Provenance enables us to ask not just what a model produced, but why, and whether it reflects the values its builders claimed to embed.

Governance needs audit infrastructure. At scale, detecting failures, investigating harms, and verifying compliance all require the ability to trace behaviour back to its sources.

Copyright and ownership require attribution. Generative systems are trained on human-created work. If outputs are shaped by that creativity, the question of attribution and compensation reaches beyond philosophy and into law.

Research Pushing This Forward

Researchers are working to make AI provenance more tangible.

One line of work examines training data influence: which examples most shaped a model's behaviour? Techniques like influence functions (adapted for neural networks by Koh & Liang in 2017) try to trace model predictions back to specific training examples [1]. More recent work scales this to larger models [2, 3]. In parallel, differential privacy research places formal limits on what models can reveal about their training data, balancing traceability with protection.
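To make the influence-function idea concrete, here is a minimal sketch on a toy problem. It assumes a one-parameter linear model with squared loss, and every name and number in it is hypothetical; it is not the setup used in any of the cited papers. It estimates how a test loss would change if each training point were removed, using only gradients and the Hessian of the training loss, and then compares that estimate against actually refitting without each point.

```python
# Toy illustration only: a one-parameter model y = theta * x with squared loss.
# All data and function names are hypothetical.

def fit_slope(xs, ys):
    """Closed-form least-squares slope for y = theta * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def influence_removal_estimates(xs, ys, x_test, y_test):
    """First-order estimate of how the test loss changes if each training
    point is removed, computed without any retraining."""
    n = len(xs)
    theta = fit_slope(xs, ys)
    # Second derivative (Hessian) of the mean training loss 0.5 * (theta*x - y)^2.
    hessian = sum(x * x for x in xs) / n
    # Gradient of the test loss with respect to theta.
    grad_test = (theta * x_test - y_test) * x_test
    estimates = []
    for x, y in zip(xs, ys):
        grad_train = (theta * x - y) * x
        # Upweighting a point by eps changes the test loss by roughly
        # -grad_test * grad_train / hessian * eps; removal is eps = -1/n.
        estimates.append(grad_test * grad_train / (hessian * n))
    return estimates


def leave_one_out_effects(xs, ys, x_test, y_test):
    """Ground truth for comparison: refit without each point in turn."""
    def test_loss(theta):
        return 0.5 * (theta * x_test - y_test) ** 2

    base = test_loss(fit_slope(xs, ys))
    effects = []
    for i in range(len(xs)):
        theta_i = fit_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        effects.append(test_loss(theta_i) - base)
    return effects


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 8.0]  # the last point is a deliberate outlier
estimated = influence_removal_estimates(xs, ys, x_test=2.0, y_test=2.0)
actual = leave_one_out_effects(xs, ys, x_test=2.0, y_test=2.0)
for i, (e, a) in enumerate(zip(estimated, actual)):
    print(f"train point {i}: estimated {e:+.3f}, retrained {a:+.3f}")
```

With these made-up numbers, both the estimate and the retraining check single out the outlier as the training point whose removal would most reduce the test loss. The appeal of the estimate is that it avoids one retraining run per training example, which is the part that becomes infeasible for large models.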

A second problem is context attribution: given a long prompt or conversation history, which parts actually drove the response? Cohen-Wang et al. (2024) formalized this question with ContextCite, a scalable method for pinpointing which parts of a context caused a model to generate a particular statement [4].
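The core move behind context attribution is ablation: remove part of the context and see how much less likely the response becomes. The sketch below is a deliberately simplified version of that idea. ContextCite itself fits a sparse linear surrogate over many random ablations; this example swaps in plain leave-one-out ablation, and replaces the language model with a hypothetical word-overlap scorer (`toy_response_score`) so that it runs without any model at all.

```python
# Simplified illustration of ablation-based context attribution.
# `toy_response_score` is a hypothetical stand-in for "how likely is this
# response given this context"; a real system would query a language model.

def toy_response_score(kept_sources, response):
    """Fraction of response words that appear somewhere in the kept sources."""
    context_words = {word for source in kept_sources for word in source.lower().split()}
    response_words = response.lower().split()
    return sum(word in context_words for word in response_words) / len(response_words)


def leave_one_out_attribution(sources, response, score_fn):
    """Score each source by how much the response score drops without it."""
    full_score = score_fn(sources, response)
    attributions = []
    for i in range(len(sources)):
        ablated = sources[:i] + sources[i + 1:]
        attributions.append(full_score - score_fn(ablated, response))
    return attributions


sources = [
    "The beans were grown in Huila Colombia",
    "The cafe opens at nine",
    "Harvest took place in May",
]
response = "grown in Colombia"
for source, score in zip(sources, leave_one_out_attribution(sources, response, toy_response_score)):
    print(f"{score:.2f}  {source}")
```

One limitation is visible even in this toy: the word "in" appears in two sources, so removing either one alone does not change its contribution. Leave-one-out ablation under-credits redundant evidence, which is one reason to ablate many random subsets instead of a single source at a time.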

Fairness research has been asking related questions for longer. Buolamwini and Gebru's Gender Shades (2018) showed precisely how training data composition produces unequal outcomes, not as an accident but as a direct consequence of optimization choices [5]. The tools developed in that work are foundational to provenance research today.
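The basic audit move in this line of work is to report accuracy per subgroup instead of a single aggregate number. The sketch below shows that disaggregation on invented records; the group labels, predictions, and function names are all hypothetical and do not come from the Gender Shades data.

```python
from collections import defaultdict

# Hypothetical audit records: (subgroup, true label, predicted label).
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 1), ("group_b", 0, 1),
]


def overall_accuracy(rows):
    """Share of all records where the prediction matches the true label."""
    return sum(true == pred for _, true, pred in rows) / len(rows)


def accuracy_by_group(rows):
    """Same metric, computed separately for each subgroup."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, true, pred in rows:
        totals[group] += 1
        correct[group] += int(true == pred)
    return {group: correct[group] / totals[group] for group in totals}


print("overall:", overall_accuracy(records))
for group, accuracy in accuracy_by_group(records).items():
    print(group, accuracy)
```

In this invented example the aggregate figure sits between a subgroup the classifier handles well and one it handles poorly, which is the kind of gap a single headline accuracy number conceals.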

The common thread across all of this is that AI systems and their outputs come from somewhere, are built by someone, and are optimized for something. If AI is going to be embedded everywhere, then the question "where did this come from?" isn't optional. It's the foundation of accountability.

References

  1. Koh & Liang (2017), "Understanding Black-box Predictions via Influence Functions," ICML.

  2. Park et al. (2023), "TRAK: Attributing Model Behavior at Scale," ICML.

  3. Chang et al. (2025), "Scalable Influence and Fact Tracing for Large Language Model Pretraining," ICLR.

  4. Cohen-Wang, Shah, Georgiev, & Madry (2024), "ContextCite: Attributing Model Generation to Context," NeurIPS.

  5. Buolamwini & Gebru (2018), "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," FAccT.
