Alexy interviews Slava Tykhonov at CODATA

Alexy and Slava in Amsterdam

The Semantic Layer for Reliable AI

Interview recorded in Paris, France, at the International Science Council and CODATA office.

Alexy Khrabrov speaks with Slava Tykhonov, Head of AI at CODATA, about the semantic and trust infrastructure needed for reliable AI. The conversation covers CODATA's global policy role, Croissant, CDIF, DIDs, verifiable credentials, ODRL, provenance, deterministic question-answer pairs, knowledge graphs, drift, multilingual vocabularies, open compute, and enterprise AI governance. All of these are also represented in the QueryGraph.ai OSS AI community project.

Alexy and Slava in Paris

Transcript

Alexy: Hello everybody, I'm Alexy Khrabrov, founder of the Community Research Center for Reliable AI at Northeastern University and head of community at Lakesail. I am here in Paris at the International Science Council and CODATA office with Slava Tykhonov, Head of AI at CODATA. Welcome, Slava. To start, please explain what the International Science Council and CODATA are.

Slava: CODATA is the Committee on Data of the International Science Council. It is a non-profit organization that advises political bodies, governments, presidents, and organizations such as the World Health Organization and the United Nations on policy for data, artificial intelligence, and related areas. This work is supported by the government of France.

Alexy: You are based here in France, but CODATA operates globally, across Europe and around the world.

Slava: That's right. We are based in Paris, but our activities are global. They include Latin America, Australia, Africa, and other regions. We advise on global data and AI policy.

Alexy: Before CODATA, you were a researcher at the Royal Netherlands Academy of Arts and Sciences (KNAW), and you were one of the authors of Dataverse, Croissant, Semantic Croissant, and CDIF. We will discuss all of this as a semantic layer for AI, or as you call it, navigation for AI.

Slava: It is a little more complicated than a single layer. First we created Croissant, a navigation layer for machine learning that was initiated by Google. Now we are building an extension that includes the Cross-Domain Interoperability Framework, or CDIF, which we use for semantics.

Slava: You can think of Croissant as a navigation device in a car. If you want to drive somewhere, you select the city, then the street name, then the building number. We do the same for AI, so the AI knows where to find resources. Once it arrives, it also needs to know what is available around that resource.
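To make the navigation analogy concrete, here is a minimal Python sketch. The dictionary is only loosely modeled on the dataset, record set, and field nesting that Croissant metadata uses. The dataset name, field names, file name, and the `resolve` helper are hypothetical and are not part of any Croissant library.

```python
# Hypothetical, simplified metadata loosely modeled on Croissant's
# dataset -> record set -> field nesting. This is not a real Croissant file.
CATALOG = {
    "weather-stations": {                      # the "city": a dataset
        "observations": {                      # the "street": a record set
            # the "building numbers": individual fields
            "air_temperature": {"dataType": "float", "source": "obs.csv"},
            "station_id": {"dataType": "string", "source": "obs.csv"},
        }
    }
}


def resolve(path: str) -> dict:
    """Walk a 'dataset/record_set/field' path and return the field entry."""
    node = CATALOG
    for part in path.split("/"):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(f"cannot resolve {part!r} in {path!r}")
        node = node[part]
    return node


field = resolve("weather-stations/observations/air_temperature")
print(field["dataType"], field["source"])
```

An agent that is handed only the path can locate the field and read its declared type without guessing where the data lives.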

Slava: CDIF describes those relationships. It defines variables in detail, including units of measure, properties, hierarchies, and related information. On top of that, we add actionable policies. We use the W3C standard ODRL, the Open Digital Rights Language, and combine it with decentralized identifiers, DIDs, and verifiable credentials. That lets us see where every piece of information comes from. We have provenance, logs, and transparency.

Slava: With that infrastructure, we can build responsible AI applications for research and for many other AI-related activities.

Alexy: Let me unpack that. A lot of people working on agentic AI focus on agent interoperability, MCP, tool calling, and the functional side of agents: where agents go and what they do. They often take for granted that the data is already there and that the data is good.

Alexy: But data quality is the elephant in the room. If data has questionable provenance, and agents retrieve it without checking, they will produce garbage. Croissant, CDIF, DIDs, and ODRL come together to annotate datasets so we know what every column means. CDIF can say whether a number is Celsius or Kelvin, who produced it, and whether it can be trusted. If it is financial data, it should come from a trusted exchange before an agent acts on it.
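The Celsius-or-Kelvin point can be illustrated with a small Python sketch. It assumes a simplified variable description inspired by the kind of information CDIF carries (unit, producer, trust). The `VariableDefinition` class, its fields, and the station name are hypothetical and do not reproduce the CDIF schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableDefinition:
    """Hypothetical, CDIF-inspired description of one column."""
    name: str
    unit: str          # "celsius" or "kelvin" in this sketch
    producer: str      # who produced the values
    trusted: bool      # whether the producer is on a trusted list


def to_kelvin(value: float, definition: VariableDefinition) -> float:
    """Convert a reading to kelvin using the declared unit, never a guess."""
    if not definition.trusted:
        raise PermissionError(f"untrusted producer: {definition.producer}")
    if definition.unit == "kelvin":
        return value
    if definition.unit == "celsius":
        return value + 273.15
    raise ValueError(f"unknown unit: {definition.unit}")


temp = VariableDefinition("air_temperature", "celsius", "station-42", True)
print(to_kelvin(21.5, temp))
```

The agent refuses to act on data from an untrusted producer and fails loudly on an undeclared unit instead of silently treating the number as Kelvin.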

Alexy: That is why this navigation layer matters. Agents can find the right data only if the data is properly described and trustworthy. DID and ODRL are established W3C standards. DID is implemented in Microsoft Azure and in Hyperledger-compatible systems. Croissant is being developed by you and others. So this is not a white paper; this is a working set of technologies.

Slava: Exactly. As everyone moves forward with agentic AI, we are trying to align our infrastructure with what is happening in Silicon Valley and elsewhere. People are building applications with cloud services, coding agents, tools, and APIs, but they often do not have Responsible AI infrastructure. The result is not reproducible. You may not know whether your application is based on wrong code or wrong data.

Slava: We are trying to bring in a layer of trust. We do this by working with the people responsible for standards and by defining every variable in a way that is available to every model, not only to Google, Anthropic, or another single provider. It should be a distributed layer of trust. A model can connect to that global infrastructure, retrieve the definition of a variable, and reduce randomness.

Alexy: So this is also a way to reduce non-determinism. One thing that impressed me is that you can wrap a prompt in a DID. A DID is a global identifier that can carry a payload, so you can attach prompts and skills. You can encrypt them, and middleware such as Llama can decrypt and execute them when it is allowed to do so by ODRL policy. Because the DID is a global URI, the middleware can find it. That gives you a way to impose determinism. How does that work?

Slava: We are introducing ground truths by freezing vectors, or tensors, inside LLM systems. A question-answer pair receives a digital signature and is signed by a human owner. You know who is responsible for it.

Slava: Once you have that key-value pair, it becomes deterministic. When you ask the same question, you get the same answer. That means you do not always need to run the full computation again. If you are using a local inference engine such as Llama, the model may run at 50 tokens per second because it has to compute the response. We precompute the relevant vector. This is not a cache in the traditional sense; it is a signed, reusable tensor with a similarity threshold.

Slava: If a new question is 95 percent similar to a question already answered, you can use the precomputed vector and compute only the missing part. That gives a significant speedup. Instead of waiting ten seconds for an answer, you may wait one second. Once a human agrees to sign the result, that knowledge can also be distributed. The tensor may be around 50 kilobytes. You can package it, send it to another inference engine, ask the same question, and get the same answer at the same speed.
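A rough Python sketch of the signed question-answer idea follows. It rests on several assumptions that are not from the interview: a bag-of-words vector stands in for the precomputed tensor, an HMAC with a shared key stands in for the human owner's digital signature, and the sketch only reuses a stored answer. It does not attempt the "compute only the missing part" step. All class and function names are hypothetical.

```python
import hashlib
import hmac
import math
import re
from collections import Counter

SIMILARITY_THRESHOLD = 0.95  # the 95 percent figure from the conversation


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would reuse model tensors."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def sign(question: str, answer: str, key: bytes) -> str:
    """HMAC stands in for the owner's digital signature in this sketch."""
    message = f"{question}\n{answer}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class SignedAnswerStore:
    def __init__(self, key: bytes):
        self._key = key
        self._records = []  # tuples of (question, answer, owner, signature)

    def add(self, question: str, answer: str, owner: str) -> None:
        signature = sign(question, answer, self._key)
        self._records.append((question, answer, owner, signature))

    def lookup(self, query: str):
        """Return (answer, owner) for the closest verified match, or None."""
        query_vec = embed(query)
        best, best_score = None, SIMILARITY_THRESHOLD
        for question, answer, owner, signature in self._records:
            score = cosine(query_vec, embed(question))
            expected = sign(question, answer, self._key)
            # Reuse an answer only if it is similar enough AND still verifies.
            if score >= best_score and hmac.compare_digest(signature, expected):
                best, best_score = (answer, owner), score
        return best


store = SignedAnswerStore(key=b"demo-key")
store.add("What is the boiling point of water at sea level",
          "100 degrees Celsius", "alice")
print(store.lookup("what is the boiling point of water at sea level?"))
print(store.lookup("Who wrote the Principia"))
```

The same question always returns the same signed answer together with the person responsible for it, while a question below the threshold falls through to normal inference.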

Slava: In effect, we convert a non-deterministic answering machine into a deterministic one. The function is predefined, and it comes from a source you can trust.

Alexy: That is exactly what we need for AI right now. LLMs are non-deterministic, and we cannot rely on results that cannot be reproduced. Reproducibility is the hallmark of science. Non-reproducible things are not science. You are bringing this back into science, which is also CODATA's mission.

Alexy: This is useful for business as well. I want to emphasize that this can happen on the CPU. The precomputed values are obtained from middleware, and you have demonstrated very high token throughput because you avoid GPU computation whenever possible. That means enormous compute savings.

Slava: Anything that must be reproduced can be reproduced from the CPU if it was computed once on a GPU. That matters because compute is scarce, expensive, and energy-intensive.

Slava: We now produce CDIF variables once, add them to a registry, and distribute them without running the full computation again. We are moving from expensive GPU computation to a deterministic model where people can download updates and run them on their own computers. Because the knowledge is precomputed, they can use small models, even on mobile phones. AI becomes more democratic: researchers and individuals can install it and run it without needing powerful infrastructure.

Slava: The same applies to agents. Once we have distributed ground truth and complete provenance, we know who is responsible for each piece of information. We can connect agents, establish reliable baselines, and test what agents produce when they work together. We can put DIDs on agents, issue verifiable credentials, and see what each agent contributes.

Slava: If something goes wrong, you can inspect the logbook and find which agent received insufficient data, wrong information, or information from an untrusted source. You do not simply "fire" a bad agent; you can archive it, inspect its accumulated knowledge, validate what is still correct, and repackage the useful part into a new agent.
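Inspecting the logbook could look like the Python sketch below. It assumes each log entry records which agent consumed which source and whether that source was trusted. The `LogEntry` fields, the `did:example:` identifiers, and the source names are hypothetical placeholders, not a format defined by any of the standards mentioned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    agent_did: str        # hypothetical identifier using the did:example method
    source: str           # where the agent's input came from
    source_trusted: bool  # whether that source carried a valid credential


def agents_with_untrusted_input(logbook: list[LogEntry]) -> dict[str, list[str]]:
    """Map each agent to the untrusted sources it consumed."""
    findings: dict[str, list[str]] = {}
    for entry in logbook:
        if not entry.source_trusted:
            findings.setdefault(entry.agent_did, []).append(entry.source)
    return findings


logbook = [
    LogEntry("did:example:agent-ner", "registry/prices", True),
    LogEntry("did:example:agent-linker", "scraped/forum-post", False),
    LogEntry("did:example:agent-linker", "registry/entities", True),
]
print(agents_with_untrusted_input(logbook))
```

Only the linking agent is flagged, and only for one input, so the rest of its accumulated knowledge can be kept and revalidated.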

Alexy: This is Europe, so we cannot just fire agents. We have to upskill them and give them a new job.

Alexy: I also want to talk about drift. If you reduce non-determinism, then when something changes you can measure the difference. A question-answer pair can be timestamped and versioned. You can know which version of the agent, the LLM, and the infrastructure produced it, and you can know the identity of the human behind it. That gives you provenance for the toolchain that produced an answer. How do you detect drift when an LLM changes?

Slava: Drift is a well-known problem in knowledge engineering and knowledge organization. Knowledge is not static; it moves through time and space.

Slava: For example, if you want to understand how people thought about nuclear physics 100 years ago, you can capture knowledge from Einstein's books, package it, place it in an LLM context, and digitally sign it. The system knows that this is historical knowledge. You can add timestamps from different books and see how knowledge developed over time. That gives us a unique opportunity to track the delta: how knowledge changes and how we can reuse it.

Alexy: Suppose we start with Newton's Principia Mathematica. You scan it, put it into the system, and then compare it to Einstein. They have different views of physics. How do agents compare both answers?

Slava: This is where AI agents do the work. One agent may be responsible for named entity recognition: identifying people, dates, organizations, and other entities. That becomes the foundation of a knowledge graph. Another agent links that information to available knowledge in a knowledge base.

Slava: This is different from the usual LLM approach, where you capture all knowledge, put it into one large training set, and build a new model. Instead, we put an identifier on every claim. You can trace the source, the creator, and the context. The knowledge graph sits next to the LLM and supports reasoning.

Alexy: This matches an architecture that knowledge graph researchers have advocated for years: use an LLM to extract or phrase claims, but put the claims in a knowledge graph where they can be referenced, compared, and reasoned over. For example, under the concept of force, you could store what Newton said and what Einstein said, with references to the original sources.
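The "identifier on every claim" idea can be sketched in Python as follows. The `Claim` and `ClaimGraph` classes and the claim identifiers are hypothetical, the statements are paraphrased summaries written for illustration rather than quotations, and a production system would use a real graph store queried with SPARQL or Cypher instead of an in-memory dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    claim_id: str   # hypothetical identifier attached to the claim
    concept: str    # the concept the claim is filed under
    statement: str  # paraphrased summary, for illustration only
    creator: str
    source: str


class ClaimGraph:
    """Tiny in-memory stand-in for a knowledge graph of sourced claims."""

    def __init__(self):
        self._by_concept: dict[str, list[Claim]] = {}

    def add(self, claim: Claim) -> None:
        self._by_concept.setdefault(claim.concept, []).append(claim)

    def claims_about(self, concept: str) -> list[Claim]:
        return list(self._by_concept.get(concept, []))


graph = ClaimGraph()
graph.add(Claim("claim:newton-1", "force",
                "A change in motion is proportional to the force applied.",
                "Isaac Newton", "Principia"))
graph.add(Claim("claim:einstein-1", "force",
                "Gravitation is described through spacetime geometry.",
                "Albert Einstein", "General theory of relativity"))

for claim in graph.claims_about("force"):
    print(claim.claim_id, "|", claim.creator, "|", claim.source)
```

Both views sit under the same concept, each traceable to its creator and source, so an agent can compare them instead of averaging them into one answer.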

Alexy: How do you decide whether to create a different node or compare two claims under the same concept?

Slava: This is where the archive is useful. A knowledge graph is a digital archive. Instead of packaging everything inside a model, you can query the graph with languages such as SPARQL or Cypher. An agent can recognize the topic, identify the relevant entities, query the graph, and retrieve what is available.

Slava: Some information may contradict other information. The system extracts facts and uses the LLM to help distinguish relevant from irrelevant claims, while the knowledge graph provides evidence and provenance. That is more powerful than constantly training new models. Instead of packaging the entire function into one large file, you connect to a distributed system that is alive. New knowledge is captured, processed, stored, and made available continuously.

Slava: Knowledge also moves across geography. The same fact may be understood differently in the United States, China, or Europe. With this infrastructure, you can ask how a concept is understood in different regions and compare the results. If you ask a current LLM, you usually get an approximation, and in politically sensitive areas the answer may be filtered by the model's local context or policy environment.

Slava: LLMs are not versioned in the way we need. You cannot easily ask, "What would you have answered yesterday?" or "What would you have answered before this event happened?" Knowledge graphs can support this, but even there temporal versioning must be designed carefully. You need timestamps on updates and a way to represent multiple values over time.
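A minimal Python sketch of "timestamps on updates and multiple values over time" is shown below. The `TemporalValue` class is a hypothetical helper, not part of any knowledge graph product. It keeps every recorded value and answers "what was true as of this date?" with a binary search.

```python
import bisect
from datetime import datetime, timezone


class TemporalValue:
    """Keeps every value ever recorded for one fact, ordered by timestamp."""

    def __init__(self):
        self._times: list[datetime] = []
        self._values: list[str] = []

    def record(self, when: datetime, value: str) -> None:
        """Insert a value at its timestamp, keeping both lists sorted."""
        index = bisect.bisect_right(self._times, when)
        self._times.insert(index, when)
        self._values.insert(index, value)

    def as_of(self, when: datetime):
        """Return the value that was current at `when`, or None if none yet."""
        index = bisect.bisect_right(self._times, when)
        return self._values[index - 1] if index else None


pluto = TemporalValue()
pluto.record(datetime(2006, 8, 24, tzinfo=timezone.utc), "dwarf planet")
pluto.record(datetime(1930, 2, 18, tzinfo=timezone.utc), "planet")

print(pluto.as_of(datetime(2000, 1, 1, tzinfo=timezone.utc)))
print(pluto.as_of(datetime(2010, 1, 1, tzinfo=timezone.utc)))
```

Nothing is overwritten, so the question "what would you have answered before this event happened?" has a well-defined answer.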

Slava: That is another reason we need a separate semantic layer. It helps with caching and performance, but it also supports archival use cases and drift detection. QueryGraph.ai brings these technologies together. It is built, open, containerized, and ready to use.

Alexy: We are speaking right after GOSIM AI Paris, the global open source AI event held at Station F. We saw presentations from leading companies and model makers, and we discussed the semantic layer with many of them. The response was very positive. Can you talk about open compute and the Academy of Artificial Intelligence, and the synergy between open compute and the semantic layer?

Slava: I was impressed by the quality of the presentations at GOSIM AI Paris. I was also impressed by what China is doing to support open source and open knowledge. I learned about a project called OpenC², which is building an operating system where knowledge is free and distributed.

Slava: It supports many kinds of chips, not only Nvidia. You can run software and models on Chinese chips, Korean chips, and many other architectures. I believe they already support around 30 chips. That also means you can connect your own laptop, whether it is Dell, HP, Apple, or something else.

Slava: You can run the operating system, join shared resources, and create a large computational resource from machines that people already own. Shared memory and shared storage can be used for serious work without creating a new data center. If a research institution wants significant compute, it can combine resources inside the institution.

Slava: That is open compute for open science and open source AI. The semantic layer is extremely useful in that environment because people must be able to find the datasets they need and trust them. Scientists are often resource-constrained, and they cannot always pay for large data centers.

Slava: Multilingual support is also important. If you have AI you can trust, you can translate the same terms and vocabularies into different languages while preserving the same definitions. Temperature can be defined consistently in Chinese, French, and other languages, with the same units of measure and constraints. If something goes wrong, a digital policy can raise an alert when a value falls outside its expected range. With open compute deployed around the world, we can send signals through a global collaborative infrastructure.
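The multilingual vocabulary with a range alert can be sketched in Python like this. The `Term` class, the concept identifier, and the numeric range are hypothetical values chosen for illustration. Only the labels change per language, while the unit and constraint are shared.

```python
from dataclasses import dataclass


@dataclass
class Term:
    """One concept with a shared definition and per-language labels."""
    concept_id: str                    # hypothetical identifier
    unit: str
    valid_range: tuple[float, float]   # illustrative bounds, not a standard
    labels: dict[str, str]

    def label(self, language: str) -> str:
        """Return the label for a language, falling back to English."""
        return self.labels.get(language, self.labels["en"])

    def check(self, value: float, language: str = "en"):
        """Return an alert message if the value is out of range, else None."""
        low, high = self.valid_range
        if low <= value <= high:
            return None
        return f"ALERT {self.label(language)}: {value} {self.unit} outside [{low}, {high}]"


air_temperature = Term(
    concept_id="vocab:air-temperature",
    unit="degC",
    valid_range=(-90.0, 60.0),
    labels={"en": "temperature", "fr": "température", "zh": "温度"},
)

print(air_temperature.label("fr"))
print(air_temperature.check(21.0))
print(air_temperature.check(75.0, language="zh"))
```

The alert text is localized, but the rule that triggers it is the same definition everywhere.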

Alexy: Finally, this same setup can be deployed inside a company. A corporation has data sources across departments. It needs the data to be in a trusted format that agents can understand, and it needs efficient computation. Everything we described can become an enterprise data platform for AI.

Alexy: AI navigation can happen inside the company. ODRL can define precisely which employee or department has access to which data. The data in this context contains the business knowledge of the company: prompts, skills, processes, trade secrets, and the mechanisms by which the company makes money and innovates.
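Department-level access control might be sketched in Python as below. The policy dictionary follows the general shape of an ODRL policy in JSON-LD (`permission` and `prohibition` rules with `target`, `assignee`, and `action`), but the `example.com` identifiers are hypothetical, and `is_permitted` is a deliberately naive evaluator that ignores constraints, duties, and the rest of the ODRL model.

```python
# Hypothetical policy in the general shape of ODRL JSON-LD.
POLICY = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Set",
    "uid": "https://example.com/policy/finance-1",
    "permission": [
        {
            "target": "https://example.com/data/sales-forecast",
            "assignee": "https://example.com/dept/finance",
            "action": "read",
        }
    ],
    "prohibition": [
        {
            "target": "https://example.com/data/sales-forecast",
            "assignee": "https://example.com/dept/finance",
            "action": "distribute",
        }
    ],
}


def is_permitted(policy: dict, assignee: str, action: str, target: str) -> bool:
    """Naive check: prohibitions win, and anything not permitted is denied."""

    def matches(rule: dict) -> bool:
        return (
            rule.get("assignee") == assignee
            and rule.get("action") == action
            and rule.get("target") == target
        )

    if any(matches(rule) for rule in policy.get("prohibition", [])):
        return False
    return any(matches(rule) for rule in policy.get("permission", []))


finance = "https://example.com/dept/finance"
forecast = "https://example.com/data/sales-forecast"
print(is_permitted(POLICY, finance, "read", forecast))
print(is_permitted(POLICY, finance, "distribute", forecast))
print(is_permitted(POLICY, "https://example.com/dept/marketing", "read", forecast))
```

Denying by default means a department that is not named in the policy gets nothing, which is the behavior a company usually wants for trade secrets.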

Alexy: In a business context, tight control matters. DIDs let you trace what happened. If performance drops, profits drop, or there is a privacy or PII breach, you can track which agent accessed which data, which prompt was used, and which answer was received. Drift detection can help a business understand why a model is no longer performing as before, for example because an underlying LLM changed. All of this can run with open source LLMs on premises.

Slava: Exactly. This can run inside a data center, even without internet access. For factory automation or other sensitive environments, if something goes wrong because a new open source model was used, the company can trace what changed and respond with evidence instead of guesswork.
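Tracing what changed after a model swap can be sketched in Python as follows. It assumes every answer was stored together with the version of the model that produced it. The `AnswerRecord` class, the version labels, and the sample questions are hypothetical, and real drift detection would likely compare answers semantically instead of by exact string equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerRecord:
    question: str
    answer: str
    model_version: str  # hypothetical version label


def detect_drift(records: list[AnswerRecord], baseline: str, current: str) -> dict:
    """Return {question: (old_answer, new_answer)} for answers that changed."""
    old = {r.question: r.answer for r in records if r.model_version == baseline}
    new = {r.question: r.answer for r in records if r.model_version == current}
    return {
        question: (old[question], new[question])
        for question in old
        if question in new and old[question] != new[question]
    }


records = [
    AnswerRecord("max conveyor speed?", "1.2 m/s", "model-v1"),
    AnswerRecord("max conveyor speed?", "2.0 m/s", "model-v2"),
    AnswerRecord("shutdown temperature?", "85 degC", "model-v1"),
    AnswerRecord("shutdown temperature?", "85 degC", "model-v2"),
]
print(detect_drift(records, "model-v1", "model-v2"))
```

The report names the exact question whose answer moved, which is the evidence an operator needs before deciding whether to roll the model back.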