
In the previous article (on the trusted-query problem for enterprise agents), we discussed why trusted querying matters for enterprise agents.

Currently popular agents such as OpenClaw and Hermes are better suited to being personal assistants. Their memory mechanisms are imprecise and fuzzy. They show some memory effect in use, but even for personal office work they cannot guarantee accurate information. For example, if a user asks for a growth rate, is the growth rate the agent uses the one the user has in mind? Who defined the calculation method? When was it written into memory? Is the record accurate? Does it expire? Does it override an earlier definition of growth rate? These problems are hard to mitigate even when the agent is combined with a knowledge platform such as Feishu.

Related questions include:

Which data should AI trust? What does the data mean? Are indicator definitions consistent? Are queries controlled? Can answers be explained, traced, and audited?

If these problems are not solved, upper-layer agent applications easily end up looking smart while being unreliable. Agents used in enterprise production, such as Ask-Data, RAG, and algorithmic decision-making, can only obtain trusted data and act on it reliably if they are built on the enterprise's trusted query layer (trusted data infrastructure).

Based on our understanding of this problem, we have worked on both a framework and practical implementations, including trusted Ask-Data.

Over the past two months, we have built an Ask-Data system on top of a trusted semantic layer. The semantic layer uses the Cube framework. Cube (Cube Core) provides only the headless core, so we built the front-end feature mapping, management configuration, and query system ourselves. The process consumed more than 10 billion tokens.

Note, however, that Cube solves the query problem at the semantic layer. It targets structured data and can answer queries under unified definitions, but it cannot fully solve the trustworthiness of the data itself. In addition, intelligent applications outside the Ask-Data scenario that need trusted queries also need a trustworthy foundation. This foundation should be shared across different agents. For an enterprise, it can be understood as a digital mapping of the enterprise's physical world, as in Palantir's system.

The now-common combination of ontology and knowledge graph also addresses this problem. There are many implementations, each with its own merits. Below we use DataHub, combined with Cube, as an example of how to think about enterprise trusted data infrastructure and what the process looks like.

First, let's look at how DataHub and Cube are each positioned and what problems they solve.

DataHub solves: how to capture and govern trusted data context.

It focuses on:

  • What data assets the enterprise has;
  • What these data assets mean;
  • Who is responsible for the data;
  • Where data comes from and where it flows;
  • What the lineage is;
  • Whether there are quality issues;
  • Which tables are authoritative;
  • Which fields are sensitive;
  • Which business terms are officially defined.

So DataHub is more like an enterprise's data asset map and trusted context layer.

Cube solves: how to query data according to unified definitions.

It focuses on:

  • How to define indicators;
  • How to define dimensions;
  • How to join between tables;
  • What data users are allowed to query;
  • How queries are exposed through the API;
  • How to speed up high-frequency queries;
  • How an LLM or application can securely access metrics.

So Cube is more like the semantic query layer for Ask-Data.

In one sentence:

DataHub is responsible for "what to trust" and Cube is responsible for "how to query".

Why can't Ask-Data just rely on an LLM to query the database directly?

Many Ask-Data systems start with a seemingly simple path:

User's natural-language question → LLM + retrieved context generates SQL → query database → return results

This approach makes for a quick demo, but many problems appear once it meets the real, complex environment of an enterprise.

For example:

  • Different departments may define the same "sales" indicator differently;
  • An order table may contain a payment amount, an order amount, a refund amount, and a discount amount;
  • When a user asks about "East China", it may mean the sales region, the customer's registered location, or the store location;
  • Some fields exist but are obsolete;
  • Some data can be queried, but the current user has no permission to see it;
  • Some tables are named like official tables but are actually temporary or intermediate tables.

In this situation, an LLM that faces the database directly can easily:

  • Select the wrong table;
  • Use the wrong fields;
  • Join along the wrong path;
  • Miscalculate indicators;
  • Bypass permissions;
  • Return results that cannot be explained.

Can skills and similar techniques let the Ask-Data system evolve continuously through user interaction? Skills are just workflows. They can capture common analysis steps and make processes and tool-calling conventions explicit, but they cannot replace strongly governed objects such as indicator definitions, business terms, permissions, lineage, and quality status. In other words, skills can serve as upper-layer orchestration assets, but they still depend on a trusted data layer that can be accurately represented, audited, and governed.

Therefore, the focus of enterprise-level Ask-Data is not “making the model better at writing SQL”, but rather:

Allow models to work only within the confines of trusted, governed, interpretable data semantics.

This is the value of the DataHub + Cube combination.

What role does DataHub play in Ask-Data?

DataHub can be understood as the "trusted context center" for enterprise data assets.

It is not used to calculate indicators directly, but to help intelligent systems understand:

What data is there in the enterprise, which data is trustworthy, and what do these data mean?

In Ask-Data, DataHub can provide several types of key context.

Data asset context

For example, there are many tables in an enterprise:

ods_order
dwd_order_detail
ads_sales_summary
tmp_order_2024
finance_revenue_monthly

When a business user asks "sales", the system cannot just pick a table.

DataHub can help identify:

  • Which tables are official production tables;
  • Which tables belong to the sales domain;
  • Which tables are obsolete;
  • Which tables are authoritative tables used by downstream reports;
  • Which tables have owners and descriptions.

This lets intelligent systems know which data to trust first.

Business terminology context

The most difficult thing about Ask-Data is not field mapping, but business term mapping.

For example:

  • GMV;
  • Net income;
  • Active users;
  • Repurchase rate;
  • High-value customers;
  • Valid orders;
  • Fulfilled orders.

These words are natural to business, but not to databases.

DataHub's Business Glossary can capture these terms and bind them to the corresponding data assets, fields, indicators, and owners. If an enterprise synchronizes Cube indicators, BI indicators, or custom Metric entities into DataHub, business terms can be further linked to specific indicator objects.

In this way, when the user asks "repurchase rate of high-value customers", the system can first know:

  • What a high-value customer is;
  • What repurchase rate is;
  • Which business domain these concepts belong to;
  • Which data assets correspond to them;
  • Whether there is an official definition.
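The lookup described above can be sketched in a few lines. This is a minimal illustration, not DataHub's API: the `GLOSSARY` dictionary stands in for terms exported from a business glossary, and the asset names, domains, and placeholder definitions in it are assumptions made up for the example.

```python
# Hypothetical glossary export. In practice these entries would come from
# a metadata platform such as DataHub, not from a hard-coded dictionary.
GLOSSARY = {
    "repurchase rate": {
        "definition": "customers with 2+ purchases / customers with 1+ purchase",
        "domain": "customer",
        "assets": ["order_fact", "customer_dim"],
        "official": True,
    },
    "high-value customer": {
        "definition": "placeholder: segment defined by the customer team",
        "domain": "customer",
        "assets": ["customer_dim"],
        "official": True,
    },
    "active user": {
        "definition": "placeholder: draft definition, not yet approved",
        "domain": "growth",
        "assets": ["user_activity"],
        "official": False,
    },
}


def resolve_terms(question: str, glossary: dict) -> dict:
    """Return the glossary entries whose term appears in the question."""
    text = question.lower()
    return {term: entry for term, entry in glossary.items() if term in text}


def unofficial_terms(matches: dict) -> list:
    """List matched terms that do not have an official definition yet."""
    return sorted(term for term, entry in matches.items() if not entry["official"])


matches = resolve_terms("Repurchase rate of high-value customers", GLOSSARY)
print(sorted(matches))            # terms recognized in the question
print(unofficial_terms(matches))  # terms that still lack an official definition
```

Plain substring matching is used only to keep the sketch short; a real system would more likely match terms through retrieval or entity linking.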

Lineage and impact analysis

Ask-Data must not only be able to answer, but also be able to explain.

For example, a user asks:

“Where does this sales figure come from?”

The system should be able to answer:

This indicator comes from the sales operations data product. Underneath, it relies on the order fact table, the payment details table, and the customer dimension table. The upstream data is synchronized from the trading system, and downstream it is used by the business dashboard and the monthly analysis reports.

This requires lineage.

DataHub can record how data flows from upstream systems to intermediate tables, indicator tables, reports, and dashboards. This matters for trusted Ask-Data, because enterprise users often want not only a number, but also to know where the number comes from, whether it can be trusted, and which downstream applications it affects.
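To make the idea concrete, lineage can be modeled as a directed graph and walked in both directions. The sketch below is not DataHub's lineage API; the `UPSTREAM` edges are hypothetical and simply mirror the sales example above.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: each asset maps to the assets it reads from.
UPSTREAM = {
    "sales_indicator": ["order_fact", "payment_detail", "customer_dim"],
    "order_fact": ["trading_system"],
    "payment_detail": ["trading_system"],
    "customer_dim": [],
    "business_dashboard": ["sales_indicator"],
    "monthly_report": ["sales_indicator"],
}


def invert(edges: dict) -> dict:
    """Turn 'reads from' edges into 'is read by' edges."""
    downstream = defaultdict(list)
    for node, parents in edges.items():
        for parent in parents:
            downstream[parent].append(node)
    return dict(downstream)


def reachable(start: str, edges: dict) -> list:
    """Breadth-first walk returning every asset reachable from start."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# "Where does this number come from?"
print(reachable("sales_indicator", UPSTREAM))
# "What is affected if this number changes?"
print(reachable("sales_indicator", invert(UPSTREAM)))
```

The first call answers the source question and the second supports impact analysis; both use the same traversal over opposite edge directions.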

Owner, tags, and quality status

Ask-Data systems should also know:

  • Who is responsible for this table;
  • Who maintains this indicator;
  • Whether this field contains sensitive information;
  • Whether this data has been updated recently;
  • Whether there are any quality anomalies;
  • Whether this asset is certified.

This information should not be handled ad hoc in the LLM prompt; it belongs in a metadata management platform such as DataHub.

Therefore, the value of DataHub in Ask-Data can be summarized as:

Let the agent understand the trusted context of enterprise data assets before it queries.

What role does Cube play in Ask-Data?

If DataHub tells the system "what trusted data should be used," then Cube is responsible for turning this data into "queryable, controllable, and reusable business semantics."

The core of Cube is the semantic layer.

It encapsulates the tables and fields in the underlying database into objects that the business can understand:

Underlying fields:
  orders.pay_amount
  orders.created_at
  customers.region
  customers.customer_level

Cube semantic layer:
  sales
  number of orders
  number of customers
  repurchase rate
  region
  customer level
  order time

In this way, the Ask-Data system does not need the LLM to hand-write complex SQL. Instead, the LLM calls the indicators, dimensions, and filter conditions that are already defined in Cube.

Unified indicator definitions

For example, how should "sales" be calculated?

Is it the order amount or the payment amount? Are refunds deducted? Are discounts deducted? Is tax included? Are only paid orders counted?

These definitions cannot be left to the LLM's on-the-spot judgment; they should be explicitly captured in the Cube semantic layer.

Once a standard indicator is defined in Cube, all upper-layer applications access it through the same indicator:

Ask-Data
BI dashboards
Embedded analytics
Agents
Algorithm verification platform

They all get the same definition.

Control join paths

One of the most error-prone areas of enterprise databases is joins.

The order table can be joined to the customer table, product table, store table, channel table, and activity table. If the join granularity is wrong, numbers are easily inflated or miscalculated.

Cube can predefine in the semantic layer which tables are joined and how, so the LLM does not have to guess the join path.

This is critical for Ask-Data.
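The inflation risk is easy to reproduce. The toy example below, with made-up rows, joins an order-grain table to an item-grain table and then sums the order amount, which counts each order once per item. It only illustrates why the join grain has to be fixed in advance; it is not how Cube executes joins.

```python
# Toy data at two different grains (all values are made up).
orders = [
    {"order_id": 1, "amount": 100.0},
    {"order_id": 2, "amount": 50.0},
]
order_items = [
    {"order_id": 1, "sku": "A"},
    {"order_id": 1, "sku": "B"},
    {"order_id": 2, "sku": "C"},
]


def total_after_item_join(orders: list, items: list) -> float:
    """Sum order amounts after joining to items: one row per item, so
    an order with two items contributes its amount twice."""
    return sum(
        order["amount"]
        for order in orders
        for item in items
        if item["order_id"] == order["order_id"]
    )


def total_at_order_grain(orders: list) -> float:
    """Sum order amounts at the grain where the amount is defined."""
    return sum(order["amount"] for order in orders)


print(total_after_item_join(orders, order_items))  # inflated by the fan-out
print(total_at_order_grain(orders))                # the intended total
```

With these rows the joined total is 250.0 while the true total is 150.0, because order 1 has two items.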

Control permissions

If the Ask-Data system queries the database directly, permission risks follow easily.

For example:

  • Regional managers can only see their own regions;
  • Store managers can only see their own stores;
  • Financial data is only visible to finance;
  • Customer mobile phone number, ID number, address and other fields cannot be exposed to ordinary users.

Cube can combine access policies, row-level permissions, and user context to control the query scope.

In principle, enterprise-level Ask-Data should follow this rule:

The agent must not bypass Cube and query the source database directly.

All queries should go through the controlled semantic layer.
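As a rough illustration of scoping a query by user context, the sketch below adds a region filter for regional managers and rejects sensitive dimensions. In Cube itself this kind of rule is configured in the data model and access policies rather than written like this; the role names, member names, and the `scope_query` helper are all hypothetical.

```python
# Hypothetical members that ordinary users must never see.
SENSITIVE_DIMENSIONS = {"Customer.phone", "Customer.id_number", "Customer.address"}


def scope_query(query: dict, user: dict) -> dict:
    """Return a copy of the query restricted to what this user may see."""
    blocked = set(query.get("dimensions", [])) & SENSITIVE_DIMENSIONS
    if blocked and user.get("role") != "privacy_officer":
        raise PermissionError(f"sensitive dimensions not allowed: {sorted(blocked)}")

    scoped = dict(query)
    scoped["filters"] = list(query.get("filters", []))  # do not mutate the input
    if user.get("role") == "regional_manager":
        # Row-level restriction: a regional manager only sees their own region.
        scoped["filters"].append(
            {"member": "Region.name", "operator": "equals", "values": [user["region"]]}
        )
    return scoped


manager = {"role": "regional_manager", "region": "East China"}
request = {"measures": ["Sales.revenue"], "dimensions": ["Region.name"]}
print(scope_query(request, manager))
```

The point of the sketch is that the restriction comes from the user context, not from anything the LLM wrote, so a badly formed request cannot widen its own scope.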

Provide a stable API

Cube provides an API that allows the Ask-Data system to call indicators and dimensions in a structured way.

This is more stable than letting the LLM generate SQL directly.

Ideally, the LLM's output is not:

select ...

but something like:

{
  "measure": "Sales.revenue",
  "dimensions": ["Region.name"],
  "filters": { "Region.name": "East China" },
  "time_range": "last_month"
}

The real query is then generated and executed by Cube based on the semantic layer.

In this way, the LLM is responsible for understanding and orchestration, and Cube is responsible for execution and governance.
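One way to enforce this split is to validate the LLM's structured request against the members defined in the semantic layer before anything is executed. The sketch below assumes a small allow-list and translates the request into a query object that follows the general shape of Cube's REST API query format (`measures`, `dimensions`, `filters`, `timeDimensions`) as we understand it; check the Cube documentation before relying on the exact fields. The time dimension `Sales.created_at` and the helper name are assumptions.

```python
import json

# Hypothetical allow-list; in practice it would be read from the semantic layer.
ALLOWED_MEASURES = {"Sales.revenue"}
ALLOWED_DIMENSIONS = {"Region.name"}
TIME_RANGES = {"last_month": "last month", "this_month": "this month"}


def to_cube_query(request: dict, time_dimension: str = "Sales.created_at") -> dict:
    """Validate an LLM-produced request and build a Cube-style query object."""
    measure = request["measure"]
    if measure not in ALLOWED_MEASURES:
        raise ValueError(f"unknown measure: {measure}")

    dimensions = list(request.get("dimensions", []))
    filter_items = request.get("filters", {})
    for member in [*dimensions, *filter_items]:
        if member not in ALLOWED_DIMENSIONS:
            raise ValueError(f"unknown dimension: {member}")

    query = {
        "measures": [measure],
        "dimensions": dimensions,
        "filters": [
            {"member": member, "operator": "equals", "values": [value]}
            for member, value in filter_items.items()
        ],
    }
    time_range = request.get("time_range")
    if time_range is not None:
        if time_range not in TIME_RANGES:
            raise ValueError(f"unsupported time range: {time_range}")
        query["timeDimensions"] = [
            {"dimension": time_dimension, "dateRange": TIME_RANGES[time_range]}
        ]
    return query


llm_output = {
    "measure": "Sales.revenue",
    "dimensions": ["Region.name"],
    "filters": {"Region.name": "East China"},
    "time_range": "last_month",
}
print(json.dumps(to_cube_query(llm_output), indent=2))
```

A request that names an undefined measure or dimension fails here, before it reaches the database, which is the property the structured format is meant to provide.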

How should DataHub and Cube cooperate?

DataHub and Cube can work together like this:

User question
  ↓
Ask-Data agent
  ↓
Query DataHub first: understand terminology, data assets, lineage, quality, and owner
  ↓
Then query Cube: call the unified indicators, dimensions, and permission model
  ↓
Database / data warehouse
  ↓
Return results + definition + source + trust notes

That is to say:

DataHub is the trusted context layer before querying. Cube is the semantic execution layer during querying.
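The ordering of the two layers can be sketched as a small orchestration function. The two callables are stand-ins: `fake_datahub_lookup` and `fake_cube_query` are hypothetical stubs returning dummy data, not real client calls to either product.

```python
def answer_question(question: str, lookup_context, run_semantic_query) -> dict:
    """Context first, then the semantic query; never query on ambiguity."""
    context = lookup_context(question)            # trusted context layer
    if context["ambiguities"]:
        return {"status": "needs_clarification", "ask": context["ambiguities"]}

    rows = run_semantic_query(context["semantic_request"])  # semantic execution layer
    return {
        "status": "answered",
        "rows": rows,
        "definition": context["definition"],
        "sources": context["sources"],
        "owner": context["owner"],
    }


def fake_datahub_lookup(question: str) -> dict:
    """Stub for the metadata lookup; every value here is a placeholder."""
    return {
        "ambiguities": [],
        "semantic_request": {"measure": "Sales.revenue", "dimensions": ["Region.name"]},
        "definition": "placeholder definition of sales",
        "sources": ["order_fact", "customer_dim"],
        "owner": "sales-data-team",
    }


def fake_cube_query(request: dict) -> list:
    """Stub for the semantic-layer query; returns dummy rows."""
    return [{"Region.name": "East China", request["measure"]: 1000.0}]


result = answer_question("Sales by region?", fake_datahub_lookup, fake_cube_query)
print(result["status"], result["rows"])
```

Passing the two layers in as functions keeps the agent logic independent of how each layer is actually reached.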

A complete Ask-Data process example

Suppose the user asks:

“Did the repurchase rate of high-value customers in East China decline last month?”

A trusted Ask-Data system should not generate SQL immediately, but should follow the process below.

Step 1: The agent understands the question

The system first breaks the question down:

Indicator: repurchase rate
Target: high-value customers
Region: East China
Time: last month
Analysis intent: whether it is declining

Step 2: Query DataHub

The system goes to DataHub to check:

  • What the official definition of "repurchase rate" is;
  • What the definition of "high-value customer" is;
  • Which business-region definition "East China" refers to;
  • Which data products are related to these terms;
  • Whether the related tables are certified;
  • Whether the data has recent quality anomalies;
  • Who the owner is;
  • Whether there are historical query examples.
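One part of this lookup is spotting terms that have more than one registered meaning, so the agent can ask before it queries. A minimal sketch follows, assuming a hypothetical `TERM_CANDIDATES` table; the three meanings of "East China" are the ones mentioned earlier in this article, and the helper name is invented.

```python
# Hypothetical table of candidate meanings per term, as it might be
# assembled from glossary entries. One meaning = unambiguous.
TERM_CANDIDATES = {
    "east china": ["sales region", "customer registration location", "store location"],
    "last month": ["previous calendar month"],
}


def clarification_questions(question: str, candidates: dict) -> list:
    """Build one clarifying question per term with several meanings."""
    text = question.lower()
    prompts = []
    for term, meanings in candidates.items():
        if term in text and len(meanings) > 1:
            options = ", ".join(meanings)
            prompts.append(f'"{term}" can mean: {options}. Which one do you mean?')
    return prompts


for prompt in clarification_questions(
    "Did the repurchase rate in East China decline last month?", TERM_CANDIDATES
):
    print(prompt)
```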

If ambiguity is found, the system should first clarify:

"By East China, do you mean the East China sales region, or customers registered in East China?"

Step 3: Select Cube semantic object

Based on the context returned by DataHub, the system selects the corresponding indicators and dimensions in Cube:

measure: Customer.repurchase_rate
dimension: Customer.region
filter: region = East China
filter: customer_segment = high value customer
time: last month

Step 4: Cube executes the query

Based on the defined semantic model, Cube:

  • Uses the unified repurchase rate definition;
  • Uses the correct fact and dimension tables;
  • Follows the controlled join path;
  • Applies user permissions;
  • Executes the query;
  • Returns the results.

Step 5: The agent generates a trustworthy answer

The final answer should be more than just one sentence:

"It's down."

Instead it should be:

Last month, the repurchase rate of high-value customers in East China was 23.6%, down 2.1 percentage points from the previous month.
Definition: repurchase rate = number of customers who made two or more purchases in the period / number of customers who made at least one purchase in the period.
Data source: the customer business data product, which relies on the order fact table and the customer dimension table.
Governance: this indicator is maintained by the customer management team, and the related data assets are certified. No data quality anomalies were found for this query.
Note: East China is counted by sales region.

This is trusted Ask-Data.
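The repurchase rate definition quoted in the answer, and the month-over-month change in percentage points, can be written out as a small worked example. The order lists are toy data and do not reproduce the 23.6% figure above; in the architecture described here the calculation would live in the semantic layer, not in agent code.

```python
from collections import Counter


def repurchase_rate(order_customer_ids: list):
    """Customers with 2+ purchases divided by customers with 1+ purchase.

    Takes one customer id per order placed in the period. Returns None
    when nobody purchased, since the rate is undefined in that case.
    """
    purchases = Counter(order_customer_ids)
    buyers = len(purchases)
    if buyers == 0:
        return None
    repeat_buyers = sum(1 for count in purchases.values() if count >= 2)
    return repeat_buyers / buyers


def change_in_points(current: float, previous: float) -> float:
    """Difference between two rates, in percentage points."""
    return round((current - previous) * 100, 1)


# Toy data: one entry per order.
previous_month = ["c1", "c1", "c2", "c2", "c3", "c4"]   # 2 of 4 buyers repeat
last_month = ["c1", "c1", "c2", "c2", "c3", "c4", "c5"]  # 2 of 5 buyers repeat

current = repurchase_rate(last_month)
previous = repurchase_rate(previous_month)
print(current, previous, change_in_points(current, previous))
```

Here the rate moves from 0.5 to 0.4, a drop of 10 percentage points, which is not the same as a 10% relative decline; stating the unit is part of making the answer explainable.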

Several governance principles

If an enterprise really wants to make DataHub + Cube a trusted base layer, it needs to establish some hard rules.

Uncertified data does not enter Ask-Data

By default, production Ask-Data only uses data that is registered in DataHub, has an assigned owner, is not deprecated, and is certified or has a clear trust level.

The default rules should be:

Data that is not registered in DataHub may not enter Ask-Data.
Data without an assigned owner does not enter production Ask-Data.
Uncertified data is not used as the default answer source.
Deprecated data may not be used by the agent.
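These default rules are simple enough to express as an admission check that reports every rule an asset violates. The `CATALOG` below is a hypothetical snapshot of metadata; the owner names and flag values are invented for illustration and would come from the metadata platform in practice.

```python
# Hypothetical metadata snapshot; the flags below are made up.
CATALOG = {
    "ads_sales_summary": {"owner": "sales-data-team", "certified": True, "deprecated": False},
    "tmp_order_2024": {"owner": None, "certified": False, "deprecated": False},
    "ods_order": {"owner": "platform-team", "certified": True, "deprecated": True},
}


def admission_violations(table: str, catalog: dict) -> list:
    """Return the rules a table violates; an empty list means it is admitted."""
    meta = catalog.get(table)
    if meta is None:
        return ["not registered in the catalog"]
    violations = []
    if not meta["owner"]:
        violations.append("no assigned owner")
    if not meta["certified"]:
        violations.append("not certified")
    if meta["deprecated"]:
        violations.append("deprecated")
    return violations


for name in ["ads_sales_summary", "tmp_order_2024", "ods_order", "scratch_table"]:
    print(name, admission_violations(name, CATALOG))
```

Returning the reasons, rather than a bare yes or no, lets the agent tell the user why a table was not used.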

Core indicators enter Cube

Core indicators should not be scattered across prompts, code, SQL, reports, and Excel files.

They should be consolidated in the Cube semantic layer as reusable, manageable indicator objects.

The agent does not query the source database directly

Agents can understand the question, invoke tools, and generate explanations.

But the actual numbers should be queried through Cube.

This avoids permission bypasses, definition drift, and SQL hallucinations.

The answer is explainable

A trustworthy Ask-Data answer should at least explain:

  • Indicator definition;
  • Data source;
  • Time range;
  • Filter conditions;
  • Owner;
  • Data quality status;
  • Any limitations of the definition.

This is also the core value of using DataHub and Cube together.
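A simple way to hold answers to this checklist is to refuse to render one that lacks any of the listed items. The field names and the sample values below are assumptions for illustration; the sample values paraphrase the East China example from this article.

```python
# Field names mirror the checklist above; the names themselves are assumptions.
REQUIRED_FIELDS = (
    "definition", "data_source", "time_range", "filters",
    "owner", "quality_status", "limitations",
)


def missing_explanations(answer: dict) -> list:
    """Return the checklist items the answer does not provide."""
    return [name for name in REQUIRED_FIELDS if answer.get(name) is None]


def render(answer: dict) -> str:
    """Render the explanation block, refusing incomplete answers."""
    missing = missing_explanations(answer)
    if missing:
        raise ValueError(f"answer is not explainable, missing: {missing}")
    return "\n".join(f"{name}: {answer[name]}" for name in REQUIRED_FIELDS)


sample = {
    "definition": "customers with 2+ purchases / customers with 1+ purchase",
    "data_source": "customer business data product",
    "time_range": "last month",
    "filters": "region = East China; segment = high-value customer",
    "owner": "customer management team",
    "quality_status": "no anomalies found",
    "limitations": "East China is counted by sales region",
}
print(render(sample))
```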

DataHub + Cube is not just for Ask-Data

The value of this combination is not limited to ChatBI, it can also serve more intelligent applications.

For RAG

DataHub can tell the RAG system which documents, reports, and data products are more trustworthy. Cube can provide structured indicator results. In this way, RAG not only answers document questions, but can also be combined with real business data.

Rapid algorithm verification

An algorithm agent can first use DataHub to determine:

  • Whether the data exists;
  • Whether the data granularity is appropriate;
  • Whether historical data is available;
  • Whether the quality is normal;
  • Who the owner is.

Then use Cube to obtain credible indicators and aggregated features to form algorithm verification data.

Enterprise ontology network

In the future, if an enterprise builds a business ontology or knowledge graph, DataHub can provide data asset mapping, and Cube can provide indicators and query capabilities.

For example:

Business ontology: customers, orders, products, stores, warehouses
DataHub: which tables and fields these entities correspond to
Cube: how to calculate the indicators on these entities

Note that running such a system has real entry costs. It requires scenario-based research and testing up front and continuous maintenance after launch. Because that maintenance is itself labor-intensive, high-value core scenarios should come first.
