
Earlier articles in this series covered the background of the semantic-native Ask-Data system and its data warehouse. The previous article explained why business objects, indicator definitions, segments, and similar concepts should be solidified into a unified semantic layer that humans, agents, and algorithm scripts can all call consistently.

This article turns to how queries are performed on top of that unified semantic layer, starting with its first consumption entry point:

How can natural-language data queries from humans be mapped stably onto this set of semantics?

At this point, many Ask-Data projects hand the model the schema, business indicator definitions, formulas, and concepts all at once and let it generate SQL. The current project takes a more engineering-oriented approach (the reasons are explained in detail below):

Instead of letting the model freely generate SQL, the natural-language question is first converged into a controlled query intermediate representation (IR); in the current implementation, this IR is Cube Query.

"Controlled" here means that the model does not decide on the spot how to query. Instead, the system first defines the available business objects, indicators, dimensions, filter conditions, time semantics, and sorting methods, and then maps the natural-language question onto a limited, verifiable set of query structures.

For example, for the question:

In the Beauty category, which price bands have seen the fastest review growth in the last 90 days?

The system breaks it down into the key parts that controlled IR requires: which business objects to use, which indicators to look at, which dimensions to analyze by, which filter conditions to apply, which time semantics to use, and which sorting method to use. What the model really needs to do is not invent a query, but map this sentence onto the defined query structures.

This article corresponds to the fourth step of the series: turning "unified semantics" into a "consumable data querying interface", with priority given to stability, interpretability, and verifiability at this stage.
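The six parts listed above can be sketched as a small typed structure. This is only an illustration, not the project's actual schema: the class name `QueryIntent` and its field names are hypothetical, and the member names are borrowed from the Cube Query example shown later in this article.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QueryIntent:
    """A question reduced to the six parts of a controlled query (hypothetical shape)."""

    business_object: str                       # which business object to query
    indicators: Tuple[str, ...]                # which indicators to look at
    dimensions: Tuple[str, ...]                # which dimensions to analyze by
    filters: Tuple[Tuple[str, str, str], ...]  # (member, operator, value) triples
    lookback_days: int                         # time semantics: window length in days
    order_by: str                              # indicator to sort on
    descending: bool = True                    # sort direction


# The example question expressed in this structure.
intent = QueryIntent(
    business_object="ProductDaily",
    indicators=("commentGrowthRate90d", "commentGrowth90d", "commentCount"),
    dimensions=("priceBand",),
    filters=(("category", "equals", "Beauty"),),
    lookback_days=90,
    order_by="commentGrowthRate90d",
)
```

Because the structure is frozen and every field is typed, the model's job reduces to filling slots, and anything that does not fit the shape can be rejected before execution.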

1. Why not generate SQL directly?

For "natural language data querying", many people default to NL2SQL. A common view is that once semantic-layer knowledge such as business indicator definitions, formulas, and concepts is given to the model, the model can write SQL equivalent to controlled IR.

This view overlooks one point: even if the business indicators and formulas are given to the model, the system still lacks verifiable boundaries as long as the final execution layer is open.

As long as the model itself ultimately determines the specific query structure, it is still making on-the-spot judgments about business object selection, time windows, filter conditions, aggregation granularity, the choice among similar indicators, and sorting methods. One could parse the generated SQL back into an AST (abstract syntax tree, for example with the sqlglot package) to recover a verifiable structure and then validate it, but that detour is far more cumbersome than generating controlled IR directly.

So the problem the current project needs to solve here is not "let the model know the business definitions", but "let the model first output a controlled query structure, which the system then executes". In other words, the question is:

Whether the system can stably and reliably generate queries and answers that comply with unified business semantics.

This article does not pursue "the model can be asked anything"; it pursues "the system first answers the core set of questions stably".

2. The current system converges on high-frequency question types

The project first converges on a set of question types that have real business value and can be reused stably.

For example, the current system supports the following types of questions:

  • In the Beauty category, which price bands have seen the fastest review growth in the last 90 days?

  • In the last 90 days, which price bands in the Beauty category have grown the fastest while also seeing an increase in negative reviews?

  • Which products in the Beauty category have the most highly helpful negative reviews in the past 30 days?

  • Which products have the most negative reviews in the last 30 days?

  • Which products have high review growth but low ratings in the last 90 days?

  • What are the review counts and average ratings for the different price bands in the Beauty category?

This differs from the demo-style example prompts common in Ask-Data systems. The goal is not to show that "the model may be able to answer this", but to converge on a set of analysis interfaces that can be reused over the long term and offered to Agents, Dashboards, and other entry points for consumption.

A natural question at this point: if only a set of high-frequency question types is currently supported, why build Ask-Data at all? Wouldn't a rule-based approach be enough?

One way to look at it: if the system only needed to answer a few completely fixed questions, then rule-based interfaces, fixed reports, or parameterized queries would indeed be simpler in many scenarios. But what is being converged here is a set of high-frequency question types, not a handful of rigid fixed questions. As question types keep expanding, phrasings multiply, and entry points increase, the maintenance cost of a pure rule-based approach rises quickly, whereas a structure in which the LLM is responsible for understanding the language and controlled IR is responsible for constraining execution is easier to carry forward.

For example, users can ask the same type of question in different ways:

  • In the Beauty category, which price band's reviews have grown the fastest in the last 90 days?

  • Which price segment has seen the fastest growth in reviews in the Beauty category in the past 90 days?

  • Looking at the price range, which price range has the highest growth rate of reviews in the Beauty category in the past 90 days?

These three questions differ on the surface, but they all fall into the same controlled query structure. If only a handful of phrasings had to be supported, rule matching would certainly work; it is the continued growth of question types, expression variants, business objects, and entry points that makes a pure rule-based solution expensive to maintain.

Natural language has many variations, but they eventually converge to a limited set of question types and query structures.

The value of "intelligence" is reflected in natural language understanding, intent classification, slot extraction and ambiguity resolution, rather than freely inventing queries.
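To make the convergence concrete, here is a toy sketch in which three differently worded questions end up with identical slots. In the project this understanding step is the LLM's job; the synonym table and the `extract_slots` function below are hypothetical stand-ins used only to show what "many phrasings, one structure" means.

```python
import re
from typing import Optional

# Hypothetical synonym table: several surface phrases, one semantic-layer member.
DIMENSION_SYNONYMS = {
    "price band": "ProductDaily.priceBand",
    "price segment": "ProductDaily.priceBand",
    "price range": "ProductDaily.priceBand",
}


def extract_slots(question: str) -> dict:
    """Reduce a question to the slots a controlled query needs (toy stand-in for the LLM)."""
    text = question.lower()
    dimension: Optional[str] = next(
        (member for phrase, member in DIMENSION_SYNONYMS.items() if phrase in text),
        None,
    )
    window = re.search(r"(\d+)\s+days", text)
    return {
        "dimension": dimension,
        "category": "Beauty" if "beauty" in text else None,
        "window_days": int(window.group(1)) if window else None,
    }


variants = [
    "In the Beauty category, which price band has grown fastest in reviews in the last 90 days?",
    "Which price segment has seen the fastest growth in reviews in the Beauty category in the past 90 days?",
    "Looking at the price range, which one has the highest review growth rate in Beauty in the past 90 days?",
]
slots = [extract_slots(q) for q in variants]
```

All three entries of `slots` are the same dictionary, so everything downstream of this step only ever sees one query structure.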

3. Query process

The data querying process of the current project is as follows:

User question
↓
Identify intent
↓
Map to fixed business objects, indicators, dimensions and conditions
↓
Generate controlled IR (currently implemented as Cube Query)
↓
Execute and generate answers

Here, natural language does not directly determine the SQL; the system first reduces it to a formal intent. What the system really needs to identify is not which phrasing the user chose, but which business objects, indicators, dimensions, filter conditions, time range, and sorting method lie behind the sentence.
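One possible way to wire up the "identify intent, then map to a fixed structure" steps is a registry of planner functions keyed by intent. This is a sketch under assumptions: the intent name `price_band_review_growth`, the `planner` decorator, and the `plan` function are all hypothetical, not the project's API.

```python
from typing import Callable, Dict

# Registry of supported intents. Anything not registered here cannot be planned.
PLANNERS: Dict[str, Callable[[dict], dict]] = {}


def planner(intent_name: str):
    """Register a function that builds the controlled IR for one intent."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        PLANNERS[intent_name] = fn
        return fn
    return register


@planner("price_band_review_growth")
def plan_price_band_review_growth(slots: dict) -> dict:
    # Only the slot values vary; the query shape itself is fixed in code.
    return {
        "measures": ["ProductDaily.commentGrowthRate90d"],
        "dimensions": ["ProductDaily.priceBand"],
        "filters": [
            {
                "member": "ProductDaily.category",
                "operator": "equals",
                "values": [slots["category"]],
            }
        ],
        "order": {"ProductDaily.commentGrowthRate90d": "desc"},
    }


def plan(intent_name: str, slots: dict) -> dict:
    """Turn a recognized intent plus its slots into a controlled query, or refuse."""
    if intent_name not in PLANNERS:
        raise ValueError(f"unsupported intent: {intent_name!r}")
    return PLANNERS[intent_name](slots)
```

Calling `plan("price_band_review_growth", {"category": "Beauty"})` returns a query with a fixed shape, while an unregistered intent raises `ValueError` instead of falling back to free-form generation.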

Continuing with the previous example:

In the Beauty category, which price bands have seen the fastest review growth in the last 90 days?

After this sentence enters the system, it is reduced to a structure similar to the following:

  • Core indicator: 90-day review growth rate

  • Auxiliary indicators: 90-day review increment, total number of reviews

  • Analysis dimension: price band

  • Filter condition: Category = Beauty

  • Time range: anchored to the end of the data, looking back 90 days

  • Sorting method: descending by 90-day review growth rate

Once the intent has been reduced to this structure, the data querying layer no longer needs to "invent the query". It only needs to reference the formally defined members of the semantic layer to form an executable controlled IR; in the current implementation, this IR is Cube Query, similar to the structure below.

{
  "measures": [
    "ProductDaily.commentGrowthRate90d",
    "ProductDaily.commentGrowth90d",
    "ProductDaily.commentCount"
  ],
  "dimensions": [
    "ProductDaily.priceBand"
  ],
  "filters": [
    {
      "member": "ProductDaily.category",
      "operator": "equals",
      "values": [
        "Beauty"
      ]
    }
  ],
  "timeDimensions": [
    {
      "dimension": "ProductDaily.date",
      "dateRange": [
        "<data_end_minus_89d>",
        "<data_end>"
      ]
    }
  ],
  "order": {
    "ProductDaily.commentGrowthRate90d": "desc"
  },
  "limit": 20
}
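The `<data_end_minus_89d>` and `<data_end>` placeholders in the query can be filled with a small helper. The sketch below assumes both ends of the date range are inclusive, which is what the 89-day offset suggests for a 90-day window; the function name `trailing_date_range` and the sample date are hypothetical.

```python
from datetime import date, timedelta
from typing import List


def trailing_date_range(data_end: date, window_days: int) -> List[str]:
    """Return [start, end] ISO dates for a window of `window_days` days ending on `data_end`."""
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    # With both ends inclusive, a 90-day window starts 89 days before its end.
    start = data_end - timedelta(days=window_days - 1)
    return [start.isoformat(), data_end.isoformat()]


# Example with a made-up data end date.
date_range = trailing_date_range(date(2024, 3, 31), 90)
# date_range == ["2024-01-02", "2024-03-31"]
```

Anchoring to the last date present in the data, rather than to today's date, keeps the same question reproducible against a static dataset.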

For the series as a whole, this step is critical. From here on, the system formally connects "free human expression" to "unified business semantics", and what sits in between is neither one-off constraints in the prompt nor on-the-spot SQL assembly, but a controlled mapping.

4. What are the specific constraints of “controlled”?

“Controlled IR” is not an abstract slogan. In the current project, it restricts at least three things.

1. The model does not decide on the spot which indicators, dimensions, and conditions can be used.

The output of the data querying layer is a controlled IR, which means that it can only reference members that the semantic layer has recognized.

That is to say:

  • The indicators that can be referenced are fixed

  • The dimensions that can be referenced are fixed

  • The segments that can be reused are fixed

  • The filter conditions and sorting methods that can be used are also within the controlled range

The key boundary behind this is:

The data querying layer does not own the business definitions. It is only responsible for mapping user expressions onto the business definitions recognized by the semantic layer.

This is completely different from letting the model decide on the spot how to query.
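The boundary described above can be enforced with a plain allowlist check before anything is executed. The sketch below is an assumption about how such a check might look, not the project's validator: the allowlists would in practice be derived from the semantic layer's own definitions, and the sets and the `validate_query` function here are hypothetical.

```python
from typing import List

# Hypothetical allowlists. In a real system these would come from the semantic layer.
ALLOWED_MEASURES = {
    "ProductDaily.commentGrowthRate90d",
    "ProductDaily.commentGrowth90d",
    "ProductDaily.commentCount",
}
ALLOWED_DIMENSIONS = {"ProductDaily.priceBand", "ProductDaily.category"}
ALLOWED_OPERATORS = {"equals", "notEquals"}


def validate_query(query: dict) -> List[str]:
    """Return a list of violations; an empty list means the query stays inside the boundary."""
    errors: List[str] = []
    for measure in query.get("measures", []):
        if measure not in ALLOWED_MEASURES:
            errors.append(f"unknown measure: {measure}")
    for dimension in query.get("dimensions", []):
        if dimension not in ALLOWED_DIMENSIONS:
            errors.append(f"unknown dimension: {dimension}")
    for flt in query.get("filters", []):
        if flt.get("member") not in ALLOWED_DIMENSIONS:
            errors.append(f"unknown filter member: {flt.get('member')}")
        if flt.get("operator") not in ALLOWED_OPERATORS:
            errors.append(f"operator not allowed: {flt.get('operator')}")
    for member in query.get("order", {}):
        if member not in ALLOWED_MEASURES | ALLOWED_DIMENSIONS:
            errors.append(f"unknown order member: {member}")
    return errors
```

A query that references only recognized members yields an empty list, while a request for something like `ProductDaily.revenue` is reported as a violation instead of being executed.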

2. The output is controlled IR, not free-form SQL

In this project, the final output of the data querying layer is not SQL, but a controlled intermediate representation of the query.

This looks like a technology choice, but it is essentially boundary design. Once the final output layer is a controlled IR, the deterministic constraints on query execution become explicit: which members can be referenced, which time windows are available, and which sorting and filtering methods are legal can all be audited. Even if the model has seen the business definitions, in the end it can only submit a structured request within the boundary, not a free-form execution instruction.

The biggest advantage is that this separates the "fuzziness of natural language" from the "determinism of query execution". Natural language can have many expressions, but it must converge into a limited set of formal structures before the system actually executes anything.

3. Answer generation is not completely free either.

In many demos, the last step after query execution simply lets the model summarize freely.

This project does not leave that step fully open for now. It starts with a simpler but more stable method: generating a structured summary that matches the result type of each intent.

This step is relatively lightweight, but the direction is clear. For a semantic-native data querying system, an answer is not just "the results restated in plain language"; it should also gradually gain the following capabilities:

  • Explain the core conclusion

  • Explain which indicators were used

  • Specify the time range

  • State the filters applied

  • Give detailed results

  • Later, progressively attach evidence such as representative products and representative reviews

In other words, answer generation should not ultimately be a free summary module that is completely separated from the semantic layer, but should continue to share the same set of business interpretation boundaries.
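A structured summary for one intent might look like the sketch below. It is illustrative only: the function name, the row keys (`priceBand`, `growthRate`, `increment`), and the sample rows are made up for this example and are not real results or the project's actual output format.

```python
from typing import Dict, List


def summarize_growth_ranking(rows: List[Dict], date_range: List[str], category: str) -> str:
    """Render a fixed-shape summary: conclusion, indicators, time range, filter, details."""
    if not rows:
        return (
            f"No price bands matched category = {category} "
            f"between {date_range[0]} and {date_range[1]}."
        )
    top = rows[0]  # rows are assumed to arrive sorted by growth rate, descending
    lines = [
        f"Conclusion: price band {top['priceBand']} grew fastest ({top['growthRate']:.1%}).",
        "Indicators: 90-day review growth rate, 90-day review increment.",
        f"Time range: {date_range[0]} to {date_range[1]}.",
        f"Filter: category = {category}.",
        "Details:",
    ]
    for row in rows:
        lines.append(
            f"  {row['priceBand']}: {row['growthRate']:.1%} (+{row['increment']} reviews)"
        )
    return "\n".join(lines)


# Made-up rows, for illustration only.
sample_rows = [
    {"priceBand": "20-50", "growthRate": 0.42, "increment": 210},
    {"priceBand": "50-100", "growthRate": 0.18, "increment": 95},
]
summary = summarize_growth_ranking(sample_rows, ["2024-01-02", "2024-03-31"], "Beauty")
```

Because the indicator, time range, and filter lines are filled from the same values the query used, the answer cannot drift away from what was actually executed.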

5. So far, what has this data querying layer accomplished?

This data querying layer has now completed a minimal closed loop and verified the following. The relevant code will be open sourced to GitHub: xuanagi and www.xuanagi.com.

First, Ask-Data does not have to start with open-ended NL2SQL. With a controlled set of questions and a controlled Planner, a real and usable business data querying closed loop can still run.

Second, the unified semantic layer can be consumed directly through natural language. As long as the data querying layer does not bypass the semantic layer, but explicitly references its members, a formal connection can be established between natural language data querying and the semantic layer.

Third, the problem space can be converged gradually through engineering. Instead of pursuing "ask anything" from the start, clearly defining the core set of questions first makes subsequent expansion easier.

In other words, what has been verified here is:

Humans can query data stably on top of unified semantics.

Conclusion

The core of this article is:

The core principle of Ask-Data is not to allow the model to freely generate arbitrary queries, but to allow the system to stably generate queries and answers that comply with unified business semantics.

At this point, the first four articles have basically assembled the minimal closed loop of this implementation:

Public data
↓
Analyzable data warehouse
↓
Unified semantic layer
↓
Controlled natural language data querying

But if the system is only a question-and-answer interface serving humans, it is still not much different from an ordinary Ask-Data demo.

So, the next article will answer:

Why should the same set of semantics continue to serve Agents, Dashboards, and ML Pipelines? If this cannot be achieved, what is the point of being “semantically native”?
