Skip to content

Previously I wrote a series of technical interpretation articles about Ask-Data:

Next, I plan to take an AI-native sample project "Amazon Product Selection and Competitive Product Review Ask-Data System for Individual Cross-Border Sellers" as an example, using public data to implement a semantically native Ask-Data system from scratch. This project will use "Ask-Data" as the first entry point, but the bottom layer is not simple NL2SQL, but a unified semantic layer based on Cube, allowing humans, AI Agents, Dashboards and subsequent machine learning models to access data querying data through the same set of business semantics. Due to the large amount of content, this project will be made into a series. Each article will correspond to a part of the open source content. As the series progresses, the complete code and documents will be gradually open sourced.

The open source code corresponding to this series has been released on GitHub: xuanagi/semantic-native-nlq-demo.

This project wants to verify several things:

  1. Ask-Data is not an isolated function, but a basic capability that business agents often need to have when implementing it. For a business agent to move from auxiliary analysis to business automation, it must first be able to correctly access data, understand a unified caliber, and allow humans to review which data, which indicators, and which evidence it based on which judgments it made.

  2. Many Ask-Data projects, especially open source projects, still focus on Text2SQL, but when it comes to actual implementation, what is more difficult is business semantics, indicator caliber and multi-entry consistency. The current project uses Cube to implement the semantic layer, which is more conducive to indicator management, caliber reuse and multi-entry consistency than direct Text2SQL. At the same time, Cube's caching and pre-aggregation capabilities also provide room for subsequent performance optimization.

  3. The value of the semantic layer is not just to serve human data querying. Whether it is human query, business agent invocation, Dashboard display, or subsequent machine learning feature generation, the same set of business semantics should be reused instead of being counted separately in different prompt words, SQL scripts, and notebooks.

  4. This is also a practice of the previously proposed "semantic native" path: the traditional path is to "treat chaos first and then cure it", while the AI-native path should be to "precipitate semantics while building".

1. This is not an ordinary Amazon review analysis tool

Before formally introducing the project, one point must be clarified:

This project is not about making another Amazon Review Analyzer.

Because there are already many mature products in this direction.

Amazon has its own Product Opportunity Explorer to help sellers understand information such as niche, search terms, clicks, purchases, reviews, and pricing; Amazon's Customer Feedback API also provides insights from customer reviews and returns, and explains that the information it provides is consistent with the information provided by the Product Opportunity Explorer in Seller Central and Vendor Central.

Tools such as VOC AI have also made Amazon review insights very product-oriented. The public page shows that it emphasizes analyzing a large number of Amazon reviews to help sellers discover product gaps, monitor brands, and understand consumer feedback.

Therefore, this project does not take "more complete data, more real-time monitoring, and more complete seller operation functions" as the first goal, but:

Using the specific business scenario of Amazon product selection and competitive product reviews, we made an open source reference implementation of semantically native Ask-Data.

This is the positioning of the current project.

In other words, existing tools mainly answer:

How can sellers analyze reviews, understand the market, and assist in product selection more efficiently?

What this project wants to answer more is:

If we want to make "data querying" into a system that is reusable, verifiable, and callable by Agents, how should the bottom layer be designed?

So it chose a business scenario that has been proven to be valuable, but what really needs to be settled is a set of reference implementation methods for semantically native Ask-Data.

2. Greater significance: trusted data access infrastructure for data agents

This project aims to demonstrate how a semantically native Ask-Data system should be implemented.

In the AI-native scenario, Ask-Data is not only a way for humans to look up numbers in natural language, but also a basic capability for the trustworthy operation of data intelligence agents. For an intelligent agent to move from auxiliary analysis to business automation, it must first be able to obtain correct data, understand a unified business caliber, and allow humans to review which data, which indicators, and which evidence it bases its judgment on.

Therefore, what this project really wants to show is: how to unify business objects, indicator calibers, query logic, evidence chains, and agent calls through the semantic layer, so that humans can perform data querying, Agents can be called, and machine learning can be reused, and all results are based on the same set of business semantics.

3. Why choose individual cross-border sellers and Amazon review scenarios?

Although there are already many Amazon review analysis tools, this scenario is still very suitable as a prototype for Ask-Data.

There are four main reasons.

1. Questions are specific and highly relevant to real economic value

The scenario of "individual cross-border seller" is specific enough, and naturally requires high-frequency analysis such as product selection, competing products, reviews, and listing optimization.

This group usually has very specific concerns:

Which category should I do? Which products have seen the fastest demand growth recently? Which competing products are selling well but users are not satisfied? In what areas are the negative reviews concentrated? What selling points should the listing highlight? What risks should be avoided in advance?

2. Data is relatively easy to publicly reproduce

Publicly available datasets like Amazon Reviews 2023 can be used. This data set was collected by McAuley Lab and contains user reviews, ratings, text, helpful votes, item metadata, descriptions, price, images and bought-together graphs; its 2023 version contains 571.54M reviews, 54.51M users, 48.19M items, 33 domains, and the time range is September 2023.

This means projects can be reproduced as open source without having to rely on real-time crawlers, private data, or third-party APIs.

3. Questions are naturally suitable for Ask-Data

Cross-border sellers don’t just want to read a report, but want to make decisions around product selection and competing products.

For example:

Which Beauty category has seen the fastest growth in reviews over the past 90 days? In what areas do negative reviews of competing products mainly focus? What are the most common selling points mentioned in positive user reviews? Which products have growing demand but low user satisfaction?

These questions are not simple SQL or plain text summaries, but require a combination of products, reviews, ratings, time, price bands, competing products, user pain points and evidence chains.

This is where Ask-Data fits in.

4. It can be naturally expanded to monitoring, early warning and assisted decision-making in the future.

Comment data has both structured parts and unstructured text, which is suitable for gradual introduction:

Pain point clustering Negative review anomaly detection Review growth prediction Opportunity score model

In other words, based on Ask-Data, it can be naturally combined with algorithms such as machine learning to further enhance the ability to assist decision-making. Subsequently, these capabilities can be called by Agents or business workflows to further form a closed loop of monitoring, early warning, and assisted decision-making.

It also shows that you need to think clearly from the first day:

Machine learning features should also be obtained from the semantic layer instead of writing a set of SQL for the training script.

4. Overall system architecture

The overall architecture of this project can be divided into six floors.

Public data / user imported data ↓ Data cleaning and standardization ↓ Data warehouse layer: products, reviews, review insights, product day indicators ↓ Cube semantic layer: indicators, dimensions, calibers, relationships ↓ Ask-Data layer: natural language → Cube Query → answer generation ↓ Agent /Dashboard/ML Pipeline

Conclusion

The first article first explains clearly the project background, positioning and overall structure.

Starting from the next article, we will enter the specific implementation: first prepare the sample data of Amazon Reviews 2023, and organize the products, reviews, ratings, prices, categories and other information into an analyzable data model. Later, we will gradually implement Cube semantic layer, Ask-Data interface, comment semantic enhancement, Agent calling and machine learning enhancement.

Back to topic · Ask-Data Agents / Semantic Layer 返回系列 · 语义原生智能问数系统落地实现系列: Next: Semantic-Native Ask-Data System Delivery (2): From Public Data to an Analysis-Ready Warehouse

Building a long-term knowledge base for enterprise AI systems.