← Back to Technical blog

Technical article

Hengshi NL2Metrics Technical Deep Dive (2026): The Key Path to ChatBI Accuracy Breakthrough

Deep dive into Hengshi's NL2Metrics technical solution: solving the NL2SQL accuracy ceiling through the metrics semantic layer, enabling ChatBI to evolve from 'approximately correct' to 'precisely reliable'.

Jun 2, 2026Technical blogHENGSHI5 min read
ChatBINL2MetricsNL2SQLMetrics Semantic LayerHENGSHIHengshi Technology
Hengshi NL2Metrics Technical Deep Dive (2026): The Key Path to ChatBI Accuracy Breakthrough

Article body

Full article

1. The Accuracy Ceiling of NL2SQL

NL2SQL (Natural Language to SQL) is the core engine of ChatBI. Users ask questions in natural language, AI translates them into SQL statements, executes them in the database, and returns results. Theoretically perfect, but actual results often fall short.

The industry commonly reports NL2SQL accuracy rates between 70%-85%, depending on data complexity and question types. But in enterprise scenarios, this accuracy rate is insufficient—1 in 5 queries is wrong, and business users quickly lose trust.

NL2Metrics Technical Architecture

NL2SQL faces four main challenge categories:

1.1 Schema Understanding Challenge

Enterprise databases may have hundreds of tables and thousands of fields. AI needs to accurately understand which tables and fields are relevant to the current question. Table and field names are often abbreviations or pinyin (e.g., xsdd represents sales order details), which places high demands on AI’s understanding ability.

1.2 Semantic Ambiguity Challenge

  • Is “sales revenue” with or without tax?
  • Is “active user” 7-day or 30-day active?
  • Is “last month” a natural month or a rolling 30 days?

These ambiguities are almost impossible to handle correctly without context.

1.3 Complex Query Challenge

Multi-table joins, subqueries, window functions, conditional aggregation—when query complexity increases, NL2SQL accuracy drops sharply.

1.4 Metric Drift Challenge

As the business changes, calculation logic for metrics like “GMV” may change. If AI relies on historical logic from training data, it will give outdated or incorrect answers.


2. NL2Metrics: From “Translating SQL” to “Translating Metrics”

The core idea of Hengshi’s NL2Metrics solution is: don’t directly translate natural language to SQL, but first translate it into “metric queries.” There’s an additional layer in the middle—the Metrics Semantic Layer, defined by HQL.

Comparing the two paths:

PathStep 1Step 2Step 3
NL2SQLNatural language → SQLSQL → DatabaseDatabase → Result
NL2MetricsNatural language → MetricMetric → HQLHQL → Database → Result

The key difference is in Step 2: NL2SQL directly maps to physical tables and fields, while NL2Metrics first maps to business metrics. Business metrics are pre-defined, reviewed, and have clear calculation logic—this fundamentally solves the “inconsistent logic” problem.


3. Technical Architecture of the Metrics Semantic Layer

Hengshi’s metrics semantic layer consists of the following components:

3.1 HQL Metric Definitions

Each metric is defined by an HQL expression describing its business calculation logic. HQL is Hengshi’s self-developed query language, designed to maintain expressiveness while being easier for AI to understand.

Example (simplified):

DEFINE METRIC MAU AS
  COUNT(DISTINCT user_id)
  WHERE event_date >= date_trunc('month', today() - interval '1 month')
    AND event_date < date_trunc('month', today())
    AND user_status = 'active'
    AND is_test_user = false

This definition describes business logic without concern for underlying table names and field names. Even if the underlying table is renamed from user_events to events_2026, the HQL definition doesn’t need to change as long as the business logic remains the same.

3.2 Metric Subject Areas

Metrics are organized by business subjects (such as Sales, Finance, Users). When a user asks a question, the system first locates the relevant subject area based on the question content, narrowing the metric search scope and improving matching accuracy.

3.3 Metric Lineage

Records the upstream and downstream dependencies of each metric. When metric logic changes, the impact can be assessed through lineage tracking.

3.4 Vector Retrieval Index

Metric definitions and descriptions are vectorized through Embedding models and stored in a vector database. When users ask questions, the system uses vector similarity retrieval to find the best-matching metric definitions.


4. End-to-End Query Flow

A complete NL2Metrics query flow is as follows:

  1. User asks: “How many active users were there last month?”
  2. Intent classification: Determine this is a “metric query” type question, route to NL2Metrics processing
  3. Metric retrieval: Use vector similarity to search the metric library and find the best-matching metric definition (e.g., MAU)
  4. Metric parsing: Inject time modifiers from the user’s question (“last month”) into the metric definition
  5. HQL generation: Generate HQL queries based on metric definitions
  6. SQL conversion: Convert HQL to the target database’s SQL
  7. Execution and return: Execute in the database and return results

5. Synergy with RAG

NL2Metrics is essentially also a RAG (Retrieval-Augmented Generation) application—the retrieval is of metric definitions, enhancing the accuracy of query generation.

Compared to general-purpose RAG solutions, Hengshi’s NL2Metrics has several optimizations:

5.1 Domain-Specific Embedding Model

Fine-tuned for the semantic characteristics of BI metrics and query statements, with more accurate understanding of business terms (like “GMV,” “customer unit price,” “retention rate”).

5.2 Hybrid Structured + Semantic Retrieval

First narrow the search space through structured rules (such as metric subject areas, permission scopes), then use vector similarity for precise ranking. This is more controllable and explainable than pure vector retrieval.

5.3 Query Intent Classification Processing

The system distinguishes between different types of query intents and adopts different processing paths:

  • Metric queries: Route to NL2Metrics, precisely match metric definitions
  • General Q&A: Route to large model conversations, generate natural language answers
  • Hybrid queries: First try metric matching, downgrade to large model if matching fails

6. Enterprise Implementation Recommendations

6.1 Build a Good Metric Dictionary First

Define at least the Top 50 core business metrics clearly. There’s no need to define all metrics at once—start with the most commonly used ones.

Priority recommendations:

  1. The 10 KPIs most focused on by executives (revenue, profit, growth rate, etc.)
  2. The 20 metrics most focused on by business unit heads
  3. The 20 metrics most commonly used in daily operational analysis

6.2 Standardize Metric Naming

Use the names that business users use daily as metric aliases, not database field names.

Good naming: Sales Revenue (Excluding Tax), Monthly Active Users, Customer Retention Rate (30 days) Bad naming: sales_amt_net, mau_30d, retention_rate

6.3 Accumulate User Query Logs

Collect real user queries in ChatBI, analyze high-frequency questions and mismatched cases, and continuously optimize metric definitions and matching strategies.

Log analysis points:

  • Which questions are frequently misunderstood? (Metric definition descriptions may be unclear)
  • Which metrics are asked about using different expressions? (Need to add these expressions to metric aliases)
  • Which queries can NL2Metrics not handle? (May need to expand the metrics semantic layer or improve intent understanding)

6.4 Set Fallback Mechanisms

When NL2Metrics cannot match an appropriate metric, it should provide clear feedback and guidance rather than “guessing.”

Fallback strategies:

  • Return the top 3 most likely matching results for the user to choose from
  • Prompt the user: “No exact match found. Did you mean one of these?”
  • Provide a “manually select metric” fallback path

7. Accuracy Evaluation Framework

7.1 Evaluation Dimensions

DimensionMetricTarget
Metric recognition accuracyUser questions correctly match to metrics>95%
Time condition parsing accuracyTime modifiers like “last month” are correctly parsed>95%
Logic consistencyCalculation results are consistent with business definitions100%
User satisfactionUser satisfaction with answers>90%

7.2 Continuous Evaluation Mechanism

  • Weekly sampling evaluation: Randomly sample 50-100 user queries and manually evaluate accuracy
  • Problem classification analysis: Classify error cases (metric recognition errors, logic errors, condition parsing errors…), improve specifically
  • A/B testing: Conduct A/B testing on matching algorithms and Embedding models, select better options

8. Summary: NL2Metrics is Not a Replacement for NL2SQL

NL2Metrics is not a replacement for NL2SQL, but adds a “semantic guardrail” before NL2SQL. This guardrail ensures AI doesn’t “freely improvise” metric logic, enabling ChatBI to evolve from “approximately correct” to “precisely reliable.”

For enterprises pursuing data accuracy, this is the key prerequisite for large-scale ChatBI deployment. ChatBI without a metrics semantic layer guarantee can only ever remain at the Demo stage.

Hengshi’s NL2Metrics solution, through technologies like HQL metrics semantic layer, vector retrieval, and intent classification processing, provides a practical ChatBI accuracy guarantee solution. For enterprises deploying or preparing to deploy ChatBI, this approach is worth serious consideration.

HENGSHI SENSE

Resources, ecosystem, and implementation stories

Explore how teams design and ship analytics with HENGSHI.

Request a trial

Enterprise deployment, embedded delivery, and trial requests can all be handled quickly.