30x Cross-Source Query Acceleration: Deep Dive into HENGSHI BI Multi-Source Heterogeneous Data Association Technology

Article body

Full article

13 min read

In enterprise BI construction, data is often scattered across multiple heterogeneous sources such as MySQL, PostgreSQL, ClickHouse, and Hive. Traditional cross-source analysis requires aggregating all data in an ETL layer to a unified data warehouse, bringing enormous storage costs and data latency. HENGSHI SENSE’s “heterogeneous filtering” technology achieves a qualitative leap in cross-source query efficiency while maintaining data real-time performance through an innovative three-layer architecture. This article deeply dissects its technical principles and optimization strategies.

Heterogeneous Filtering Three-Layer Architecture

1. The Classic Dilemma of Cross-Source Queries

There are three traditional approaches to cross-source queries in enterprises, each with significant limitations.

The first approach is full ETL aggregation to a unified data warehouse. MySQL transaction data, PostgreSQL customer data, and Hive behavior logs are all synchronized to a ClickHouse data warehouse via ETL pipelines, then queried by BI. The advantage is fast query speed and unified management, but the drawbacks are prominent: high storage costs (data is redundant in multiple places), large data latency (T+1 or longer), high ETL pipeline maintenance costs, and long full-table sync times.

The second approach is Federated Query. The BI query engine distributes queries to each data source, then merges and returns the results. The advantage is real-time data with no redundant storage, but the disadvantages are obvious: every query requires cross-network access to multiple data sources, performance is extremely poor during large-volume result merging, and each data source has unbalanced query capabilities (MySQL aggregation is slow but ClickHouse aggregation is fast, creating a bottleneck effect).

The third approach is Data Virtualization. A logical unified view is provided through a virtualization layer, with queries pushed down to each data source for execution. The advantage is logical unity without physical data movement, but cross-source JOINs are nearly infeasible, the query optimizer struggles with global optimization, and complex queries are prone to errors.

A typical pain point scenario: a retail company’s orders are in the MySQL transaction database, customer tags are in the PostgreSQL CRM database, and conversion data is in the ClickHouse analytics database. The business needs to answer “What was the first-order conversion rate of new customers in East China last month?” The traditional approach requires first synchronizing MySQL and PostgreSQL data to the data warehouse (at least T+1), then completing JOIN and aggregation in the warehouse, taking at least a full day. The ideal approach should filter East China customers in PostgreSQL (local computation completes quickly), query these customers’ first orders in MySQL (local computation completes quickly), perform aggregation in ClickHouse (local computation completes quickly), and finally merge results at the BI layer (completed in seconds).

2. Hengshi’s Heterogeneous Filtering Three-Layer Architecture

Hengshi’s solution adopts a three-layer architecture design.

Layer 1: Query Decomposition Engine (Coordinator). Responsible for decomposing cross-source queries into multiple single-source sub-queries. Taking a typical query “Monthly sales of VIP customers by region” as an example, the query decomposition engine decomposes the original query into three steps: Step 1, query VIP customer IDs in PostgreSQL (CRM database), getting a list of IDs; Step 2, pass this ID list to MySQL (transaction database), executing local condition filtering and aggregation; Step 3, merge results at the coordinator layer, completing the final grouping and aggregation.

The key optimization strategy transforms large-table JOINs into “small-table filtering plus ID list passing plus large-table filtered queries,” avoiding cross-source transmission of large amounts of raw data. The traditional approach requires the coordinator layer to complete JOIN operations on 50 million rows of data, while Hengshi’s approach pushes most computation down to data source local execution, with the coordinator layer only needing to handle a small amount of post-aggregated data.

Layer 2: Intelligent Query Pushdown. Each data source adapter recognizes the query capabilities of the source database, pushing computable operations down to the data source for execution. The decision logic for query pushdown is: single-table fields in WHERE conditions can be pushed to the data source, cross-table fields can only be processed at the coordinator layer; single-table aggregations in aggregate functions can be pushed to the data source, cross-table aggregations can only be processed at the coordinator layer; for JOIN operations, same-source JOINs are pushed to the data source, small-table JOIN large-table (small table under 10,000 rows) adopts the ID list passing strategy, two large tables still require ETL aggregation to the data warehouse; for ORDER BY and LIMIT, single-source sorting is pushed to the data source, cross-source sorting is processed at the coordinator layer.

Layer 3: Intelligent Result Caching. The caching strategy dynamically selects based on query characteristics: frequently queried dimension combinations are proactively cached, queries with highly variable filter conditions are not cached, historical data is cached long-term, and today’s data is cached short-term or not cached.

3. Cross-Source Query Performance Comparison

Performance comparison under a typical test environment (MySQL orders table 50 million rows, PostgreSQL customers table 2 million rows, ClickHouse behavior logs 2 billion rows, Gigabit LAN):

Single-table filtered query: Traditional full ETL in the data warehouse requires 1.2 seconds, federated query at the source requires 0.8 seconds, and Hengshi heterogeneous filtering with source plus cache requires only 0.5 seconds.

Two-table cross-source association (small dimension table plus large fact table): Traditional full ETL is unavailable (T+1 latency), federated query requires 45 seconds, and Hengshi heterogeneous filtering requires only 3.2 seconds.

Three-dimension aggregation (two data sources): Traditional approach is unavailable, federated query exceeds 120 seconds, and Hengshi requires only 8.5 seconds.

Complex report (three data sources): Traditional approach is unavailable, federated query times out, and Hengshi requires only 15.3 seconds.

A concrete optimization example: querying “Monthly consumption trends of VIP customers in East China.” Before optimization (full merge approach), the execution plan was PostgreSQL scanning 2 million customers and then JOINing with MySQL’s 50 million orders at the coordinator layer, taking 48 seconds. After optimization (ID list passing plus pushdown aggregation approach), the execution plan changed to three steps: PostgreSQL filtering VIP customer IDs (5,456 records, 0.3 seconds), MySQL filtering orders with ID list and local aggregation (1.2 seconds), coordinator layer merging trends (0.1 seconds), total time only 1.6 seconds, a 30x performance improvement.

4. Practical Configuration Essentials

When creating a cross-source dataset, you need to specify the dataset type as cross-source, define the role of each data source (dimension table or fact table), key fields and association relationships, and select the association strategy (ID list passing) and caching strategy (intelligent caching).

Query optimization parameters include: maximum ID list size (default 100,000 records, query by sharding when exceeding the limit), priority pushdown strategy (push operations to data sources as much as possible), parallel query count (multiple data sources can query in parallel to improve efficiency), and result cache validity period.

Cross-source query performance monitoring can track the following metrics: total query count, average response time, P95 response time, cache hit rate, pushdown rate, and average response time and query count for each data source.

5. Applicable Boundaries of Heterogeneous Filtering

Suitable scenarios include: star schema models with small dimension tables plus large fact tables, scenarios that don’t require real-time JOIN precision to each row (ID list approximation is acceptable), and scenarios where each data source’s query latency is acceptable (not exceeding 5 seconds).

Unsuitable scenarios include: JOINs between two large tables both exceeding 10 million rows, scenarios requiring row-level precise association (such as financial reconciliation), and scenarios where data source query latency is too high (exceeding 10 seconds). For these scenarios, it is still recommended to aggregate data into a unified analytics data warehouse via ETL pipelines.

The best practice is to adopt a layered data architecture: hot data (recent 7 days) queries source databases in real-time via heterogeneous filtering, warm data (recent 3 months) uses heterogeneous filtering with caching, and cold data (older than 3 months) uses ETL aggregation to the data warehouse for batch queries. This strategy balances data real-time performance with system resource consumption.

6. FAQ

Q1: What is the core difference between heterogeneous filtering and federated query?

Federated query distributes SQL verbatim to each data source for execution, resulting in “parallel queries from each source but slow merging.” The core of heterogeneous filtering is “query decomposition plus ID list passing plus intelligent pushdown,” breaking one cross-source large query into multiple small queries, each able to complete efficiently at the source database. Like moving — federated query is like moving all furniture to one place first and then organizing it, while heterogeneous filtering is sorting and packing in place in each room.

Q2: Is there a limit on the ID list passing length?

Yes. The default maximum is 100,000 records. If the filtered ID list exceeds this number, the list is sharded (100,000 per shard) and queried in batches with merging afterward. If the ID list exceeds 1 million, it is recommended to synchronize the dimension table to the database where the fact table is located to avoid ID transmission becoming a bottleneck.

Q3: Does this approach conflict with the Data Mesh philosophy?

No conflict. Heterogeneous filtering solves “cross-source queries at the technical level,” while Data Mesh focuses on “data ownership at the organizational level.” The two can be used in combination — Data Mesh defines data ownership and interface standards, and heterogeneous filtering provides the technical implementation for cross-domain queries.

Conclusion

Heterogeneous data association is one of the most challenging technical problems in BI engineering. Hengshi’s heterogeneous filtering solution is essentially an approach of “using algorithmic optimization to replace data movement” — instead of moving all data to one place for analysis, let each data source do what it does best, with intelligent coordination at the query layer. This approach not only reduces dependence on ETL pipelines but also significantly improves the real-time performance of BI queries.

Enterprise BI PaaS

Metrics Management

Data Integration

Enterprise Reports

HENGSHI CLI

HENGSHI JARVIS

HENGSHI BOX

AI Agent for BI & Analytics

WhaleOps

Yunqi

Denodo × Zhipu AI

Mingdao Cloud

Sangfor

Documentation

API Reference

Academy

Whitepapers

Blog

About HENGSHI

News