A deep-dive into the APEXiA system — a production-grade compound AI stack that orchestrates multiple AI models, data pipelines, and business logic to run an end-to-end analytics, forecasting, and CRM platform for manufacturing. No data scientists required.
Compound AI Systems are architectures that combine multiple AI components — models, tools, orchestration layers, and data pipelines — into a coordinated whole that does more than any single model could.
The term was popularized by a February 2024 Berkeley AI Research (BAIR) blog post from Matei Zaharia and colleagues, and through 2024–2025 it became shorthand for the frontier of practical AI engineering. The insight is simple but profound: no single LLM is good at everything. A system that chains specialized models (a classifier here, a SQL generator there, a forecaster somewhere else), coordinated through deterministic logic and self-correction loops, tends to beat a monolithic prompt on reliability, accuracy, and cost.
Key principle: In a compound AI system, the value comes not from the individual components but from their composition. The architecture — how components interact, how errors are detected and corrected, how data flows between them — is the actual product.
Typical characteristics of compound systems:
Before compound architectures, the standard approach to building AI-powered applications was monolithic: write one elaborate system prompt and hope the LLM has enough context, enough reasoning ability, enough formatting discipline to do everything — classify, retrieve data, generate SQL, analyze results, and explain insights — in a single call.
This approach has fundamental limitations:
| Monolithic Prompt | The Reality |
|---|---|
| Everything in one system prompt | Prompt bloats past the model's effective context window; performance degrades non-linearly beyond ~5k-8k tokens of instructions |
| One model does classification + generation + analysis | LLMs are mediocre at both structured classification (low confidence) and complex SQL generation (hallucinated columns/joins); specialized prompts/models work better |
| No self-correction | If the SQL is wrong, the whole pipeline fails — there's no retry mechanism embedded in the architecture |
| No model specialization | Claude is great at SQL but slow/expensive for classification; Qwen is fast for classification but less reliable at complex queries. Using one model for both wastes money and performance |
| Black-box behavior | If results are wrong, you can't tell whether the failure was in intent understanding, SQL generation, data quality, or explanation |
The compound alternative: Separate concerns architecturally. Route each sub-task to the model best suited for it. Build deterministic error-detection into the pipeline. Make the architecture a first-class design artifact, not an afterthought.
APEXiA is not an experiment or a prototype. It is a production-grade compound AI system built by Ludwid Reyes for Harder SRL — a Dominican Republic construction materials factory — and designed from day one to be a template for multi-tenant deployment across dozens of other SMBs.
The system handles real business operations: inventory tracking, sales analytics, accounts receivable/payable, demand forecasting, churn prediction, and WhatsApp-based order intake. It runs entirely on a single box with two AMD Radeon AI PRO R9700 GPUs, serving a Qwen3.6-35B-A3B model locally via vLLM.
But more important than those numbers is how the pieces fit together. Below we decompose the entire architecture layer by layer, then zoom into the intelligence and orchestration mechanisms that make it all work.
Every compound AI system is only as good as its data layer. APEXiA's foundation is a carefully constructed ETL pipeline that mirrors a legacy Oracle database (called SIGAF) into a modern PostgreSQL instance.
Harder SRL's business operations (orders, inventory, payments) are managed in an Oracle database called SIGAF. The PostgreSQL copy is a faithful, read-only mirror of it: APEXiA never modifies raw SIGAF data, and all transformations happen at the view layer (Layer 1).
Hard rule embedded in the system: Raw tables (in schemas like cxc.*, fat.*, inv.*, cnt.*) are SIGAF-faithful mirrors. No UPDATE, INSERT, DELETE, or ALTER is ever applied to them. All transformations live in the ia.* (canonical) or bi.* (materialized views) layer. The ETL's TRUNCATE+INSERT refresh is the only legitimate raw-table mutation.
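The hard rule above is easy to enforce mechanically. The sketch below is one possible guard, not APEXiA's actual implementation: it flags any statement that mutates a table in a raw schema unless it is the ETL's own TRUNCATE or INSERT. The function name `violates_raw_rule` and the table names in the usage note are hypothetical, and the check only inspects the leading verb of a single statement.

```python
import re

# Schemas holding SIGAF-faithful raw mirrors (names taken from the text).
RAW_SCHEMAS = ("cxc", "fat", "inv", "cnt")

# Leading verb of a mutating statement, followed by a schema-qualified table.
_MUTATING = re.compile(
    r"^\s*(UPDATE|INSERT\s+INTO|DELETE\s+FROM|ALTER\s+TABLE|TRUNCATE(?:\s+TABLE)?)"
    r"\s+([A-Za-z_]\w*)\.",
    re.IGNORECASE,
)


def violates_raw_rule(sql: str, from_etl: bool = False) -> bool:
    """Return True if `sql` mutates a raw-schema table outside the ETL refresh."""
    match = _MUTATING.match(sql)
    if match is None:
        return False  # not a mutating statement on a schema-qualified table
    verb = match.group(1).upper()
    schema = match.group(2).lower()
    if schema not in RAW_SCHEMAS:
        return False  # ia.*, bi.* and other layers are free to change
    # The ETL's TRUNCATE+INSERT refresh is the one allowed raw-table mutation.
    if from_etl and verb.startswith(("TRUNCATE", "INSERT")):
        return False
    return True
```

For example, `violates_raw_rule("UPDATE cxc.facturas SET monto = 0")` is flagged, while the same guard lets `TRUNCATE TABLE inv.existencias` through when `from_etl=True`.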
The entire stack runs on a single PostgreSQL instance (Docker container my-postgres, port 5432). Each tenant gets their own database, role, and read-only user. This is the tenant_NNNN pattern: privacy-by-design numbering, so that a tenant who lists databases in the Postgres system catalogs sees only opaque numbers and cannot identify the other tenants.
The ETL runs in a separate Python environment (etl_env/venv) using scripts that connect to both the Oracle source and the PostgreSQL destination. The canonical migration orchestrator is promote_to_production.py, which handles schema drift detection and incremental column adds from SIGAF.
SIGAF is known to be incomplete in tenant_0001 — approximately 74% of Access-delivery clients don't appear in SIGAF sales, and many deliveries live off-book. The compound system accounts for this: the data layer doesn't pretend completeness, and downstream analytics are aware of the coverage gap. This isn't a bug; it's a known constraint that shapes how the AI interprets results.
Layer 1 sits between the raw SIGAF mirrors and the AI interfaces. It is the canonical abstraction layer — the ia.* schema — that makes multi-tenant scaling possible.
In a multi-tenant system, each tenant's source ERP looks different. SIGAF (the current source) has its own column names, table structures, and business conventions. Future tenants may use completely different ERPs. Layer 1 exists so that the AI layer never knows what ERP a tenant uses. It always talks to ia.v_ventas_detalle, ia.v_inventario_diario, etc. — columns and semantics that look the same regardless of what's underneath.
Sales transactions with product, client, date, margin
Daily inventory levels per SKU across all warehouses
Accounts receivable — customer balances and aging
Accounts payable — vendor obligations and aging
Operating expenses summarized by category and period
Demand forecast from IAxCientifico (XGBoost/Prophet)
Churn predictions with severity buckets and explanations
There are 17+ of these views in total, each mapping a business concept (sales, inventory, receivables) into a consistent column shape. The chatbot's few-shot examples, schema docstrings, and SQL prompts all assume ia.* is portable across tenants; only the ETL below it absorbs source-specific quirks.
Multi-tenant design principle: When adding a column or view shape, ask: "Would another tenant's data have this column under the same name?" If no, push the divergence into ETL, not the ia.* layer. This keeps the AI layer generic without needing tenant-specific routing.
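To make the portability idea concrete, here is a minimal sketch of how a per-tenant column map could be rendered into a canonical ia.* view. It is an illustration under stated assumptions, not the project's migration code: `build_canonical_view` is a hypothetical helper, and the source table and source column names in the usage note are invented. Only the view name and `monto_neto_rd` come from the article.

```python
import re

# Conservative identifier check so the rendered SQL cannot be hijacked.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_.]*$")


def build_canonical_view(view_name: str, source_table: str,
                         column_map: dict[str, str]) -> str:
    """Render a CREATE VIEW that renames source columns to canonical names.

    column_map maps canonical column name -> source column name.
    """
    if not column_map:
        raise ValueError("column_map must not be empty")
    for ident in (view_name, source_table, *column_map, *column_map.values()):
        if not _IDENT.match(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    select_list = ",\n    ".join(
        f"{source_col} AS {canonical_col}"
        for canonical_col, source_col in column_map.items()
    )
    return (
        f"CREATE OR REPLACE VIEW ia.{view_name} AS\n"
        f"SELECT\n    {select_list}\n"
        f"FROM {source_table};"
    )
```

A second tenant on a different ERP would reuse the same canonical keys with a different `source_table` and different values in `column_map`, which is the sense in which the AI layer stays unchanged.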
If Layer 1 is the "language" that both the business and AI share, Layer 2 is the predictive intelligence engine. This is IAxCientifico — the AutoML system that continuously improves forecast and churn models using real data.
The demand model uses two algorithms in tandem:
Both are wrapped in a PL/pgSQL function (train_demand_model) running inside Postgres. The model trains on every scheduled run, producing predictions that land in the v_pronostico_demanda view — making them automatically available to Layer 3's chatbot.
Bug fix that proved critical: XGBoost's n_jobs setting, left at its default of using every CPU core, made each training run take 63 minutes while pegging 30+ cores. The fix (adding 'n_jobs': 1) brought training from 63 minutes to 17 seconds. This is a good example of why compound AI needs careful integration between components: the data pipeline (Postgres), the ML library (XGBoost), and the GPU model (Qwen) all share infrastructure that must be calibrated together.
The churn model went through a fundamental re-architecting. The original version predicted churn using a target that was recency-circular: the churn label was defined from the same recency signal that the features already encoded, so the model largely re-derived its own label instead of predicting future behavior.
The fix reframed the task as a leading indicator: predict who goes dormant in the next 90 days, using only features observed before a temporal cutoff and a label taken from the window after it. The result:
The leading GBM is a HistGradientBoostingClassifier that ingests activity/RFM windows, decline trends, product breadth/HHI, margin, and seasonality. It refuses to ship if it can't beat the recency-only baseline, and it produces per-customer explanations in Spanish, worded for the sales reps who are its target audience. The model caught non-obvious drifters (clients 137 days silent on a 195-day cadence) that a simple recency rule would have missed.
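The temporal-cutoff framing can be sketched in a few lines. This is an illustrative reconstruction, not the production feature code: `build_example` and its three toy features are hypothetical, and only the 90-day horizon comes from the article. The point is structural: features read purchases on or before the cutoff, and the label reads only the 90 days after it.

```python
from datetime import date, timedelta
from typing import Optional

HORIZON_DAYS = 90  # "goes dormant in the next 90 days", per the text


def build_example(purchase_dates: list[date], cutoff: date) -> Optional[dict]:
    """Build one training row for a customer at a given temporal cutoff."""
    # Features: only purchases observed on or before the cutoff.
    history = sorted(d for d in purchase_dates if d <= cutoff)
    if not history:
        return None  # no pre-cutoff activity, nothing to learn from
    gaps = [(b - a).days for a, b in zip(history, history[1:])]

    # Label: only purchases strictly after the cutoff, within the horizon.
    horizon_end = cutoff + timedelta(days=HORIZON_DAYS)
    future = [d for d in purchase_dates if cutoff < d <= horizon_end]

    return {
        "days_since_last": (cutoff - history[-1]).days,
        "n_orders": len(history),
        "mean_gap_days": sum(gaps) / len(gaps) if gaps else None,
        "went_dormant": len(future) == 0,
    }
```

Because nothing after the cutoff can reach the feature columns, adding or removing a later purchase changes only `went_dormant`, which is the property a leakage check should assert.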
This is where APEXiA becomes genuinely compound. IAxCientifico's AutoML system doesn't just train models — it proposes new features autonomously.
The Qwen LLM (via the same vLLM endpoint, :8011) proposes new features in a constrained DSL: windowed aggregates, ratios, and deltas over monto/n_prod/margen. Each proposal includes a description in Spanish and a justification.
Each proposed feature is evaluated on an out-of-time split. A threshold gate (AUC lift ≥ +0.002) determines whether it passes. This prevents circularity: features that just memorize the training window are rejected.
Both tiers fire end-to-end. Qwen and Claude each propose 5 features per run. In one validated run, all were rejected (closest: +0.00193, just under the bar). The registry tracks provenance — who proposed what, when, and the evaluation result.
Features that consistently fail evaluation get self-disabled (enabled=FALSE). The system cleans up its own feature registry, keeping only the useful ones. This is a feedback loop that compounds improvement over time.
Key insight: The feature proposer isn't just "generating random ideas." It operates within a constrained leakage-safe DSL. The proposed features must follow strict rules (windowed aggregates over pre-cutoff windows, evaluated only over that window). This is compound intelligence: the LLM proposes, the deterministic evaluator validates, the database records.
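The "LLM proposes, deterministic evaluator validates" split comes down to a small gate. The sketch below assumes the baseline and candidate scores were both produced on the same out-of-time holdout; the function names are hypothetical and the pairwise AUC is a simple stand-in for whatever metric implementation the real system uses. Only the +0.002 threshold and the +0.00193 near miss come from the article.

```python
AUC_LIFT_THRESHOLD = 0.002  # gate value quoted in the text


def auc(labels: list[int], scores: list[float]) -> float:
    """ROC AUC as the chance a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def passes_gate(labels: list[int], baseline_scores: list[float],
                candidate_scores: list[float],
                threshold: float = AUC_LIFT_THRESHOLD) -> tuple[bool, float]:
    """Compare holdout AUC with and without the proposed feature."""
    lift = auc(labels, candidate_scores) - auc(labels, baseline_scores)
    return lift >= threshold, lift
```

Under this rule a proposal with a lift of +0.00193 is rejected, exactly like the closest candidate in the validated run described above.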
A fascinating design: the churn system doubles as a supply-chain procurement signal. Some finished items are manufactured only because one specific customer orders them (made-to-order demand). When that customer lapses, raw materials stop being procured. The resurrection model predicts which lapsed customers reactivate and when, enabling procurement planning with lead time awareness.
The detector identifies made-to-order items by ranking demand predictions by per-product WMAPE (Weighted Mean Absolute Percentage Error). The worst-forecast products are exactly the single-customer items:
This is compound AI at its best: the interaction between the demand forecasting system, the churn prediction system, and the procurement signal creates a business capability that none of the individual components could provide alone.
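The WMAPE ranking used by the made-to-order detector can be sketched as follows. This assumes the common definition of WMAPE, the sum of absolute errors divided by the sum of absolute actuals per product; the article does not state which variant the system uses, and `wmape_by_product` and `worst_forecast` are hypothetical names.

```python
from collections import defaultdict


def wmape_by_product(rows) -> dict:
    """rows: iterable of (product, actual, forecast). Returns {product: WMAPE}."""
    abs_err = defaultdict(float)
    abs_actual = defaultdict(float)
    for product, actual, forecast in rows:
        abs_err[product] += abs(actual - forecast)
        abs_actual[product] += abs(actual)
    # Products with zero total actual demand have undefined WMAPE and are skipped.
    return {p: abs_err[p] / abs_actual[p] for p in abs_err if abs_actual[p] > 0}


def worst_forecast(rows, top_n: int = 10) -> list:
    """Products ranked from worst to best forecast quality."""
    scores = wmape_by_product(rows)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

Products whose demand depends on one customer's lumpy orders tend to land at the top of this ranking, which is what lets the detector hand them to the churn and resurrection models.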
Layer 3 is the user-facing surface. It has two components that together serve all of Harder SRL's analytical and operational needs.
The flagship product. Users (sales reps, the factory owner, accountants) ask questions in Spanish about the business. The system translates that into SQL against the ia.* views, executes it, and returns a natural-language analysis in Spanish.
"¿Cuánto vendimos en mayo vs. abril?" or "¿Qué productos tienen inventario bajo?"
Emits INTENT:DOMAIN|CONFIDENCE|ALTERNATES — e.g. INTENT:VENTAS|HIGH (single domain). The classifier is itself a request to Qwen; vLLM prefix-caches each distinct system prompt, making repeated classification very fast.
CONFIDENCE=HIGH + single domain → fast path (scoped single-domain schema, one-shot SQL). Otherwise → hybrid path (union schema of primary + alternates, with self-correction retry on SQL failure).
One request to Qwen with the scoped schema docstring and few-shot examples. Temperature is clamped to 0.1 for near-deterministic output. The SQL is then cleaned: reasoning tags and markdown fences are stripped, and the text is anchored to the last SELECT.
SQL runs against tenant_0001 with search_path = ia, public. Read-only validation uses the harder_user role; writes use postgres superuser.
Results are fed back to Qwen for a natural-language summary in Spanish. Temperature is left at the configured 1.0 for a conversational tone. The results and their interpretation are returned to the user.
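The SQL-cleaning step in the pipeline above is a good example of deterministic glue around a model call. The sketch below is one plausible reading of it, with several assumptions: reasoning is wrapped in `<think>` tags, "anchored to the last SELECT" means the last SELECT that starts a line, and only the first statement is kept. The function name `clean_sql` is hypothetical.

```python
import re


def clean_sql(raw: str) -> str:
    """Reduce raw model output to a single SELECT statement."""
    # Drop reasoning blocks (assumed to be wrapped in <think> tags).
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL | re.IGNORECASE)
    # Drop markdown fence markers, with or without a language tag.
    text = re.sub(r"`{3}[A-Za-z]*", "", text)
    # Anchor to the last SELECT that begins a line; indented subqueries are
    # ignored. Limitation: a query that starts with WITH would lose its CTEs.
    starts = [m.start() for m in
              re.finditer(r"^SELECT\b", text, flags=re.IGNORECASE | re.MULTILINE)]
    if not starts:
        raise ValueError("no SELECT statement found in model output")
    # Keep only the first statement. Limitation: a semicolon inside a string
    # literal would cut the query short.
    statement = text[starts[-1]:].split(";", 1)[0].strip()
    return statement + ";"
```

Raising on output with no SELECT, instead of passing it to the database, keeps chat-style replies from ever reaching the execution step.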
| Feature | Standard (Qwen) | Premium (Claude) |
|---|---|---|
| Model | Qwen3.6-35B-A3B MoE (local, :8011) | Claude Sonnet (Anthropic API) |
| Architecture | Anchor Engine 2.0 (5-step pipeline: classify → route → SQL → execute → interpret) | Autonomous — Claude gets full schema + execute_sql tool, does everything in one shot |
| Cost | $0 (local GPU) | $0.018/message (est.) |
| Speed | ~100 tok/s single GPU | Variable, API-dependent |
| Self-correction | Yes — hybrid path has self-correct retry on SQL failure | Inherent — Claude can retry itself |
A complementary system that serves sales reps. Sales reps send orders via WhatsApp. The CRM parses them, manages seller cards, generates estado-de-cuenta (statement) PDFs, and pushes orders into the SIGAF system. This is compound in a different way: it combines LLM-powered intent extraction from WhatsApp messages with deterministic order processing and PDF generation.
Qwen parses WhatsApp messages to extract product codes, quantities, delivery dates. Deterministic validation follows to ensure all required fields are present.
Each seller gets a card showing their pipeline, recent orders, and account balance. Customers can request "estado de cuenta" — a PDF statement — which is generated on demand and delivered back via WhatsApp.
Validated orders are pushed into the Oracle SIGAF system via a dedicated connection. The push includes the FECHAENTREGA, IMPUESTO (18% ITBIS), and TASADECAMBIO fields, making the order fully operational in the legacy system.
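The deterministic validation that follows LLM extraction might look like the sketch below. The dictionary shape (`lines`, `product_code`, `quantity`, `delivery_date`) is a hypothetical contract for the extractor's output, chosen to match the three fields the article says are extracted; it is not the CRM's real schema.

```python
from datetime import date


def validate_order(extracted: dict) -> list[str]:
    """Return a list of problems; an empty list means the order may proceed."""
    problems = []
    lines = extracted.get("lines") or []
    if not lines:
        problems.append("no order lines")
    for i, line in enumerate(lines, start=1):
        if not line.get("product_code"):
            problems.append(f"line {i}: missing product_code")
        qty = line.get("quantity")
        # bool is a subclass of int, so it is excluded explicitly.
        if not isinstance(qty, (int, float)) or isinstance(qty, bool) or qty <= 0:
            problems.append(f"line {i}: quantity must be a positive number")
    if not isinstance(extracted.get("delivery_date"), date):
        problems.append("missing or invalid delivery_date")
    return problems
```

Returning every problem at once, instead of failing on the first, lets the bot ask the seller for all missing details in a single WhatsApp reply.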
The Anchor Engine is the routing core of the IAxAnalista system — the piece that makes the compound architecture work in practice. Without it, the system would just be a fancy SQL query builder. With it, the system has confidence-aware routing and self-correction on failure.
It operates on three signals from the classifier (domain, confidence, and alternates), which together decide the route and which ia.* view schema to inject into the SQL prompt.
The six intents (CHAT, DATA, ANALYSIS, DATA+ANALYSIS, FOLLOWUP, FOLLOWUP+DATA) further determine whether additional follow-up context is loaded from a session cache (TTL 30 min, max 200 sessions, 3-entry results ring buffer for multi-turn sequences).
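A session cache with those three limits fits in a small class. The sketch below is an assumption-laden illustration: the class name, the least-recently-used eviction policy, and the injectable clock are all hypothetical choices. Only the 30-minute TTL, the 200-session cap, and the 3-entry ring buffer come from the article.

```python
import time
from collections import OrderedDict, deque


class SessionCache:
    """Keeps the last few results per session, bounded in time and size."""

    def __init__(self, ttl_seconds=30 * 60, max_sessions=200, ring_size=3,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._max = max_sessions
        self._ring = ring_size
        self._clock = clock
        self._sessions = OrderedDict()  # session_id -> (last_seen, deque)

    def add_result(self, session_id, result):
        now = self._clock()
        self._evict_expired(now)
        _, results = self._sessions.pop(
            session_id, (None, deque(maxlen=self._ring)))
        results.append(result)  # a full deque drops its oldest entry
        self._sessions[session_id] = (now, results)  # most recently used
        while len(self._sessions) > self._max:
            self._sessions.popitem(last=False)  # evict least recently used

    def recent_results(self, session_id):
        self._evict_expired(self._clock())
        entry = self._sessions.get(session_id)
        return list(entry[1]) if entry else []

    def _evict_expired(self, now):
        expired = [sid for sid, (seen, _) in self._sessions.items()
                   if now - seen > self._ttl]
        for sid in expired:
            del self._sessions[sid]
```

Passing the clock in as a parameter is a testing convenience: a fake clock can jump 31 minutes forward to confirm that a session has expired.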
Why this is compound intelligence: The classifier doesn't just return a label; it returns a structured response that drives deterministic branching in the orchestrator. That orchestrator then assembles a scoped prompt, sends it back to the same model (Qwen), and, if execution fails, automatically retries with a corrected schema. This is a feedback loop that runs inside a single user request, spanning several model calls. A monolithic prompt has no equivalent.
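The structured response and the branching it drives can be sketched as a parser plus a routing function. The `INTENT:DOMAIN|CONFIDENCE|ALTERNATES` layout and the fast/hybrid rule come from the article; everything else is an assumption for illustration, including comma-separated alternates, the `MEDIUM` confidence label, and the `INVENTARIO` and `CXC` domain names in the usage note.

```python
from dataclasses import dataclass, field


@dataclass
class Classification:
    domain: str
    confidence: str
    alternates: list[str] = field(default_factory=list)


def parse_classifier_output(line: str) -> Classification:
    """Parse 'INTENT:DOMAIN|CONFIDENCE|ALTERNATES' (alternates optional)."""
    prefix = "INTENT:"
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError(f"unexpected classifier output: {line!r}")
    parts = [p.strip().upper() for p in line[len(prefix):].split("|")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"missing domain or confidence: {line!r}")
    alternates = []
    if len(parts) > 2:
        # Assumption: alternates arrive as a comma-separated list.
        alternates = [a.strip() for a in parts[2].split(",") if a.strip()]
    return Classification(parts[0], parts[1], alternates)


def choose_route(c: Classification) -> tuple[str, list[str]]:
    """Return the path name and the domains whose schemas go into the prompt."""
    if c.confidence == "HIGH" and not c.alternates:
        return "fast", [c.domain]  # scoped single-domain schema, one-shot SQL
    return "hybrid", [c.domain, *c.alternates]  # union schema, retry enabled
```

`INTENT:VENTAS|HIGH` routes to the fast path with only the sales schema, while `INTENT:VENTAS|MEDIUM|INVENTARIO,CXC` routes to the hybrid path with the union of all three.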
SQL generation failures are caught at execution time. When a query fails (column not found, table not found, type mismatch), the hybrid path triggers a self-correction retry:
Postgres returns an error (e.g. "column margen_pct does not exist").
The system has a _BLOCKED_COLUMNS registry (~5 top hallucination patterns) that maps common LLM hallucinations: margen_pct → porc_margen, monto_neto → monto_neto_rd, etc.
The corrected query executes. If it succeeds, the workflow continues to interpretation. If it fails again, the route falls through to the hybrid path with expanded schema.
This error-handling chain — parse, map, retry — is deterministic code, not a prompt trick. That's what makes compound AI reliable: the failure modes are understood, mapped, and handled programmatically.
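The parse, map, retry chain can be sketched as follows. The two column mappings are the ones quoted above; the function names, the error-message regex, and the `execute` callable are assumptions made for illustration, and a real implementation would catch the database driver's specific exception type instead of a bare `Exception`.

```python
import re

# Hallucinated column -> real column (the two pairs quoted in the text).
BLOCKED_COLUMNS = {"margen_pct": "porc_margen", "monto_neto": "monto_neto_rd"}

# Matches Postgres' undefined-column message, with or without quotes or alias.
_MISSING_COLUMN = re.compile(r'column "?([\w.]+)"? does not exist')


def correct_sql(sql: str, error_message: str):
    """Return a corrected query, or None if the error is not a known pattern."""
    match = _MISSING_COLUMN.search(error_message)
    if match is None:
        return None
    bad = match.group(1).split(".")[-1]  # drop a table alias such as "v."
    good = BLOCKED_COLUMNS.get(bad)
    if good is None:
        return None
    # Word boundaries keep monto_neto from matching inside monto_neto_rd.
    return re.sub(rf"\b{re.escape(bad)}\b", good, sql)


def run_with_self_correction(sql: str, execute):
    """Run the query; on a known hallucination, fix it and retry once."""
    try:
        return execute(sql)
    except Exception as exc:  # sketch only: narrow this to the driver's error
        fixed = correct_sql(sql, str(exc))
        if fixed is None:
            raise
        return execute(fixed)
```

The retry is bounded to a single attempt here, so an unknown or repeated failure propagates to the caller, which can then decide whether to escalate to a wider schema.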
One of the defining characteristics of a true compound AI system is using the right model for the right sub-task. APEXiA demonstrates this principle across four distinct AI models, each in its optimal role:
| Model | Role | Why This Model |
|---|---|---|
| Qwen3.6-35B-A3B (MoE, 3B active / 35B total) | Classification, SQL generation, natural-language interpretation | Runs locally on AMD Radeon R9700 GPUs via vLLM. Fast (~100 tok/s), cheap ($0 inference), large 262k context window. MoE architecture makes it efficient enough for real-time use. |
| Claude Sonnet | Premium tier alternative: autonomous SQL + analysis | Superior SQL generation on complex queries. Uses the Anthropic API (~$0.018/msg, estimated). An opt-in tier for users who need premium-grade accuracy. |
| Gradient-boosted trees (XGBoost; scikit-learn HistGradientBoostingClassifier) | Demand regression (XGBoost), churn prediction (HistGradientBoostingClassifier) | Tabular-data specialists, far better than any LLM at structured regression/classification. Trained on panel data from the ia.* views. |
| Prophet | Seasonal demand forecasting | Time-series specialist for trend + holiday seasonality. Bulk-horizon prediction (predicts entire future window in one call). Used in tandem with XGBoost for ensemble advantage. |
Why 4 models instead of 1: A single LLM cannot be the best classifier, SQL generator, forecaster, and churn predictor simultaneously. Each sub-task benefits from a model specialized for it. The orchestration — deciding which model handles which part of the pipeline — is the compound intelligence itself.
The model diversity extends to the AutoML layer too: the feature proposer uses Qwen (Standard) and Claude/Opus (Premium), evaluating proposals against XGBoost and Prophet models trained on actual production data. This is a meta-learning loop: the LLM proposes features, the ML models evaluate them, the evaluation results inform future proposals.
Perhaps the most sophisticated aspect of APEXiA's compound architecture is the AutoML system's ability to improve itself autonomously. The feature engineering loop doesn't just train a model — it maintains a living feature registry that grows, prunes, and evolves as new data arrives.
Each proposed feature is recorded in cientifico.demand_feature_registry with:
When a feature consistently fails evaluation (below the AUC lift threshold), the system automatically sets enabled=FALSE. This means:
The system also tracks which features are tied to exogenous data sources (weather_BCRD) that haven't been wired in yet. These are queued for re-proposal when those data sources become available. It's a waiting queue of ideas that the system holds and revisits when the precondition is met.
Self-healing AutoML is the "compound" multiplier: Each AutoML run can improve the feature set, which improves the model, which produces better demand forecasts and churn predictions, which feed back into the ia.* views, which the chatbot uses to give better answers. The system compounds improvements over time, a second sense of "compound" layered on top of the compositional one that gives compound AI its name.
Observability is critical in compound AI systems because failures can originate in any component. APEXiA incorporates monitoring at multiple levels:
| Monitoring Layer | Mechanism |
|---|---|
| API Health | /health endpoint returning version, active backends, classifier reachability, session count |
| Test Suite | ~130 tests across 11 suites (inventory, financial, mixed, multi, followup, forecast, churn, yoy, etc.). Post-ship regression gate. |
| Classifier Benchmarks | bench_classifier_v2.py: requires ≥95% format accuracy and ≥95% precision on HIGH-confidence predictions in production |
| Throughput Benchmarking | apexia_benchmark.sh — per-user and aggregate tok/s sweep at various concurrency levels, prompt size parameters |
| AutoML Health | Bug #3 NOTICE capture, Aggregate Trials Revert (auto-disabled), self-healing checks, grain watchdog (count(distinct entity_key) == count(*)) |
| ETL Health | detect_stuck_runs() and cleanup_stuck_runs() functions in Postgres, n8n execution watchdog |
Notice there are no traditional dashboards for API monitoring. The health is checked programmatically via script execution. This is consistent with the compound AI philosophy: observability should be automated, actionable, and integrated into the pipeline, not something an engineer has to actively look at.
Key principle: automate the observability loop. When a test fails, the BUGFIX_QUEUE.md is updated. When an AutoML run stalls, the watchdog detects it. When a classifier degrades, benchmarking flags it. The system monitors itself.
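As one concrete instance of an automated check, the grain watchdog's invariant (count(distinct entity_key) == count(*)) can be expressed directly. This sketch runs over rows already fetched into Python dictionaries; the function names are hypothetical, and the production check presumably runs as SQL inside Postgres.

```python
from collections import Counter


def check_grain(rows: list[dict], key: str = "entity_key") -> bool:
    """True when every row has a distinct key, i.e. one row per entity."""
    return len({row[key] for row in rows}) == len(rows)


def grain_violations(rows: list[dict], key: str = "entity_key") -> dict:
    """Keys that appear more than once, with their row counts."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}
```

A failed grain check usually means a join upstream has fanned out rows, which would silently inflate any aggregate the chatbot computes, so it is worth failing the run loudly.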
APEXiA was designed from the start to serve multiple tenants — different companies, each with their own data, ERP, and business logic. The compound architecture makes multi-tenancy clean:
Because the ia.* schema is portable, onboarding a new tenant is an ETL problem only — the compound AI architecture itself doesn't need to change. This is what makes the system genuinely scalable.
Here's how APEXiA (compound) compares to a monolithic system that tries to do the same thing in a single LLM call:
| Property | Monolithic Approach | APEXiA (Compound) |
|---|---|---|
| Architecture | One giant system prompt (~10k+ tokens) | Modular: classifier → router → SQL-gen → execute → interpret (5 explicit stages) |
| Error handling | Retry with the same prompt — same failure mode | Deterministic self-correction: parse error → map fix → retry with corrected schema |
| Model used | One model for everything — mediocre at all tasks | Qwen for classification/SQL, Claude for premium, XGBoost for tabular, Prophet for time-series |
| Forecasting | LLM tries to predict numbers in a prompt — unreliable | Dedicated ML pipeline (XGBoost + Prophet), AutoML-proposed features, evaluation gates |
| Churn prediction | LLM analyzes past interactions — circular, leaky | Leading GBM with temporal cutoff, out-of-time validation, per-customer explanations |
| Multi-tenant | Tenant-specific prompt tweaks or separate giant prompts | Portable ia.* schema, ETL absorbs source differences, zero AI-layer changes |
| Self-improvement | Manually prompt-engineer better few-shot examples | AutoML proposes features, evaluates them, auto-disables failures, maintains registry |
| Observability | Hope the LLM formatted correctly | ~130 tests across 11 suites, classifier benchmarks, throughput benchmarks, stuck-run watchdogs |
| Cost | High — one expensive model doing everything | $0 for inference (local Qwen), Claude Premium opt-in for edge cases (~$0.018/msg) |
| Reliability over time | Drifts — few-shot examples rot, model capabilities shift | AutoML compounding features, view snapshots, CI/CD test gate before shipping |
The difference isn't just technical — it's philosophical. A monolithic approach treats the LLM as a universal problem-solver. A compound approach treats the LLM as one component among many, optimizing the overall system's reliability, cost, and correctness.
APEXiA's design decisions reflect a clear philosophy that shapes its compound architecture:
The default stack is entirely open source: Qwen model (local), vLLM (inference server), PostgreSQL (database), n8n (orchestration), Scikit-learn/XGBoost/Prophet (ML libraries). Paid APIs (Anthropic Claude) are a removable edge — used only when the local stack genuinely can't handle the task. This keeps costs near zero and prevents vendor lock-in.
All inference runs locally on the operator's own hardware: two AMD Radeon AI PRO R9700 GPUs with ROCm/vLLM serving. The model is Qwen3.6-35B-A3B-MXFP4 (4-bit quantized MoE), running TP2 across both cards with MTP speculative decoding (34.7 → 76.7 tok/s, 2.2× speedup). This means:
This inversion — local model as default, expensive API as opt-in — is the opposite of most AI startups. It reflects a pragmatic understanding: for a Dominican Republic SMB, cost predictability and data privacy matter more than squeezing out the last 3% of accuracy.
Infrastructure reality check: The second R9700 is currently offline because of a BIOS/firmware enumeration issue. The single-GPU stack is fully operational, so there is no production impact, just a throughput cap. This is a hardware limitation, not an architectural one: the compound design survives even partial hardware degradation.
APEXiA isn't a toy project or a proof-of-concept. It's a live production system serving real business operations with real data. It supports real decisions (procurement planning, credit risk, demand forecasting, sales strategy), prompted by ordinary people asking questions in plain Spanish.
But what makes it genuinely noteworthy as an example of Compound AI isn't that it works (there are many working AI systems). What makes it noteworthy is how it's composed:
This is what Compound AI should look like in production. Not a chatbot with a fancy prompt — a coordinated system of specialized components, connected by deterministic logic, monitored by automated tests, and capable of self-improvement over time.
The bottom line: If you're building an AI application today, your architecture matters more than your prompts. Choose the right model for each sub-task, handle errors programmatically, build in self-correction, and don't try to do everything in one LLM call. Compound AI isn't a buzzword — it's the engineering practice of building systems that survive contact with the real world.
APEXiA proves this principle in action. It replaces what would have been a team of analysts, a data scientist, and a BI developer, and it does so while running on a single box with two workstation-class GPUs, costing almost nothing in operational expenses, and serving the specific business logic of a Dominican Republic construction materials factory.
That's compound AI. Not theory. Not a research paper. Shipped software.
Built by Ludwid Reyes · APEXiA · Miami, FL — serving Latin America
v17.1-qwen-unified · 2026-05-31
— The system you're reading about is currently live and serving production traffic at localhost:8101 —