The Compound AI Architecture That Replaces an Entire Data Team

A deep-dive into the APEXiA system — a production-grade compound AI stack that orchestrates multiple AI models, data pipelines, and business logic to run an end-to-end analytics, forecasting, and CRM platform for manufacturing. No data scientists required.

Contents

  1. What Are Compound AI Systems?
  2. The Problem with Monolithic AI
  3. APEXiA: A Compound AI Architecture by Design
  4. Layer 0 — The Data Foundation (ETL Pipeline)
  5. Layer 1 — The Canonical Schema Layer
  6. Layer 2 — The Intelligence Layer (IAxCientifico)
  7. Layer 3 — The Interface Layer (IAxAnalista & IAxCRM)
  8. Layer 4 — The Orchestration Layer (n8n)
  9. The Anchor Engine: Compound Intelligence in One API Call
  10. Multi-Model Orchestration: Qwen + Claude + XGBoost + Prophet
  11. Self-Healing & Self-Cleaning AutoML Loops
  12. Observability Without Dashboards (Self-Monitoring)
  13. Multi-Tenant Privacy-by-Design
  14. Monolithic AI vs Compound AI: A Side-by-Side
  15. Design Philosophy: Open-Source-First, Local-by-Default
  16. Conclusion: Why This Matters for Everyone Building AI

01 What Are Compound AI Systems?

Compound AI Systems are architectures that combine multiple AI components — models, tools, orchestration layers, and data pipelines — into a coordinated whole that does more than any single model could.

The term was popularized in 2024 by researchers at Berkeley AI Research (BAIR), among others, and has since become a reference point for practical AI engineering. The insight is simple but profound: no single LLM is good at everything. A system that chains specialized models (a classifier here, a SQL generator there, a forecaster somewhere else), coordinated through deterministic logic and self-correction loops, can outperform a monolithic prompt in reliability, accuracy, and cost.

Key principle: In a compound AI system, the value comes not from the individual components but from their composition. The architecture — how components interact, how errors are detected and corrected, how data flows between them — is the actual product.

Typical characteristics of compound systems:

02 The Problem with Monolithic AI

Before compound architectures, the standard approach to building AI-powered applications was monolithic: write one elaborate system prompt and hope the LLM has enough context, enough reasoning ability, enough formatting discipline to do everything — classify, retrieve data, generate SQL, analyze results, and explain insights — in a single call.

This approach has fundamental limitations:

Monolithic Prompt | The Reality
Everything in one system prompt | Prompt bloats past the model's effective context window; performance degrades non-linearly beyond ~5k-8k tokens of instructions
One model does classification + generation + analysis | LLMs are mediocre at both structured classification (low confidence) and complex SQL generation (hallucinated columns/joins); specialized prompts/models work better
No self-correction | If the SQL is wrong, the whole pipeline fails; there is no retry mechanism embedded in the architecture
No model specialization | Claude is great at SQL but slow/expensive for classification; Qwen is fast for classification but less reliable at complex queries. Using one model for both wastes money and performance
Black-box behavior | If results are wrong, you can't tell whether the failure was in intent understanding, SQL generation, data quality, or explanation

The compound alternative: Separate concerns architecturally. Route each sub-task to the model best suited for it. Build deterministic error-detection into the pipeline. Make the architecture a first-class design artifact, not an afterthought.

03 APEXiA: A Compound AI Architecture by Design

APEXiA is not an experiment or a prototype. It is a production-grade compound AI system built by Ludwid Reyes for Harder SRL — a Dominican Republic construction materials factory — and designed from day one to be a template for multi-tenant deployment across dozens of other SMBs.

The system handles real business operations: inventory tracking, sales analytics, accounts receivable/payable, demand forecasting, churn prediction, and WhatsApp-based order intake. It runs entirely on a single box with two AMD Radeon AI PRO R9700 GPUs, serving a Qwen3.6-35B-A3B model locally via vLLM.

6 IAx product lines (Dash, Analista, CRM, Cientifico, OR, DBA)
4 AI models in production (Qwen, Claude, XGBoost, Prophet)
17+ canonical database views in the ia.* schema
~130 automated tests across 11 test suites
$0 inference cost (Qwen runs locally on GPU)
2.2x throughput gain from MTP speculative decoding (34.7 → 76.7 tok/s)

But more important than those numbers is how the pieces fit together. Below we decompose the entire architecture layer by layer, then zoom into the intelligence and orchestration mechanisms that make it all work.

APEXiA Compound Architecture Overview

Layer 4 (Orchestration): n8n workflows for AutoML loops, weekly exec summaries, ETL scheduling
Layer 3 (Interface): IAxAnalista (NL-SQL chatbot), IAxCRM (WhatsApp order intake)
Layer 2 (Intelligence): IAxCientifico with XGBoost demand forecasting, GradientBoosting churn prediction, AutoML feature proposal
Layer 1 (Canonical Schema): ia.* views such as v_ventas_detalle, v_inventario_diario, v_cxc_detalle, v_churn_clientes, v_pronostico_demanda...
Layer 0 (Data Foundation): SIGAF (Oracle) → Postgres (my-postgres) via the ETL pipeline, TRUNCATE+INSERT refresh

04 Layer 0 — The Data Foundation (ETL Pipeline)

Every compound AI system is only as good as its data layer. APEXiA's foundation is a carefully constructed ETL pipeline that mirrors a legacy Oracle database (called SIGAF) into a modern PostgreSQL instance.

The Source of Record: SIGAF

Harder SRL's business operations (orders, inventory, payments) are managed in an Oracle database called SIGAF. The Postgres copy is a faithful, read-only mirror of it: APEXiA never modifies the mirrored SIGAF data, and all transformations happen at the view layer (Layer 1).

Hard rule embedded in the system: Raw tables (in schemas like cxc.*, fat.*, inv.*, cnt.*) are SIGAF-faithful mirrors. No UPDATE, INSERT, DELETE, or ALTER is ever applied to them. All transformations live in the ia.* (canonical) or bi.* (materialized views) layer. The ETL's TRUNCATE+INSERT refresh is the only legitimate raw-table mutation.
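The rule above can be enforced mechanically. The following is a minimal sketch, assuming a coarse string check is good enough as a first line of defense; the function names and the `from_etl` flag are hypothetical and not taken from the APEXiA codebase, and a real deployment would more likely rely on Postgres role privileges.

```python
import re

# Schemas treated as raw SIGAF mirrors (taken from the rule above).
RAW_SCHEMAS = ("cxc", "fat", "inv", "cnt")

# Leading verbs that would mutate a table.
_MUTATING = re.compile(
    r"^\s*(UPDATE|INSERT|DELETE|ALTER|TRUNCATE|DROP)\b", re.IGNORECASE
)


def touches_raw_schema(sql: str) -> bool:
    """True if the statement names any object in one of the raw schemas."""
    pattern = r"\b(?:%s)\.\w+" % "|".join(RAW_SCHEMAS)
    return re.search(pattern, sql, re.IGNORECASE) is not None


def is_forbidden(sql: str, from_etl: bool = False) -> bool:
    """Hypothetical guard: block raw-table mutations unless the ETL issued them.

    This is a string heuristic, not a SQL parser, so it can produce false
    positives (e.g. an UPDATE on ia.* that reads from a raw table).
    """
    if from_etl:
        return False
    return bool(_MUTATING.match(sql)) and touches_raw_schema(sql)
```

Reads from raw tables and view definitions built on top of them pass through; only statements that start with a mutating verb and name a raw schema are flagged.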

One Shared Postgres for All Tenants

The entire stack runs on a single PostgreSQL instance (Docker container my-postgres, port 5432). Each tenant gets its own database, role, and read-only user. This is the tenant_NNNN pattern: privacy-by-design numbering, so that a tenant who enumerates databases through the Postgres system catalogs sees only opaque numbers, not the names of other tenants.

# For tenant_0001 (Harder SRL):
docker exec -i my-postgres psql -U harder_user -d tenant_0001 -c "SELECT COUNT(*) FROM ia.v_ventas_detalle WHERE año = 2025;"

The ETL runs in a separate Python environment (etl_env/venv) using scripts that connect to both the Oracle source and the PostgreSQL destination. The canonical migration orchestrator is promote_to_production.py, which handles schema drift detection and incremental column adds from SIGAF.
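The internals of promote_to_production.py are not shown in this article. As an illustration of what schema drift detection involves, here is a minimal sketch that compares the column sets of a source table and its mirror; the function name and the example column names are hypothetical.

```python
def detect_drift(source_cols, dest_cols):
    """Compare column names of a source table and its mirror.

    Returns the columns that exist only in the source (candidates for an
    incremental add) and the columns that exist only in the mirror.
    """
    source, dest = set(source_cols), set(dest_cols)
    return {
        "add": sorted(source - dest),       # new in SIGAF, missing in the mirror
        "orphaned": sorted(dest - source),  # still in the mirror, gone from SIGAF
    }


# Usage with made-up column names:
drift = detect_drift(
    source_cols=["id", "fecha", "monto", "vendedor"],
    dest_cols=["id", "fecha", "monto"],
)
```

A set difference only detects added or dropped columns. Detecting type changes would need the column types as well, which this sketch leaves out.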

Important Caveat: Incomplete Data

SIGAF is known to be incomplete in tenant_0001 — approximately 74% of Access-delivery clients don't appear in SIGAF sales, and many deliveries live off-book. The compound system accounts for this: the data layer doesn't pretend completeness, and downstream analytics are aware of the coverage gap. This isn't a bug; it's a known constraint that shapes how the AI interprets results.

05 Layer 1 — The Canonical Schema Layer

Layer 1 sits between the raw SIGAF mirrors and the AI interfaces. It is the canonical abstraction layer — the ia.* schema — that makes multi-tenant scaling possible.

Why a Canonical Schema?

In a multi-tenant system, each tenant's source ERP looks different. SIGAF (the current source) has its own column names, table structures, and business conventions. Future tenants may use completely different ERPs. Layer 1 exists so that the AI layer never knows what ERP a tenant uses. It always talks to ia.v_ventas_detalle, ia.v_inventario_diario, etc. — columns and semantics that look the same regardless of what's underneath.

v_ventas_detalle: sales transactions with product, client, date, margin
v_inventario_diario: daily inventory levels per SKU across all warehouses
v_cxc_detalle: accounts receivable (customer balances and aging)
v_cxp_detalle: accounts payable (vendor obligations and aging)
v_gastos_resumen: operating expenses summarized by category and period
v_pronostico_demanda: demand forecast from IAxCientifico (XGBoost/Prophet)
v_churn_clientes: churn predictions with severity buckets and explanations

There are 17+ of these views in total, each mapping business concepts (sales, inventory, receivables) into a consistent column shape. The chatbot's few-shot examples, schema docstrings, and SQL prompts all assume ia.* is portable across tenants; only the ETL below it absorbs source-specific quirks.

Multi-tenant design principle: When adding a column or view shape, ask: "Would another tenant's data have this column under the same name?" If no, push the divergence into ETL, not the ia.* layer. This keeps the AI layer generic without needing tenant-specific routing.

06 Layer 2 — The Intelligence Layer (IAxCientifico)

If Layer 1 is the "language" that both the business and AI share, Layer 2 is the predictive intelligence engine. This is IAxCientifico — the AutoML system that continuously improves forecast and churn models using real data.

Demand Forecasting

The demand model uses two algorithms in tandem: XGBoost for tabular regression and Prophet for trend and seasonality.

Both are wrapped in a PL/pgSQL function (train_demand_model) running inside Postgres. The model trains on every scheduled run, producing predictions that land in the v_pronostico_demanda view — making them automatically available to Layer 3's chatbot.

Bug fix that proved critical: XGBoost's n_jobs setting was left at its default of using all CPU cores, and each training run took 63 minutes while pegging 30+ cores. Adding 'n_jobs': 1 brought training from 63 minutes to 17 seconds. This is a good example of why compound AI needs deep integration between components: the data pipeline (Postgres), the ML library (XGBoost), and the GPU-served model (Qwen) all interact through shared infrastructure that must be carefully calibrated.

Churn Prediction — From Recency Circular to Leading Indicator

The churn model went through a fundamental re-architecting. The original version predicted churn using a target that was recency-circular: the churn label was derived from the same recency signal that the model received as a feature, so the label leaked into the inputs.

The fix reframed to a leading indicator: predict who goes dormant in the next 90 days using only features observed before a temporal cutoff. The result:

0.842 baseline AUC (recency-only model)
0.875 leading GBM AUC (+0.033 improvement)
0.636 PR-AUC (precision-recall, critical for imbalanced data)
0.103 Brier score (well-calibrated probabilities)
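For readers less familiar with two of these metrics, here is a minimal plain-Python sketch of how ROC AUC and the Brier score are defined. It is for illustration only; the article does not say how APEXiA computes them, and a production pipeline would normally use a library implementation. The toy labels and scores are made up.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Quadratic in the number of rows, which is
    fine for illustration but not for large datasets.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


# Toy data: 1 = went dormant, 0 = stayed active.
labels = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, predicted)        # 3 of 4 positive/negative pairs ordered correctly
brier = brier_score(labels, predicted)  # lower is better; 0 is perfect
```

AUC only measures ranking quality, while the Brier score also penalizes probabilities that are badly calibrated, which is why the two are reported side by side.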

The leading GBM is a HistGradientBoostingClassifier that ingests activity/RFM windows, decline trends, product breadth/HHI, margin, and seasonality. It refuses to ship if it can't beat the recency-only baseline, and it produces per-customer explanations in plain Spanish written for sellers, the target audience. The model caught non-obvious drifters, such as clients 137 days silent on a 195-day cadence, that a simple recency rule would have missed.
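The leading-indicator framing can be sketched as a labeling function: features come only from purchases on or before the cutoff, and the label comes only from the 90 days after it. This is a minimal illustration, not the IAxCientifico implementation; the function name and the two features are hypothetical stand-ins for the much richer feature set described above.

```python
from datetime import date, timedelta


def build_example(purchase_dates, cutoff, horizon_days=90):
    """Build one (features, label) pair for a customer at a given cutoff.

    Features use only purchases on or before the cutoff. The label is 1
    when the customer makes no purchase in the horizon after the cutoff.
    Returns None for customers with no history before the cutoff.
    """
    before = [d for d in purchase_dates if d <= cutoff]
    if not before:
        return None
    horizon_end = cutoff + timedelta(days=horizon_days)
    after = [d for d in purchase_dates if cutoff < d <= horizon_end]

    features = {
        # Illustrative features only; both look strictly backwards in time.
        "days_since_last": (cutoff - max(before)).days,
        "n_orders_180d": sum(
            1 for d in before if d > cutoff - timedelta(days=180)
        ),
    }
    label = 0 if after else 1  # 1 = dormant during the next horizon
    return features, label


cutoff = date(2025, 6, 30)
active = build_example(
    [date(2025, 1, 10), date(2025, 5, 1), date(2025, 8, 15)], cutoff
)
dormant = build_example(
    [date(2025, 1, 10), date(2025, 5, 1), date(2025, 11, 1)], cutoff
)
```

Because nothing after the cutoff can reach the features, evaluating on a later cutoff than the one used for training gives an honest out-of-time estimate.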

AutoML: Self-Proposing Feature Engineering

This is where APEXiA becomes genuinely compound. IAxCientifico's AutoML system doesn't just train models — it proposes new features autonomously.

  1. Feature Proposer (Qwen): The Qwen LLM (via the same vLLM endpoint, :8011) proposes new features in a constrained DSL: windowed aggregates, ratios, and deltas over monto/n_prod/margen. Each proposal includes a description in Spanish and a justification.

  2. Feature Evaluator (Postgres + Python): Each proposed feature is evaluated on an out-of-time split. A threshold gate (AUC lift ≥ +0.002) determines whether it passes. This prevents circularity: features that just memorize the training window are rejected.

  3. Two-Tier Proposal (Qwen + Claude/Opus): Both tiers fire end-to-end. Qwen and Claude each propose 5 features per run. In one validated run, all were rejected (closest: +0.00193, just under the bar). The registry tracks provenance: who proposed what, when, and the evaluation result.

  4. Self-Healing (Auto-Disable Degrading Features): Features that consistently fail evaluation are automatically disabled (enabled=FALSE). The system cleans up its own feature registry, keeping only the useful ones. This is a feedback loop that compounds improvement over time.
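The gate and the auto-disable step are simple enough to sketch in a few lines. The threshold below is the one quoted above; everything else (the record fields, the feature name, and reading "consistently" as three failed evaluations in a row) is an assumption made for illustration.

```python
from dataclasses import dataclass

MIN_AUC_LIFT = 0.002   # threshold quoted in the article
MAX_FAILED_RUNS = 3    # assumption: "consistently" means 3 failures in a row


@dataclass
class FeatureRecord:
    """Hypothetical registry row; the real registry columns are not shown."""
    name: str
    proposed_by: str
    enabled: bool = True
    failed_runs: int = 0


def passes_gate(auc_lift: float) -> bool:
    """A feature passes only if its out-of-time AUC lift reaches the bar."""
    return auc_lift >= MIN_AUC_LIFT


def record_evaluation(feature: FeatureRecord, auc_lift: float) -> FeatureRecord:
    """Update the failure streak and disable the feature once it is too long."""
    if passes_gate(auc_lift):
        feature.failed_runs = 0
    else:
        feature.failed_runs += 1
        if feature.failed_runs >= MAX_FAILED_RUNS:
            feature.enabled = False
    return feature


# The near miss from the validated run: +0.00193 is below the +0.002 bar.
near_miss = passes_gate(0.00193)
```

Keeping the gate as deterministic code means the LLM can propose anything it likes without being able to talk its way past the evaluation.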

Key insight: The feature proposer isn't just "generating random ideas." It operates within a constrained leakage-safe DSL. The proposed features must follow strict rules (windowed aggregates over pre-cutoff windows, evaluated only over that window). This is compound intelligence: the LLM proposes, the deterministic evaluator validates, the database records.

Resurrection Model — Customer Reactivation + Procurement Signal

A fascinating design: the churn system doubles as a supply-chain procurement signal. Some finished items are manufactured only because one specific customer orders them (made-to-order demand). When that customer lapses, raw materials stop being procured. The resurrection model predicts which lapsed customers reactivate and when, enabling procurement planning with lead time awareness.

The detector identifies made-to-order items by ranking demand predictions by per-product WMAPE (Weighted Mean Absolute Percentage Error). The worst-forecast products are exactly the single-customer items:

Product 00000844: 1 customer, 100% concentration, WMAPE 0.87
Product 00000834: 98% one buyer (Pan American Gypsum)
Product 00000808: 79% one buyer (project demand)
→ These are the items the aggregate demand forecast structurally misses
→ The resurrection model predicts their reactivation timing
→ Procurement plans raw material purchases accordingly

This is compound AI at its best: the interaction between the demand forecasting system, the churn prediction system, and the procurement signal creates a business capability that none of the individual components could provide alone.
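The two quantities behind the made-to-order detector, per-product WMAPE and customer concentration, can be written down directly. This is a minimal sketch: the function names and the two thresholds are assumptions chosen to be consistent with the examples above, not values taken from the system.

```python
def wmape(actuals, forecasts):
    """Weighted MAPE: total absolute error divided by total absolute demand."""
    total = sum(abs(a) for a in actuals)
    if total == 0:
        raise ValueError("WMAPE is undefined when total actual demand is zero")
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / total


def top_customer_share(units_by_customer):
    """Fraction of a product's volume bought by its single largest customer."""
    total = sum(units_by_customer.values())
    return max(units_by_customer.values()) / total


def is_made_to_order(share, error, min_share=0.75, min_wmape=0.5):
    """Hypothetical rule: one dominant buyer and a poorly forecast product."""
    return share >= min_share and error >= min_wmape


# Made-up monthly demand for one product, and a made-up customer split.
error = wmape(actuals=[100, 0, 50], forecasts=[40, 30, 50])
share = top_customer_share({"customer_a": 98, "customer_b": 2})
```

WMAPE is used here instead of plain MAPE because it stays defined when individual periods have zero demand, which is common for lumpy, project-driven items.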

07 Layer 3 — The Interface Layer (IAxAnalista & IAxCRM)

Layer 3 is the user-facing surface. It has two components that together serve all of Harder SRL's analytical and operational needs.

IAxAnalista — Natural Language to SQL Chatbot

The flagship product. Users (sales reps, the factory owner, accountants) ask questions in Spanish about the business. The system translates that into SQL against the ia.* views, executes it, and returns a natural-language analysis in Spanish.

Input: Spanish natural language

"¿Cuánto vendimos en mayo vs. abril?" or "¿Qué productos tienen inventario bajo?"

Step 1: Classifier (Qwen at :8011)

Emits INTENT:DOMAIN|CONFIDENCE|ALTERNATES — e.g. INTENT:VENTAS|HIGH (single domain). The classifier is itself a request to Qwen; vLLM prefix-caches each distinct system prompt, making repeated classification very fast.

Step 2: Router — Fast Path or Hybrid Path

CONFIDENCE=HIGH + single domain → fast path (scoped single-domain schema, one-shot SQL). Otherwise → hybrid path (union schema of primary + alternates, with self-correction retry on SQL failure).

Step 3: SQL Generation (Qwen at :8011)

One request to Qwen with the scoped schema docstring and few-shot examples. Temperature clamped to 0.1 for determinism. The SQL is then cleaned: reasoning tags and markdown fences are stripped, and the text is anchored to the last SELECT.

Step 4: Execution (Postgres)

SQL runs against tenant_0001 with search_path = ia, public. Read-only validation uses the harder_user role; writes use postgres superuser.

Step 5: Interpretation (Qwen at :8011)

Results are fed back to Qwen for a natural-language summary in Spanish. Temperature is left at the configured 1.0 for a conversational tone. The results and the interpretation are returned to the user.
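The deterministic glue between these steps can be sketched as three small functions: parse the classifier line, choose a path, and clean the generated SQL. This is an illustration under stated assumptions, not the Anchor Engine source. The assumptions are that alternates are comma-separated, that a missing confidence is treated as LOW, that reasoning tags look like `<think>...</think>`, and that "anchored to the last SELECT" means keeping the last semicolon-separated statement.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Classification:
    domain: str
    confidence: str
    alternates: list = field(default_factory=list)


def parse_classifier(line: str) -> Classification:
    """Parse e.g. 'INTENT:VENTAS|HIGH' or 'INTENT:VENTAS|MEDIUM|INVENTARIO'."""
    body = line.strip()
    if not body.upper().startswith("INTENT:"):
        raise ValueError("unexpected classifier output: %r" % line)
    parts = [p.strip() for p in body[len("INTENT:"):].split("|")]
    confidence = parts[1].upper() if len(parts) > 1 and parts[1] else "LOW"
    alternates = []
    if len(parts) > 2:
        alternates = [a.strip().upper() for a in parts[2].split(",") if a.strip()]
    return Classification(parts[0].upper(), confidence, alternates)


def choose_path(c: Classification) -> str:
    """HIGH confidence and a single domain take the fast path."""
    return "fast" if c.confidence == "HIGH" and not c.alternates else "hybrid"


_THINK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
_FENCE = re.compile(r"`{3}(?:sql)?", re.IGNORECASE)
_STMT_START = re.compile(r"^\s*(?:WITH|SELECT)\b", re.IGNORECASE | re.MULTILINE)


def clean_sql(raw: str) -> str:
    """Strip reasoning tags and fences, then keep the last SELECT statement.

    Splitting on ';' is a heuristic and would break on a semicolon inside
    a string literal.
    """
    text = _FENCE.sub("", _THINK.sub("", raw))
    candidates = []
    for chunk in text.split(";"):
        match = _STMT_START.search(chunk)
        if match:
            candidates.append(chunk[match.start():].strip())
    return candidates[-1] if candidates else ""
```

Failing closed matters in this kind of glue code: malformed classifier output raises, and anything short of a clean HIGH with a single domain lands on the hybrid path, which has the retry.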

Two AI Tiers

Feature | Standard (Qwen) | Premium (Claude)
Model | Qwen3.6-35B-A3B MoE (local, :8011) | Claude Sonnet (Anthropic API)
Architecture | Anchor Engine 2.0 (5-step pipeline: classify → route → SQL → execute → interpret) | Autonomous: Claude gets full schema + execute_sql tool, does everything in one shot
Cost | $0 (local GPU) | $0.018/message (est.)
Speed | ~100 tok/s single GPU | Variable, API-dependent
Self-correction | Yes: hybrid path has self-correct retry on SQL failure | Inherent: Claude can retry itself

IAxCRM — WhatsApp Order Intake

A complementary system for sales reps, who send orders via WhatsApp. The CRM parses the messages, manages seller cards, generates estado-de-cuenta (statement) PDFs, and pushes orders into the SIGAF system. This is compound in a different way: it combines LLM-powered intent extraction from WhatsApp messages with deterministic order processing and PDF generation.

💬

WhatsApp → LLM Intent Extraction

Qwen parses WhatsApp messages to extract product codes, quantities, delivery dates. Deterministic validation follows to ensure all required fields are present.

📊

Seller Cards + Estado-de-Cuenta PDFs

Each seller gets a card showing their pipeline, recent orders, and account balance. Customers can request "estado de cuenta" — a PDF statement — which is generated on demand and delivered back via WhatsApp.

📦

SIGAF Push

Validated orders are pushed into the Oracle SIGAF system via a dedicated connection. The push includes the FECHAENTREGA, IMPUESTO (18% ITBIS), and TASADECAMBIO fields, making the order fully operational in the legacy system.

08 The Anchor Engine: Compound Intelligence in One API Call

The Anchor Engine is the routing core of the IAxAnalista system — the piece that makes the compound architecture work in practice. Without it, the system would just be a fancy SQL query builder. With it, the system has confidence-aware routing and self-correction on failure.

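It operates on three signals from the classifier: the domain, the confidence level, and any alternate domains (the DOMAIN, CONFIDENCE, and ALTERNATES fields of the classifier output).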

The six intents (CHAT, DATA, ANALYSIS, DATA+ANALYSIS, FOLLOWUP, FOLLOWUP+DATA) further determine whether additional follow-up context is loaded from a session cache (TTL 30 min, max 200 sessions, 3-entry results ring buffer for multi-turn sequences).
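A session cache with those three limits can be sketched with the standard library alone. The numbers (30-minute TTL, 200 sessions, 3-entry ring buffer) come from the description above; the class name, method names, and the choice to evict the least recently used session when the cache is full are assumptions.

```python
import time
from collections import OrderedDict, deque


class SessionCache:
    """Hypothetical follow-up cache: TTL, bounded size, small result ring."""

    def __init__(self, ttl_seconds=30 * 60, max_sessions=200, ring_size=3,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._max = max_sessions
        self._ring = ring_size
        self._clock = clock
        self._sessions = OrderedDict()  # session_id -> (last_seen, results)

    def add_result(self, session_id, result):
        now = self._clock()
        self._evict_expired(now)
        _, results = self._sessions.pop(
            session_id, (now, deque(maxlen=self._ring))
        )
        results.append(result)                       # deque drops the oldest
        self._sessions[session_id] = (now, results)  # now most recently used
        while len(self._sessions) > self._max:
            self._sessions.popitem(last=False)       # drop least recently used

    def recent_results(self, session_id):
        self._evict_expired(self._clock())
        entry = self._sessions.get(session_id)
        return list(entry[1]) if entry else []

    def _evict_expired(self, now):
        expired = [sid for sid, (seen, _) in self._sessions.items()
                   if now - seen > self._ttl]
        for sid in expired:
            del self._sessions[sid]
```

Injecting the clock keeps the expiry logic testable without sleeping, and `deque(maxlen=3)` gives the ring-buffer behavior for free.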

Why this is compound intelligence: The classifier doesn't just return a label — it returns a structured response that drives deterministic branching in the orchestrator. That orchestrator then assembles a scoped prompt, sends it back to the same model (Qwen), and if execution fails, automatically retries with a corrected schema. This is a feedback loop within a single API call. No monolithic prompt does this.

Self-Correction Mechanism

SQL generation failures are caught at execution time. When a query fails (column not found, table not found, type mismatch), the hybrid path triggers a self-correction retry:

  1. Query Executes → Fails: Postgres returns an error (e.g. "column margen_pct does not exist").

  2. Error is Parsed → Mapped to Alias Fix: The system has a _BLOCKED_COLUMNS registry (~5 top hallucination patterns) that maps common LLM hallucinations to the real column names: margen_pct → porc_margen, monto_neto → monto_neto_rd, etc.

  3. Query Retries with Fixed Column Names: The corrected query executes. If it succeeds, the workflow continues to interpretation. If it fails again, the route falls through to the hybrid path with expanded schema.

This error-handling chain — parse, map, retry — is deterministic code, not a prompt trick. That's what makes compound AI reliable: the failure modes are understood, mapped, and handled programmatically.
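A minimal sketch of that parse, map, retry chain follows. The two alias pairs are the ones quoted above; the error-message pattern, the function names, and the `execute` callable are assumptions for illustration, and the real _BLOCKED_COLUMNS registry may be structured differently.

```python
import re

# Hallucinated column -> real column. Both pairs are quoted in the article.
_BLOCKED_COLUMNS = {
    "margen_pct": "porc_margen",
    "monto_neto": "monto_neto_rd",
}

# Matches messages such as: column "margen_pct" does not exist
_MISSING_COLUMN = re.compile(r'column "?([\w.]+)"? does not exist')


def apply_alias_fix(sql, error_message):
    """Return corrected SQL, or None when the error is not a known pattern."""
    match = _MISSING_COLUMN.search(error_message)
    if not match:
        return None
    bad = match.group(1).split(".")[-1]  # drop a table alias such as v.
    good = _BLOCKED_COLUMNS.get(bad)
    if good is None:
        return None
    # Word boundaries keep monto_neto from matching inside monto_neto_rd.
    return re.sub(r"\b%s\b" % re.escape(bad), good, sql)


def run_with_self_correction(execute, sql):
    """Run sql via the caller-supplied execute(); retry once with an alias fix."""
    try:
        return execute(sql)
    except Exception as exc:
        fixed = apply_alias_fix(sql, str(exc))
        if fixed is None:
            raise
        return execute(fixed)
```

The retry is bounded to a single attempt and only fires for errors in the registry, so an unknown failure surfaces immediately instead of looping.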

09 Multi-Model Orchestration: Qwen + Claude + XGBoost + Prophet

One of the defining characteristics of a true compound AI system is using the right model for the right sub-task. APEXiA demonstrates this principle across four distinct AI models, each in its optimal role:

Model | Role | Why This Model
Qwen3.6-35B-A3B (MoE, 3B active / 35B total) | Classification, SQL generation, natural-language interpretation | Runs locally on AMD Radeon R9700 GPUs via vLLM. Fast (~100 tok/s), cheap ($0 inference), large 262k context window. MoE architecture makes it efficient enough for real-time use.
Claude Sonnet | Premium tier alternative: autonomous SQL + analysis | Superior SQL generation on complex queries. Uses Anthropic API ($0.018/msg). Acts as a fallback for users who need premium-grade accuracy.
XGBoost / HistGradientBoostingClassifier | Demand regression, churn prediction | Tabular data specialists, far better than any LLM at structured regression/classification. Trained on panel data from the ia.* views.
Prophet | Seasonal demand forecasting | Time-series specialist for trend + holiday seasonality. Bulk-horizon prediction (predicts entire future window in one call). Used in tandem with XGBoost for ensemble advantage.

Why 4 models instead of 1: A single LLM cannot be the best classifier, SQL generator, forecaster, and churn predictor simultaneously. Each sub-task benefits from a model specialized for it. The orchestration — deciding which model handles which part of the pipeline — is the compound intelligence itself.

The model diversity extends to the AutoML layer too: the feature proposer uses Qwen (Standard) and Claude/Opus (Premium), evaluating proposals against XGBoost and Prophet models trained on actual production data. This is a meta-learning loop: the LLM proposes features, the ML models evaluate them, the evaluation results inform future proposals.

10 Self-Healing & Self-Cleaning AutoML Loops

Perhaps the most sophisticated aspect of APEXiA's compound architecture is the AutoML system's ability to improve itself autonomously. The feature engineering loop doesn't just train a model — it maintains a living feature registry that grows, prunes, and evolves as new data arrives.

The Feature Registry

Each proposed feature is recorded in cientifico.demand_feature_registry with:

Self-Healing: Auto-Disable Degrading Features

When a feature consistently fails evaluation (below the AUC lift threshold), the system automatically sets enabled=FALSE. This means:

Self-Cleaning: Registry Maintenance

The system also tracks which features are tied to exogenous data sources (weather_BCRD) that haven't been wired in yet. These are queued for re-proposal when those data sources become available. It's a waiting queue of ideas that the system holds and revisits when the precondition is met.

Self-healing AutoML is the "compound" multiplier: Each AutoML run improves the feature set, which improves the model, which produces better demand forecasts and churn predictions, which feed back into the ia.* views, which the chatbot uses to give better answers. The system compounds improvements over time, which gives "compound AI" a second meaning beyond composition.

11 Observability Without Dashboards (Self-Monitoring)

Observability is critical in compound AI systems because failures can originate in any component. APEXiA incorporates monitoring at multiple levels:

Monitoring Layer | Mechanism
API Health | /health endpoint returning version, active backends, classifier reachability, session count
Test Suite | ~130 tests across 11 suites (inventory, financial, mixed, multi, followup, forecast, churn, yoy, etc.). Post-ship regression gate.
Classifier Benchmarks | bench_classifier_v2.py: accuracy ≥95% format + ≥95% HIGH precision required on production
Throughput Benchmarking | apexia_benchmark.sh: per-user and aggregate tok/s sweep at various concurrency levels, prompt size parameters
AutoML Health | Bug #3 NOTICE capture, Aggregate Trials Revert (auto-disabled), self-healing checks, grain watchdog (count(distinct entity_key) == count(*))
ETL Health | detect_stuck_runs() and cleanup_stuck_runs() functions in Postgres, n8n execution watchdog

Notice there are no traditional dashboards for API monitoring. The health is checked programmatically via script execution. This is consistent with the compound AI philosophy: observability should be automated, actionable, and integrated into the pipeline, not something an engineer has to actively look at.

Key principle: automate the observability loop. When a test fails, the BUGFIX_QUEUE.md is updated. When an AutoML run stalls, the watchdog detects it. When a classifier degrades, benchmarking flags it. The system monitors itself.
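The grain watchdog from the AutoML Health row is a good example of a check that is cheap to automate. Its condition, count(distinct entity_key) == count(*), says that no entity appears twice. Below is a minimal Python equivalent over rows already fetched into memory; the function names and sample keys are hypothetical, and in practice the check would more likely run as SQL.

```python
from collections import Counter


def grain_violations(rows, key="entity_key"):
    """Return the keys that appear more than once, in sorted order."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def grain_ok(rows, key="entity_key"):
    """Equivalent to count(distinct key) == count(*) over the given rows."""
    return not grain_violations(rows, key)


rows = [
    {"entity_key": "C001", "monto": 120.0},
    {"entity_key": "C002", "monto": 75.5},
    {"entity_key": "C001", "monto": 120.0},  # duplicated by a bad join
]
duplicates = grain_violations(rows)
```

Returning the offending keys, not just a boolean, makes the failure actionable: the watchdog can log exactly which entities were duplicated.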

12 Multi-Tenant Privacy-by-Design

APEXiA was designed from the start to serve multiple tenants — different companies, each with their own data, ERP, and business logic. The compound architecture makes multi-tenancy clean:

Because the ia.* schema is portable, onboarding a new tenant is an ETL problem only — the compound AI architecture itself doesn't need to change. This is what makes the system genuinely scalable.

13 Monolithic AI vs Compound AI: A Side-by-Side

Here's how APEXiA (compound) compares to a monolithic system that tries to do the same thing in a single LLM call:

Property | Monolithic Approach | APEXiA (Compound)
Architecture | One giant system prompt (~10k+ tokens) | Modular: classifier → router → SQL-gen → execute → interpret (5 explicit stages)
Error handling | Retry with the same prompt, same failure mode | Deterministic self-correction: parse error → map fix → retry with corrected schema
Model used | One model for everything, mediocre at all tasks | Qwen for classification/SQL, Claude for premium, XGBoost for tabular, Prophet for time-series
Forecasting | LLM tries to predict numbers in a prompt, which is unreliable | Dedicated ML pipeline (XGBoost + Prophet), AutoML-proposed features, evaluation gates
Churn prediction | LLM analyzes past interactions: circular, leaky | Leading GBM with temporal cutoff, out-of-time validation, per-customer explanations
Multi-tenant | Tenant-specific prompt tweaks or separate giant prompts | Portable ia.* schema, ETL absorbs source differences, zero AI-layer changes
Self-improvement | Manually prompt-engineer better few-shot examples | AutoML proposes features, evaluates them, auto-disables failures, maintains registry
Observability | Hope the LLM formatted correctly | ~130 tests, classifier benchmarks, throughput benchmarks, stuck-run watchdogs
Cost | High: one expensive model doing everything | $0 for inference (local Qwen), Claude Premium opt-in for edge cases (~$0.018/msg)
Reliability over time | Drifts: few-shot examples rot, model capabilities shift | AutoML compounding features, view snapshots, CI/CD test gate before shipping

The difference isn't just technical — it's philosophical. A monolithic approach treats the LLM as a universal problem-solver. A compound approach treats the LLM as one component among many, optimizing the overall system's reliability, cost, and correctness.

14 Design Philosophy: Open-Source-First, Local-by-Default

APEXiA's design decisions reflect a clear philosophy that shapes its compound architecture:

Open-Source-First

The default stack is entirely open source: Qwen model (local), vLLM (inference server), PostgreSQL (database), n8n (orchestration), Scikit-learn/XGBoost/Prophet (ML libraries). Paid APIs (Anthropic Claude) are a removable edge — used only when the local stack genuinely can't handle the task. This keeps costs near zero and prevents vendor lock-in.

Local by Default

All inference runs locally on the operator's own hardware: two AMD Radeon AI PRO R9700 GPUs with ROCm/vLLM serving. The model is Qwen3.6-35B-A3B-MXFP4 (4-bit quantized MoE), running TP2 across both cards with MTP speculative decoding (34.7 → 76.7 tok/s, 2.2× speedup). This means:

The Model as the Default, Paid as Edge

This inversion — local model as default, expensive API as opt-in — is the opposite of most AI startups. It reflects a pragmatic understanding: for a Dominican Republic SMB, cost predictability and data privacy matter more than squeezing out the last 3% of accuracy.

Infrastructure reality check: The second R9700 is currently offline because of a BIOS/firmware enumeration issue. The single-GPU stack is fully operational: no production impact, just a throughput cap. This is a hardware limitation, not an architectural one. The compound design survives even partial hardware degradation.

15 Conclusion: Why This Matters for Everyone Building AI

APEXiA isn't a toy project or a proof-of-concept. It's a live, production system serving real business operations with real data. It handles real decisions — procurement planning, credit risk, demand forecasting, sales strategy — generated by ordinary people asking questions in plain Spanish.

But what makes it genuinely noteworthy as an example of Compound AI isn't that it works (there are many working AI systems). What makes it noteworthy is how it's composed:

This is what Compound AI should look like in production. Not a chatbot with a fancy prompt — a coordinated system of specialized components, connected by deterministic logic, monitored by automated tests, and capable of self-improvement over time.

The bottom line: If you're building an AI application today, your architecture matters more than your prompts. Choose the right model for each sub-task, handle errors programmatically, build in self-correction, and don't try to do everything in one LLM call. Compound AI isn't a buzzword — it's the engineering practice of building systems that survive contact with the real world.

APEXiA proves this principle in action. It replaces what would have been a team of analysts, a data scientist, and a BI developer, and it does so while running on a single box with two workstation-class GPUs, costing almost nothing in operational expenses, and serving the specific business logic of a Dominican Republic construction materials factory.

That's compound AI. Not theory. Not a research paper. Shipped software.

Built by Ludwid Reyes · APEXiA · Miami, FL — serving Latin America

v17.1-qwen-unified · 2026-05-31

— The system you're reading about is currently live and serving production traffic at localhost:8101 —