iAxCientífico: Why a Compound AI System Beats Purely Agentic AutoML

How a Dominican materials factory runs self-improving demand forecasting on a single server — and what Berkeley and Databricks research tells us about why this architecture works.

The demand for forecasting in small and mid-sized manufacturing is enormous. A construction-materials factory in the Dominican Republic needs to know exactly how much cement, steel, and lumber to order each week. Missing the mark by five percent means either lost sales from stockouts or dead inventory that ties up cash flow.

The conventional answer is "AutoML" — throw a black box at your data and hope it produces an accurate model. But black boxes are opaque, hard to audit, and notoriously difficult to improve when they start degrading. And the alternative pitched today is "agentic AI" — point a Large Language Model at your pipeline and let it decide the next step.

We built the third option: iAxCientifico, a Compound AI System where an LLM proposes features, real solvers train models, PostgreSQL stores institutional memory, and n8n coordinates the orchestration. Every component does what it's actually good at. None of them is the system. The system is the cooperation.


The Research: Why Compound Beats Monolithic (and Agentic)

The argument for compound systems over monolithic LLMs or pure agentic loops is no longer theoretical. It's the consensus among the people actually building production AI.

Berkeley AI Research (BAIR)

In their landmark February 2024 paper "The Shift from Models to Compound AI Systems", Matei Zaharia, Omar Khattab, and colleagues at UC Berkeley's AI Research lab analyzed what's actually delivering state-of-the-art results across AI. Their finding was definitive:

"State-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models."

The evidence was clear. AlphaCode 2 achieved 85th-percentile human performance on coding contests not through a single larger model, but through a carefully engineered system: generate 1M candidate solutions, filter them, score them, cluster them. AlphaGeometry combined an LLM with a symbolic math engine to reach between silver and gold at the International Math Olympiad. Microsoft's MedPrompt exceeded GPT-4's medical-exam accuracy by 9% through a chain-of-thought + nearest-neighbor-search + 11-sample ensemble system.

The researchers identified four reasons compound systems outperform monolithic approaches:

Databricks Mosaic AI

Databricks reached the same conclusion from a different direction — the production deployments of their customers. In their June 2024 announcement of Mosaic AI Agent Bricks, they documented a concrete case study:

The FactSet Case Study

Financial research firm FactSet deployed a commercial LLM for their Text-to-Financial-Formula task. The monolithic approach achieved 55% accuracy. When they modularized the task into a compound system — classifying the query, retrieving relevant formulas, generating the formula, validating syntax — accuracy jumped to 85%.

55%
Monolithic LLM

85%
Compound System

Their research showed that 60% of LLM applications already use RAG and 30% use multi-step chains. The industry was already practicing compound systems before the research named it.

Additional supporting findings


The iAxCientífico Architecture

iAxCientifico is the AutoML component of the APEXiA product family, running for Harder SRL, a construction-materials factory in the Dominican Republic. It builds and continuously maintains demand-forecasting and churn-prediction models — without a human data scientist sitting at the keyboard.

The architecture is a textbook example of the compound AI pattern. Here's the full system diagram:

┌──────────────────────────────────────────────────────────────────┐ │ Postgres (shared instance) │ │ │ │ ia.* views analysis layer (SQL views on raw schemas) │ │ bi.mv_* views dashboard layer (materialized views) │ │ cientifico.registry theory log (62 rows: 17 enabled, 45 disabled) │ │ cientifico.run_state model version, metrics, HP state │ │ cientifico.predictions daily forecasts & churn predictions │ └───────────▲──────────────────────────────────────────────────────┘ │ read_schema + read_registry │ ┌───────────┴──────────────────────────────────────────────────────┐ │ Local LLM Serving (vLLM, :8011) │ │ │ │ Qwen3.6-35B-A3B-MXFP4 (MoE, 3B active / 35B total) │ │ AMD Radeon AI PRO R9700 x2 (RDNA4, TP2) │ │ ~100 tok/s single-stream, ~1000 aggregate │ │ Context window: 262,144 tokens │ └───────────▲──────────────────────────────────────────────────────┘ │ prompt → propose → validate │ ┌───────────┴──────────────────────────────────────────────────────┐ │ n8n Orchestrator (:5678) │ │ │ │ Active workflows (19): │ │ 1. AutoML Forecast FE+HP Coupled - Qwen (EasvsgTVPxjCEaLe) │ │ 2. AutoML Churn FE+HP Coupled - Qwen (z5lvyFX82b8gqw1b) │ │ 3. Executive Summary - Qwen (IKdkTuIBLthJ-Ysrg9VHN) │ │ 4. Monthly Forecast Snapshot (2VxcOlIQ616eR7xw) │ │ 5. + 5 more (ETL heartbeat, CRM, notifications) │ └──────────────────────────────────────────────────────────────────┘

Four Cooperation Subsystems

The system has four cooperating components. None is the system. The system is what you get when they cooperate.

1. ETL Pipeline (cron, 03:00)

Ingests the day's source-system data from SIGAF (the Oracle accounting system), refreshes materialized views across ia.* and bi.* schemas. This is the data substrate the LLM reasons over. The ETL is deterministic Python — no LLM involved.

2. Feature Engineering AutoML Loop (n8n, daily)

An LLM (Qwen3.6 on :8011) is prompted with:

The LLM proposes one candidate feature with SQL/Python build code and a paragraph of reasoning. The system materializes the feature on the full history, retrains the model, measures WMAPE delta, and commits only if at least 4 of 6 hyperparameter-perturbed trials clear the bar.

Rejected features go to the cientifico.demand_feature_registry with a revisit_when tag — "this would have moved WMAPE if cement_price_data existed."

3. Hyperparameter Loop (n8n, daily at 04:00)

Reads the current feature set, proposes new hyperparameter regions, retrains with Prophet and XGBoost, and writes new predictions to a daily table. Coupled with the FE loop: when FE tentatively accepts a feature, the HP loop retunes for the new feature space; when HP substantially shifts, the FE loop re-evaluates recently-disabled features.

4. Executive Summary Generator (n8n, daily at 05:00)

Reads dashboard materialized views, feeds structured KPI data to the LLM, and generates a plain-language executive summary in Spanish. Written back to bi.executive_summaries as an UPSERT.


The Theory Log: Institutional Memory in PostgreSQL

The most important table in the system isn't any operational view or materialized dashboard. It's cientifico.demand_feature_registry — a theory log where every feature the system has ever tried lives permanently, with its reasoning.

CREATE TABLE cientifico.demand_feature_registry (
  id               SERIAL PRIMARY KEY,
  feature_name     TEXT UNIQUE NOT NULL,
  feature_type     TEXT NOT NULL,        -- 'calendar' or 'query'
  build_code       TEXT NOT NULL,        -- materialization SQL/Python
  forecast_code    TEXT NOT NULL,        -- inference-time expression
  prophet_compatible BOOLEAN DEFAULT TRUE,
  xgboost_compatible BOOLEAN DEFAULT TRUE,
  fill_value       NUMERIC DEFAULT 0,
  enabled          BOOLEAN DEFAULT TRUE,
  proposed_by      TEXT DEFAULT 'manual',
  reasoning        TEXT,                   -- the theory
  visited_when     TEXT[],               -- "what data is needed to revisit"
  run_id           UUID,                 -- run-id discipline
  created_at       TIMESTAMPTZ DEFAULT NOW()
);

As of today, 62 rows exist. 17 are enabled (active in production). 45 are disabled. The acceptance rate is around 27% — which is high relative to traditional AutoML because the LLM proposals are hypothesis-driven, not brute-force kitchen-sink. Every accepted feature has a theory attached. Every rejected feature has a reason.

A concrete case: the quincenal payday cycle

In the Dominican Republic, most workers are paid twice a month (the 15th and end of month). This is the quincenal cycle, and it's a real driver of construction-materials demand — small contractors stock up around paydays.

Claude Opus proposed it 9 times in different shapes. All were rejected. Qwen proposed a similar feature once — a clean binary flag — and it landed. Why? Qwen had a different prior about what makes a good feature for gradient-boosted trees. Trees don't need sinusoidal encodings or distance metrics. They need binary splits.

This is proposer diversity in action. Running multiple LLMs as proposers produces a wider search space than any single model. This is the same logic that makes ensemble methods work, applied at the meta level of feature engineering.


Why Not Just Use an Agent?

This is the question that matters most in 2026. An agentic loop would point an LLM at the data, give it tools, and let it decide the next step end-to-end. On the surface this seems like the natural next step after compound systems. But for production forecasting, it has three structural problems.

Problem 1: Agentic loops trade speed for autonomy

Problem 2: Hallucination compounds across steps

Problem 3: There's no math floor in pure agentic systems

The right way to put it: agentic AI is best at tasks where a single capable model can hold the whole problem in its head. For genuinely complex tasks — building and continuously maintaining a production forecasting model against drifting data — the engineering discipline that has worked for forty years (separation of concerns, real solvers for math, audit trails) doesn't go away because the language model is now smart.


How Linux Powers Everything

iAxCientifico runs on a single host — 128GB RAM, two AMD Radeon AI PRO R9700 GPUs, no Kubernetes, no managed orchestration. Here's how Linux components combine:

vLLM (Qwen serving, port :8011)

The local LLM is served by vLLM with MXFP4 quantization, MTP speculative decoding, and a qwen3_xml tool-calling parser. It's managed as a systemd service: apexia-vllm-qwen.service. The MoE architecture (3B active / 35B total) means only 3 billion parameters activate per token — making it fast enough for real-time tool-calling in the compound loop.

PostgreSQL (single shared database)

All schemas live in one container: my-postgres. The search path is set to ia, public. The cientifico.* schema holds feature registries, run state, and trained models. The bi.* schema holds materialized dashboard views. The read-only harder_user role validates SQL.

n8n (orchestrator, port :5678)

A single Docker container with its own Postgres. n8n's visual workflow editor lets me iterate on cost/quality tradeoffs by swapping models in a single HTTP node — different URL, same workflow shape. This low-friction swapping is what actually enables proposer diversity: the Claude loop, the Qwen loop, the Gemma experiment — all coexist and can be activated on demand.

systemd (service management)

Every service is a systemd unit: vLLM serving, the ETL pipeline, backup notifications, ETL heartbeat. No Kubernetes — no scheduler, no executor, no metadata DB. Just systemctl start and journalctl. This is the budget-appropriate choice for a single-operator, single-host stack.


The Compound Advantage: What Berkeley and Databricks Predict, Production Delivers

The Berkeley AI Research paper identified four reasons compound systems outperform monolithic approaches. Here's how iAxCientifico maps to them:

BERKELEY PRINCIPLE HOW iAxcientífico IMPLEMENTS IT
System design > model scaling Purpose-built compound loop (FE proposal → real training → measured WMAPE) produces better forecasts than any single model would, and iterates faster than any training run could.
Systems are dynamic BCRD macroeconomic data integration, new external data sources — the revisit_when queue reactivates previously-rejected features when new data lands.
Control and trust Every feature has reasoning attached. Every decision is traceable to a run_id. The theory log is an audit trail — a black box isn't.
Variable performance goals Tiered proposer architecture: Qwen (local, free) for daily operations, Claude (paid API) for premium tier. Different costs, same architecture.

The Databricks FactSet case study is the real-world proof: 55% → 85% accuracy by modularizing the task into specialized compound steps rather than trusting a single model.


Why This Won't Be Superseded by Better Agents

There will come a day when agentic systems are fast enough, cheap enough, and accurate enough to handle end-to-end AutoML on a single model call. When that happens, the agentic version of this system might be possible.

But the compound version will still produce the same forecast for a fraction of the cost. The compute spent on language-model reasoning at every step of an agentic loop is exactly the cost that compound architectures avoid by routing math to math and language to language.

In other words: the agentic transition will turn compound architectures from "the only way that works" into "the cost-optimized way that works." For operators who care about margin per tenant — like a materials factory where forecast accuracy directly maps to inventory and cash flow — that's the version that matters anyway.

Architecture also benefits from two different model-progress trajectories simultaneously. Local open-source models (Qwen, Gemma) will keep improving — meaning the local proposer slot will silently get stronger, for free, on hardware already owned. Meanwhile, premium-tier frontier models will do genuinely agentic work that nothing today can match. The compound architecture hosts both; the operator doesn't have to predict which tier will dominate.


Looking Forward

The system has known unfinished edges — all tractable additions to the same loop:

  1. Substrate expansion. BCRD (Banco Central de la República Dominicana) macroeconomic indicators — construction PMI, GDP, inflation, remittances — are wired into the system. When new data lands, the revisit queue automatically fires re-proposals of previously-rejected features that needed that data.
  2. Multi-objective optimization. Currently optimizing for aggregate WMAPE. Adding peak-period accuracy (the moments operators care about most) and confidence-interval calibration.
  3. Drift detection on accepted features. Quarterly re-testing of every active feature under current conditions, with quiet retirement of features that have stopped pulling weight.
  4. FE↔HP coupling. When FE tentatively accepts a feature, firing a focused HP re-tune for the new feature space before committing. This is the single biggest architectural unlock remaining.

None require redeliberating the system. They're additions to the loop, in the same loop's spirit.


Further Reading