How a Dominican materials factory runs self-improving demand forecasting on a single server — and what Berkeley and Databricks research tells us about why this architecture works.
The demand for forecasting in small and mid-sized manufacturing is enormous. A construction-materials factory in the Dominican Republic needs to know exactly how much cement, steel, and lumber to order each week. Missing the mark by five percent means either lost sales from stockouts or dead inventory that ties up cash flow.
The conventional answer is "AutoML" — throw a black box at your data and hope it produces an accurate model. But black boxes are opaque, hard to audit, and notoriously difficult to improve when they start degrading. And the alternative pitched today is "agentic AI" — point a Large Language Model at your pipeline and let it decide the next step.
We built the third option: iAxCientifico, a Compound AI System where an LLM proposes features, real solvers train models, PostgreSQL stores institutional memory, and n8n coordinates the orchestration. Every component does what it's actually good at. None of them is the system. The system is the cooperation.
The argument for compound systems over monolithic LLMs or pure agentic loops is no longer theoretical. It's the consensus among the people actually building production AI.
In their landmark February 2024 paper "The Shift from Models to Compound AI Systems", Matei Zaharia, Omar Khattab, and colleagues at UC Berkeley's AI Research lab analyzed what's actually delivering state-of-the-art results across AI. Their finding was definitive:
"State-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models."
The evidence was clear. AlphaCode 2 achieved 85th-percentile human performance on coding contests not through a single larger model, but through a carefully engineered system: generate up to 1M candidate solutions, then filter, score, and cluster them. AlphaGeometry combined an LLM with a symbolic math engine to perform between the silver- and gold-medalist levels at the International Math Olympiad. Microsoft's Medprompt exceeded GPT-4's medical-exam accuracy by 9% through a chain-of-thought + nearest-neighbor-search + 11-sample ensemble system.
The researchers identified four reasons compound systems outperform monolithic approaches:
Databricks reached the same conclusion from a different direction: the production deployments of their customers. In their June 2024 Mosaic AI announcement, they documented a concrete case study:
Financial research firm FactSet deployed a commercial LLM for their Text-to-Financial-Formula task. The monolithic approach achieved 55% accuracy. When they modularized the task into a compound system — classifying the query, retrieving relevant formulas, generating the formula, validating syntax — accuracy jumped to 85%.
Monolithic LLM: 55% accuracy. Compound system: 85% accuracy.
Their research showed that 60% of LLM applications already use RAG and 30% use multi-step chains. The industry was already practicing compound systems before the research named it.
iAxCientifico is the AutoML component of the APEXiA product family, running for Harder SRL, a construction-materials factory in the Dominican Republic. It builds and continuously maintains demand-forecasting and churn-prediction models — without a human data scientist sitting at the keyboard.
The architecture is a textbook example of the compound AI pattern. Here's the full system diagram:
The system has four cooperating components. None is the system. The system is what you get when they cooperate.
Ingests the day's source-system data from SIGAF (the Oracle accounting system), refreshes materialized views across ia.* and bi.* schemas. This is the data substrate the LLM reasons over. The ETL is deterministic Python — no LLM involved.
An LLM (Qwen3.6 on :8011) is prompted with:
The LLM proposes one candidate feature with SQL/Python build code and a paragraph of reasoning. The system materializes the feature on the full history, retrains the model, measures WMAPE delta, and commits only if at least 4 of 6 hyperparameter-perturbed trials clear the bar.
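The acceptance gate described above can be sketched in a few lines. This is a minimal illustration, not the production code: the function names, the `min_improvement` bar, and the sample numbers are assumptions, and only the "at least 4 of 6 trials" rule comes from the text. WMAPE is computed here in its usual form, total absolute error divided by total absolute actual demand.

```python
from typing import Sequence


def wmape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Weighted MAPE: total absolute error divided by total absolute demand."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total_demand = sum(abs(a) for a in actual)
    if total_demand == 0:
        raise ValueError("WMAPE is undefined when total actual demand is zero")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / total_demand


def accept_feature(
    baseline_wmape: Sequence[float],
    candidate_wmape: Sequence[float],
    min_improvement: float = 0.0,  # assumed bar; the article does not state it
    min_wins: int = 4,             # from the text: at least 4 of 6 trials
) -> bool:
    """Commit only if enough hyperparameter-perturbed trials clear the bar."""
    if len(baseline_wmape) != len(candidate_wmape):
        raise ValueError("each trial needs a baseline and a candidate score")
    wins = sum(
        1
        for base, cand in zip(baseline_wmape, candidate_wmape)
        if base - cand > min_improvement
    )
    return wins >= min_wins


# Six perturbed trials (illustrative numbers): four improve, two regress.
baseline = [0.20, 0.20, 0.20, 0.20, 0.20, 0.20]
candidate = [0.18, 0.19, 0.21, 0.17, 0.19, 0.22]
decision = accept_feature(baseline, candidate)
```

Requiring a majority across perturbed trials, rather than a single win, guards against a feature that only helps under one lucky hyperparameter setting.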
Rejected features go to the cientifico.demand_feature_registry with a revisit_when tag — "this would have moved WMAPE if cement_price_data existed."
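One way to picture the revisit queue: when a new data source lands, scan the disabled rows and pick those whose conditions are now met. The sketch below assumes each revisit_when tag names a data source that must exist; the `RegistryRow` class, the helper name, and the sample feature names are hypothetical and not taken from the real registry.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class RegistryRow:
    """Hypothetical in-memory view of one registry row."""
    feature_name: str
    enabled: bool
    revisit_when: List[str] = field(default_factory=list)


def features_to_revisit(
    rows: Iterable[RegistryRow], available_sources: Iterable[str]
) -> List[str]:
    """Return disabled features whose revisit conditions are all satisfied."""
    available = set(available_sources)
    return [
        row.feature_name
        for row in rows
        if not row.enabled
        and row.revisit_when  # rows with no tag are never auto-revisited
        and set(row.revisit_when) <= available
    ]


# Illustrative rows; names are made up for the example.
rows = [
    RegistryRow("cement_price_lag", enabled=False, revisit_when=["cement_price_data"]),
    RegistryRow("is_payday", enabled=True),
    RegistryRow("fx_rate_delta", enabled=False, revisit_when=["bcrd_fx_data"]),
]
queue = features_to_revisit(rows, {"cement_price_data"})
```

Here only `cement_price_lag` is queued, because its single condition is satisfied while `fx_rate_delta` still waits on data that has not arrived.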
Reads the current feature set, proposes new hyperparameter regions, retrains with Prophet and XGBoost, and writes new predictions to a daily table. Coupled with the FE loop: when FE tentatively accepts a feature, the HP loop retunes for the new feature space; when HP substantially shifts, the FE loop re-evaluates recently-disabled features.
Reads dashboard materialized views, feeds structured KPI data to the LLM, and generates a plain-language executive summary in Spanish. Written back to bi.executive_summaries as an UPSERT.
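The UPSERT write-back can be illustrated with a small stand-in. The sketch below uses Python's built-in sqlite3 with an in-memory table so it runs anywhere; the column names and the (summary_date, language) key are assumptions, since the article does not show the schema of bi.executive_summaries. PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` shape, though a Postgres driver would use its own placeholder style.

```python
import sqlite3

# Hypothetical columns; the real bi.executive_summaries schema is not shown.
UPSERT_SQL = """
INSERT INTO executive_summaries (summary_date, language, body)
VALUES (?, ?, ?)
ON CONFLICT (summary_date, language)
DO UPDATE SET body = excluded.body
"""


def write_summary(conn: sqlite3.Connection, summary_date: str,
                  language: str, body: str) -> None:
    """Insert the summary, or overwrite the body if the key already exists."""
    conn.execute(UPSERT_SQL, (summary_date, language, body))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE executive_summaries ("
    " summary_date TEXT NOT NULL,"
    " language TEXT NOT NULL,"
    " body TEXT NOT NULL,"
    " PRIMARY KEY (summary_date, language))"
)

# Running the summary step twice for the same day leaves a single row.
write_summary(conn, "2026-01-15", "es", "Primer borrador")
write_summary(conn, "2026-01-15", "es", "Resumen final")
stored = conn.execute(
    "SELECT summary_date, language, body FROM executive_summaries"
).fetchall()
```

The practical benefit is idempotence: a re-run of the summary step replaces the day's text instead of piling up duplicates.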
The most important table in the system isn't any operational view or materialized dashboard. It's cientifico.demand_feature_registry — a theory log where every feature the system has ever tried lives permanently, with its reasoning.
```sql
CREATE TABLE cientifico.demand_feature_registry (
    id                 SERIAL PRIMARY KEY,
    feature_name       TEXT UNIQUE NOT NULL,
    feature_type       TEXT NOT NULL,          -- 'calendar' or 'query'
    build_code         TEXT NOT NULL,          -- materialization SQL/Python
    forecast_code      TEXT NOT NULL,          -- inference-time expression
    prophet_compatible BOOLEAN DEFAULT TRUE,
    xgboost_compatible BOOLEAN DEFAULT TRUE,
    fill_value         NUMERIC DEFAULT 0,
    enabled            BOOLEAN DEFAULT TRUE,
    proposed_by        TEXT DEFAULT 'manual',
    reasoning          TEXT,                   -- the theory
    revisit_when       TEXT[],                 -- "what data is needed to revisit"
    run_id             UUID,                   -- run-id discipline
    created_at         TIMESTAMPTZ DEFAULT NOW()
);
```
As of today, 62 rows exist. 17 are enabled (active in production). 45 are disabled. The acceptance rate is around 27% — which is high relative to traditional AutoML because the LLM proposals are hypothesis-driven, not brute-force kitchen-sink. Every accepted feature has a theory attached. Every rejected feature has a reason.
In the Dominican Republic, most workers are paid twice a month (the 15th and end of month). This is the quincenal cycle, and it's a real driver of construction-materials demand — small contractors stock up around paydays.
Claude Opus proposed a payday feature 9 times in different shapes. All were rejected. Qwen proposed a similar feature once, as a clean binary flag, and it landed. Why? Qwen had a different prior about what makes a good feature for gradient-boosted trees. Trees don't need sinusoidal encodings or distance metrics. They need binary splits.
This is proposer diversity in action. Running multiple LLMs as proposers produces a wider search space than any single model. This is the same logic that makes ensemble methods work, applied at the meta level of feature engineering.
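To make the binary payday flag concrete, here is a minimal sketch. It assumes the flag fires exactly on the 15th and on the last calendar day of the month; the function name is hypothetical, and the accepted feature may well use a wider window or shift for weekends.

```python
import calendar
from datetime import date


def is_quincena_payday(day: date) -> int:
    """Return 1 if `day` is the 15th or the last day of its month, else 0."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return int(day.day == 15 or day.day == last_day)


# A short run of dates around a month end, flagged one by one.
flags = [is_quincena_payday(date(2024, 2, d)) for d in (14, 15, 16, 28, 29)]
```

A tree model can split on this column directly, which is the point of the paragraph above: no sinusoidal encoding or distance-to-payday metric is needed for the split to be learnable.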
This is the question that matters most in 2026. An agentic loop would point an LLM at the data, give it tools, and let it decide the next step end-to-end. On the surface this seems like the natural next step after compound systems. But for production forecasting, it has three structural problems.
The right way to put it: agentic AI is best at tasks where a single capable model can hold the whole problem in its head. For genuinely complex tasks — building and continuously maintaining a production forecasting model against drifting data — the engineering discipline that has worked for forty years (separation of concerns, real solvers for math, audit trails) doesn't go away because the language model is now smart.
iAxCientifico runs on a single host — 128GB RAM, two AMD Radeon AI PRO R9700 GPUs, no Kubernetes, no managed orchestration. Here's how Linux components combine:
The local LLM is served by vLLM with MXFP4 quantization, MTP speculative decoding, and a qwen3_xml tool-calling parser. It's managed as a systemd service: apexia-vllm-qwen.service. The MoE architecture (3B active / 35B total) means only 3 billion parameters activate per token — making it fast enough for real-time tool-calling in the compound loop.
All schemas live in one container: my-postgres. The search path is set to ia, public. The cientifico.* schema holds feature registries, run state, and trained models. The bi.* schema holds materialized dashboard views. SQL is validated by running it under the read-only harder_user role.
A single Docker container with its own Postgres. n8n's visual workflow editor lets me iterate on cost/quality tradeoffs by swapping models in a single HTTP node — different URL, same workflow shape. This low-friction swapping is what actually enables proposer diversity: the Claude loop, the Qwen loop, the Gemma experiment — all coexist and can be activated on demand.
Every service is a systemd unit: vLLM serving, the ETL pipeline, backup notifications, ETL heartbeat. No Kubernetes, and no separate workflow scheduler, executor, or metadata DB to operate. Just systemctl start and journalctl. This is the budget-appropriate choice for a single-operator, single-host stack.
The Berkeley AI Research paper identified four reasons compound systems outperform monolithic approaches. Here's how iAxCientifico maps to them:
| BERKELEY PRINCIPLE | HOW iAxCientifico IMPLEMENTS IT |
|---|---|
| System design > model scaling | Purpose-built compound loop (FE proposal → real training → measured WMAPE) produces better forecasts than any single model would, and iterates faster than any training run could. |
| Systems are dynamic | BCRD macroeconomic data integration, new external data sources — the revisit_when queue reactivates previously-rejected features when new data lands. |
| Control and trust | Every feature has reasoning attached. Every decision is traceable to a run_id. The theory log is an audit trail — a black box isn't. |
| Variable performance goals | Tiered proposer architecture: Qwen (local, free) for daily operations, Claude (paid API) for premium tier. Different costs, same architecture. |
The Databricks FactSet case study is the real-world proof: 55% → 85% accuracy by modularizing the task into specialized compound steps rather than trusting a single model.
There will come a day when agentic systems are fast enough, cheap enough, and accurate enough to handle end-to-end AutoML in a single model call. When that happens, an agentic version of this system might become possible.
But the compound version will still produce the same forecast for a fraction of the cost. The compute spent on language-model reasoning at every step of an agentic loop is exactly the cost that compound architectures avoid by routing math to math and language to language.
In other words: the agentic transition will turn compound architectures from "the only way that works" into "the cost-optimized way that works." For operators who care about margin per tenant — like a materials factory where forecast accuracy directly maps to inventory and cash flow — that's the version that matters anyway.
The architecture also benefits from two different model-progress trajectories simultaneously. Local open-source models (Qwen, Gemma) will keep improving, so the local proposer slot will silently get stronger, for free, on hardware already owned. Meanwhile, premium-tier frontier models will do genuinely agentic work that nothing today can match. The compound architecture hosts both; the operator doesn't have to predict which tier will dominate.
The system has known unfinished edges — all tractable additions to the same loop:
None require rearchitecting the system. They're additions to the loop, in the same spirit.