Building an AI Orchestration System at Home: What Actually Worked

Ayra ix · 8 min read

Spoiler: it wasn't LangChain

Buying a mid-range GPU "just to try something" is a lie a lot of people tell themselves. It's convincing enough to believe right up until the box arrives, the GPU goes into the PCIe slot, and the first ollama pull llama3.1:8b runs. The model downloads. A prompt goes in. It responds. It's 11pm on a Tuesday.

Staying up until 4am talking to it isn't about being impressed. It's a local 8B model, so it's confidently wrong about things, which is a different kind of impressive. It's that something clicks: this thing is running on local hardware, not billing per token. That's the hook.

Six months later: five pipelines running 24/7. Here's what got learned, what got thrown away, and the thing the tutorials never tell you.

Inference: local, GPU-backed
Models: multi-model, rotating pool
Uptime: always-on
Pipelines: active, automated

Every Framework Failed the Same Way

The first two months went the way they go for everyone: LangChain first, then AutoGen, then a custom Python orchestrator, then Flowise, then CrewAI. Each one had a working demo in a day. Each one fell apart in production for exactly the same reason.

LangChain was the worst. A RAG pipeline worked fine for about a week. Then it started returning inconsistent results that were hard to diagnose. The debugger led straight into LangChain's own source code, not the pipeline's code. The chain had swallowed the actual error somewhere inside an abstraction. Dozens of Python packages installed, no clear sense of which ones were doing anything. Thrown out after three weeks.

AutoGen was more interesting but the same problem at a different layer. Three agents — a planner, an executor, a critic — worked great on simple tasks. Then the planner and critic disagreed on something, and there was no way to tell what messages had passed between them or why the conversation had gone sideways. No execution trace. No "here is what each agent said and what it decided." Thrown out after two weeks.

A custom Python framework lasted a month before it became clear it was just a worse version of n8n. Error handling, retries, scheduling, logging: that's infrastructure code, not the actual problem being solved. That one stings more, because controlling the code feels like understanding the system right up until it doesn't.

cat /var/log/frameworks-tried.log

ERROR LangChain 0.0.x — abstraction layer broke on every model update

ERROR AutoGen — agent loops with no exit condition, 40min debug session

ERROR Custom Python orchestrator — 800 lines, works once, broke on deploy

WARN Flowise — good UI, state management hit walls at workflow 3

WARN CrewAI — impressive demos, production reliability: concerning

OK n8n — first automation ran in 20 minutes. Still running.

The pattern that kept showing up

Every framework tried abstracted away exactly the things you need to see when something breaks. When a pipeline fails at 3am and you wake up to a stuck job, there's one question that needs a fast answer: what happened at each step? None of these tools made that easy. That's not a minor inconvenience. That's a tax on every maintenance hour.

n8n Worked on the First Try

n8n was already in use for basic automations — nothing fancy, some webhook-to-chat stuff. The next need was a scheduled ingestion workflow that would fetch external records, filter them against qualification rules, and push results to a phone via ntfy. Building it in n8n was the path of least resistance. One node made an HTTP call to http://localhost:11434/api/generate — the Ollama REST endpoint — to classify each record.

It worked. The workflow showed every node. The Ollama response was JSON, inspectable right there in the execution log. When something broke, it was obvious which node had failed, what the input was, what came back. Fixed in minutes, not hours.
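
For anyone who wants to see that classification call outside of n8n, here is a minimal Python sketch of the same HTTP request. The endpoint and the llama3.1:8b model name come from this article; the prompt wording, the QUALIFIED/REJECTED labels, and the function names are assumptions made for illustration, not the workflow's actual configuration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint named in the article


def build_payload(record_text: str, model: str = "llama3.1:8b") -> dict:
    """Build a non-streaming /api/generate request body.

    The prompt wording and the two labels are assumptions.
    """
    prompt = (
        "Classify the record below. Answer with exactly one word, "
        "QUALIFIED or REJECTED.\n\n" + record_text
    )
    return {"model": model, "prompt": prompt, "stream": False}


def parse_label(raw_body: str) -> str:
    """Extract the label from Ollama's JSON reply (the text is in 'response')."""
    text = json.loads(raw_body)["response"].strip().upper()
    # Anything unexpected is surfaced as UNKNOWN instead of being guessed at.
    return text if text in {"QUALIFIED", "REJECTED"} else "UNKNOWN"


def classify(record_text: str) -> str:
    """POST one record to the local model and return its label."""
    data = json.dumps(build_payload(record_text)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return parse_label(resp.read().decode("utf-8"))
```

Keeping the request builder and the response parser separate from the network call means the raw JSON on both sides can be printed and inspected, which is the same property the n8n execution log provides.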

Everything got rebuilt after that.

"n8n handles orchestration flow. A local model handles inference. HTTP calls in between. That's the whole architecture."

A content-scoring pipeline came first. Then a system health monitor. Then a task-execution pipeline that reads a written task description and carries out the steps unattended. Then a memory layer. Then an activity dashboard. All of it is workflow automation making HTTP calls to a local model, with results stored in a database. There is almost no custom code left outside of the workflow tool — what remains is for the handful of integrations that need real session management, not because the orchestration itself required it.

What's Actually Running

Five pipelines. All running on self-hosted infrastructure with GPU-backed local inference via Docker and a rotating model pool.

Content scorer. Runs every three hours. Ingests external records, filters against configurable qualification gates (relevance, category, quality threshold), sends each passing record to a code-tuned local model for scoring, and creates a task-tracker entry for high-confidence matches. A human reviews before any irreversible action. This is the pipeline that made the whole project worth it — it surfaces signal that manual review would have missed.
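
The qualification gates can be sketched as plain data plus one filter function. This is a rough illustration only: the article names the gate types (relevance, category, quality threshold), while the field names, threshold values, and category labels below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gates:
    """Hypothetical gate settings; the article names only the gate types."""
    min_relevance: float = 0.6
    allowed_categories: frozenset = frozenset({"release", "advisory"})
    min_quality: float = 0.5


def failed_gates(record: dict, gates: Gates) -> list:
    """Return the names of the gates a record fails (empty list means it passes)."""
    failures = []
    if record.get("relevance", 0.0) < gates.min_relevance:
        failures.append("relevance")
    if record.get("category") not in gates.allowed_categories:
        failures.append("category")
    if record.get("quality", 0.0) < gates.min_quality:
        failures.append("quality")
    return failures


def qualify(records: list, gates: Gates) -> list:
    """Keep only records that pass every gate, before any model call is made."""
    return [r for r in records if not failed_gates(r, gates)]
```

Returning the names of the failed gates, rather than a bare True or False, keeps the reason for every rejection visible in the run log.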

System health monitor. Every 15 minutes, checks disk temps, pool status, container health, GPU utilization. If anything looks off, a push alert goes out. A smaller, fast model handles the classification here — light on GPU memory, and "is this normal or not" doesn't need a large model.
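
The alerting half of this pipeline can be sketched in a few lines. Two things to note: the pipeline described above uses a small model to decide whether readings look normal, while this sketch swaps in fixed thresholds so it stays self-contained; and the server address, topic name, and reading names are assumptions. The publish call follows ntfy's documented HTTP interface, with the message in the request body and Title and Priority sent as headers.

```python
import urllib.request

NTFY_BASE = "http://localhost:8080"  # assumption: a self-hosted ntfy server
NTFY_TOPIC = "homelab-alerts"        # assumption: topic name


def find_problems(readings: dict, limits: dict) -> list:
    """Return one message for every reading above its limit."""
    return [
        f"{name}={value} exceeds limit {limits[name]}"
        for name, value in sorted(readings.items())
        if name in limits and value > limits[name]
    ]


def build_alert(problems: list) -> urllib.request.Request:
    """Build the ntfy publish request: message in the body, metadata in headers."""
    return urllib.request.Request(
        f"{NTFY_BASE}/{NTFY_TOPIC}",
        data="\n".join(problems).encode("utf-8"),
        headers={"Title": "Health check", "Priority": "high"},
        method="POST",
    )


def check_and_alert(readings: dict, limits: dict) -> bool:
    """Send one push alert if anything is over its limit; return True if sent."""
    problems = find_problems(readings, limits)
    if not problems:
        return False
    with urllib.request.urlopen(build_alert(problems), timeout=10):
        return True
```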

Task executor. A written task description goes in; the workflow tool picks it up, sends it to a reasoning-focused local model to plan the steps, and executes each step against the target system. The model choice matters here: a model that reasons step by step before answering is noticeably better at multi-step task decomposition than the others.
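
One way to picture the plan-then-execute loop is the sketch below. It assumes the model returns its plan as a JSON list of steps with "action" and "arg" keys and that each action maps to an allow-listed handler; that plan format and the function names are hypothetical, not the workflow's real schema. The point of the sketch is the per-step trace, which keeps every step's outcome inspectable.

```python
import json


def parse_plan(model_output: str) -> list:
    """Parse the model's plan, expected as a JSON list of {"action", "arg"} objects."""
    plan = json.loads(model_output)
    if not isinstance(plan, list):
        raise ValueError("plan must be a JSON list of steps")
    return plan


def run_plan(plan: list, handlers: dict) -> list:
    """Run steps in order; stop at the first failure and keep a per-step trace."""
    trace = []
    for index, step in enumerate(plan):
        action = step.get("action")
        handler = handlers.get(action)
        if handler is None:
            # Actions outside the allow-list are refused, not improvised.
            trace.append({"step": index, "action": action, "ok": False,
                          "result": "unknown action"})
            break
        try:
            result = handler(step.get("arg"))
            trace.append({"step": index, "action": action, "ok": True,
                          "result": result})
        except Exception as exc:  # record the failure rather than hiding it
            trace.append({"step": index, "action": action, "ok": False,
                          "result": repr(exc)})
            break
    return trace
```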

Memory system. After each session, a workflow pulls the transcript, sends it to a local model to extract learnings, and writes them into a vector database as embeddings. The collection holds thousands of points. Retrieval runs through a lightweight proxy layer so the embedding model can be swapped without touching the workflow.
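
The retrieval side of a memory layer like this comes down to nearest-neighbour search over embeddings. The sketch below uses an in-memory list as a stand-in for the vector database and plain cosine similarity for ranking; the class and method names are hypothetical, and the vectors are assumed to come from whatever embedding model sits behind the proxy.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryStore:
    """In-memory stand-in for the vector database collection."""

    def __init__(self):
        self.points = []  # list of (embedding, learning_text) pairs

    def add(self, embedding: list, text: str) -> None:
        """Store one extracted learning together with its embedding."""
        self.points.append((embedding, text))

    def search(self, query: list, top_k: int = 3) -> list:
        """Return the top_k stored learnings most similar to the query vector."""
        ranked = sorted(self.points, key=lambda p: cosine(query, p[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]
```

A real vector database does the same ranking with an index instead of a full scan, which matters once the collection holds thousands of points.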

Activity dashboard. Aggregates events from all of the above into a live view. Mostly a simple event stream into a database table, but it means the system's recent activity is visible at a glance.
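
An event stream into a database table can be as small as the sketch below. SQLite stands in for the relational database here, and the table layout, column names, and function names are assumptions for illustration rather than the actual schema.

```python
import sqlite3
import time


def open_events(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the events table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " ts REAL NOT NULL,"
        " pipeline TEXT NOT NULL,"
        " status TEXT NOT NULL,"
        " detail TEXT)"
    )
    return conn


def record_event(conn, pipeline: str, status: str, detail: str = "") -> None:
    """Append one event; each pipeline would call this once per run."""
    conn.execute(
        "INSERT INTO events (ts, pipeline, status, detail) VALUES (?, ?, ?, ?)",
        (time.time(), pipeline, status, detail),
    )
    conn.commit()


def recent_events(conn, limit: int = 20) -> list:
    """Newest events first, as (pipeline, status, detail) tuples."""
    rows = conn.execute(
        "SELECT pipeline, status, detail FROM events ORDER BY id DESC LIMIT ?",
        (limit,),
    )
    return rows.fetchall()
```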

The stack, actually: a workflow-automation tool for scheduling and data flow, local models for inference (a code/classification model, a reasoning model, a fast bulk model), a lightweight proxy so model changes don't require touching workflows, a vector database for memory, push notifications for alerts, a relational database for state.

And what got thrown away along the way:

LangChain RAG pipeline: three weeks, worked for one week, broke mysteriously, spent more time reading framework source than the pipeline's own code. Out.

AutoGen multi-agent setup: planner + executor + critic. Great demos. When the agents disagreed on something real, there was no trace of why. No visibility into the conversation. Out after two weeks.

Custom Python orchestrator: hand-built, lasted a month, turned out to be a worse version of an existing workflow tool with no UI. Out.

Flowise: looked promising. Hit a wall the moment any logic outside the node library was needed. "Low-code" turns out to mean "easy until it doesn't apply." Out.

CrewAI: same visibility problem as AutoGen. No way to see what was happening inside an agent run without reading undocumented internals. Out.

Every one of these still sits on disk somewhere, as a reminder that frameworks are not the point.

What Nobody Tells You About Model Selection

Most tutorials say "just use one model for everything" and move on. That's fine for a demo. In a real system, running one model for everything is slow, memory-hungry, and often wrong for the task at hand.

The three-model pattern that works

Fast bulk model: classification, yes/no decisions, quick extraction. Smaller models fit comfortably in local GPU memory and respond in under a second. Use it wherever a whole list of items has to be processed one at a time.

Code/technical model: code generation, structured JSON output, anything technical. A solid code-tuned local model is consistently better than general-purpose models of similar size on this category of task.

Reasoning model: multi-step planning, anything that needs to think before answering. Slower, but the chain-of-thought output is useful — logging it sometimes surfaces something the final answer doesn't.

A lightweight proxy layer sits in front of all of these, so workflows call one consistent endpoint with a model name. Swapping one model for another later means changing one config line, not every workflow.
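
The core of such a proxy is a lookup from a stable name to whatever model currently fills that role. The sketch below shows only that mapping, not a running HTTP server. It assumes workflows send a role alias ("fast", "code", "reasoning", "embedding"), which is a hypothetical naming scheme, and it reuses the placeholder model names that appear in this article's model listing.

```python
# One place to change when a model is swapped. The role aliases are an
# assumption; the model names are the article's own placeholders.
MODEL_CONFIG = {
    "fast": "fast-model:8b",
    "code": "code-model:14b",
    "reasoning": "reasoning-model:14b",
    "embedding": "embedding-model:latest",
}


def resolve_model(role: str, config: dict = MODEL_CONFIG) -> str:
    """Map a role alias to a concrete model name; fail loudly on unknown roles."""
    try:
        return config[role]
    except KeyError:
        raise ValueError(f"unknown model role: {role!r}") from None


def route_request(role: str, prompt: str, config: dict = MODEL_CONFIG) -> dict:
    """Build the upstream request body a proxy could forward to the model server."""
    return {"model": resolve_model(role, config), "prompt": prompt, "stream": False}
```

With this shape, replacing the fast model means editing one dictionary entry, and every workflow that asks for "fast" picks up the change on its next call.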

curl -s http://localhost:11434/api/tags | jq '.models[].name'

"code-model:14b"

"reasoning-model:14b"

"fast-model:8b"

"embedding-model:latest"

# Multi-model · Local GPU-backed · Always-on

The Actual Lesson

The system running now shares almost nothing with the early versions. Those had LangChain, a different database schema, a logging approach that seemed clever at the time, and later a custom orchestrator that seemed smart. The current system works because it got rebuilt five times. Each rebuild taught one thing and threw out two.

The move that changed everything wasn't a technical decision. It was accepting the need to see everything. Every input. Every output. Every failure. Any tool that hid those things was the wrong tool, regardless of how many GitHub stars it had.

For anyone starting: pull a local model, run it, call it with curl, look at the raw JSON. Build three things that call it directly before installing any framework. That teaches more in three days than three weeks of documentation.

The late-night conversations with the local model are still happening. Now they happen automatically, unattended. That's a better outcome than "just trying something" had any right to produce.