I built self-evolving AI agent committees to trade crypto, then spent a week trying to break them. The interesting part is the one thing none of it tested.

A writeup of what broke under honest stress-testing, what survived, and the layer none of it touched — the mechanical-rule edge that's the actual point.

The setup

I built a system where committees of LLM agents paper-trade crypto perpetuals. Each "niche" (technicals, funding, derivatives, order flow, sentiment, market structure, and a few more) is its own agent with its own data slice. The niche agents vote; a bull agent and a bear agent debate; a trader agent decides; sizing, exit, and risk agents manage the position. Every agent is a separate model call, run in parallel, so they reason independently rather than collapsing into one prompt's groupthink.

On top of that I added the part I was most excited about: evolution. Within each niche, multiple variants compete (same data source, different lookback parameters). They're scored by their information coefficient (IC) — the rank correlation between their signal and the realized forward return — and the best one "champions" the vote. Across niches, weights are set dynamically by IC. Weak variants get culled; strong ones breed mutated offspring. Survival of the fittest, applied to a trading committee.
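For readers who have not met IC before, here is a minimal sketch of the rank IC described above: the Spearman correlation between a signal and the forward return realized later. The function names and the toy numbers are illustrative assumptions, not the platform's code.

```python
from statistics import mean

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def rank_ic(signal, forward_return):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rs, rf = average_ranks(signal), average_ranks(forward_return)
    ms, mf = mean(rs), mean(rf)
    cov = sum((a - ms) * (b - mf) for a, b in zip(rs, rf))
    var_s = sum((a - ms) ** 2 for a in rs)
    var_f = sum((b - mf) ** 2 for b in rf)
    return cov / (var_s * var_f) ** 0.5

# Hypothetical toy data: signal at decision time, return realized H bars later.
signal = [0.9, -0.2, 0.4, -0.7, 0.1]
forward_return = [0.012, -0.003, 0.001, -0.010, 0.004]
ic = rank_ic(signal, forward_return)
print(round(ic, 3))  # 0.9
```

With pandas, the same quantity can be computed as `signal.rank().corr(forward_return.rank())` on two Series. The sketch does not guard against a constant input, which would make the denominator zero.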

It's a fun architecture. The honest question is whether any of it helps. This post is about how I tried to answer that, and what I found — which is mostly "no," in increasingly interesting ways.

Read this before the findings, or you'll read them wrong: none of the tests below included a real trading edge. Not one injected a mechanical rule, a private factor, a thesis — the layer that's actually supposed to make money. Everything here is the bare scaffolding running on generic, public signals. I was stress-testing the chassis, never the engine. So when something fails below, it means "the empty harness has no alpha" — not "this can't work." The part that could work — the soul of the system — was never on the table, and its ceiling is completely untested.

Rule zero: don't trust your own backtest

The single most common way to fool yourself in this field is a backtest that looks great and means nothing. So before believing anything, two guardrails:

No lookahead, walk-forward. At decision time, record (variant, signal, price, t). Score it only H bars later, where H is the forecast horizon, against the realized future price. Nothing about the future touches the decision.
Isolate the mechanism from the LLM. To test whether the evolution helps, I replaced every LLM agent with a deterministic numeric signal (a fixed formula from the same features). This removes two confounds at once: LLM cost (you can't run millions of model calls across a grid), and — more importantly — LLM training-data leakage (a model may "remember" past crypto prices and events; a formula can't). If the mechanism has value, it should show up even with dumb deterministic inputs.
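The first guardrail can be sketched as a decision log that is only scored once the horizon has elapsed. This is a minimal illustration under assumed names (`Decision`, `score_matured`) and made-up prices, not the actual harness.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    variant: str
    signal: float
    price: float  # price observed at decision time
    t: int        # bar index at decision time

def score_matured(decisions, prices, now, horizon):
    """Return (decision, forward_return) pairs whose horizon has fully elapsed.

    `prices` maps bar index -> price. Only bars at or before `now` are read,
    so a decision made at t is never scored before bar t + horizon exists.
    """
    scored = []
    for d in decisions:
        t_exit = d.t + horizon
        if t_exit > now:
            continue  # the future has not happened yet: leave it unscored
        forward_return = prices[t_exit] / d.price - 1.0
        scored.append((d, forward_return))
    return scored

# Hypothetical prices and decisions.
prices = {0: 100.0, 1: 102.0, 2: 101.0, 3: 104.0}
log = [
    Decision("mom_24h", 0.8, prices[0], 0),
    Decision("mom_24h", -0.3, prices[2], 2),
]
matured = score_matured(log, prices, now=3, horizon=2)
print(len(matured))  # 1
```

Only the decision made at bar 0 is scored here. The one made at bar 2 needs bar 4, which does not exist yet at `now=3`, so it stays in the log untouched.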

With that harness, on ~16 months of hourly data across the major perps, here's what came out.

Finding 1: the evolution overfits. Equal-weight beats it.

I compared three things on the same signals: equal weight with no breeding (the dumb baseline); dynamic IC weighting with no breeding; and dynamic IC weighting plus breeding (the full mechanism).
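To make the first two arms concrete, here is a minimal sketch of equal weighting next to one plausible form of dynamic IC weighting. The proportional-to-trailing-IC rule with a floor at zero is an assumption for illustration; the post does not specify the exact weighting formula, and the IC values are made up.

```python
def equal_weights(names):
    """The dumb baseline: every niche gets the same weight."""
    w = 1.0 / len(names)
    return {n: w for n in names}

def ic_weights(rolling_ic):
    """Assumed rule: weight proportional to trailing IC, floored at zero."""
    clipped = {n: max(ic, 0.0) for n, ic in rolling_ic.items()}
    total = sum(clipped.values())
    if total == 0.0:
        # No niche has positive trailing IC: fall back to the baseline.
        return equal_weights(list(rolling_ic))
    return {n: v / total for n, v in clipped.items()}

def combine(signals, weights):
    """Weighted committee vote."""
    return sum(weights[n] * signals[n] for n in signals)

# Hypothetical trailing ICs and current votes (+1 long, -1 short).
rolling_ic = {"technicals": 0.03, "funding": 0.01, "sentiment": -0.02}
signals = {"technicals": 1.0, "funding": -1.0, "sentiment": 1.0}

vote_equal = combine(signals, equal_weights(list(signals)))
vote_ic = combine(signals, ic_weights(rolling_ic))
```

The IC-weighted vote leans harder on whichever niche happened to score well recently. Whether that helps depends entirely on how reliable those trailing IC estimates are.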

The result was monotonic and consistent across coins: the more "intelligent" the mechanism, the worse the out-of-sample result. Breeding produced the lowest IC and the worst Sharpe; dynamic weighting was no better than equal weight; plain equal weighting won. The breeder reliably selected lookback variants that fit recent noise and generalized worse — textbook evolutionary overfitting.

I spent an overnight run searching over the regularizers (sample floor, promotion margin, breeding cadence) trying to fix it. The best you can do is slow the breeding down enough that it stops actively hurting and merely matches equal weight. In my search, nothing made it add value. On thin, noisy signals, the adaptive estimate is itself too noisy to weight on, so adaptation chases noise and the no-estimation baseline wins.

Finding 2: IC and PnL can point in opposite directions

This one surprised me. I decomposed each strategy into gross (pre-cost) and net Sharpe, and looked at the sign of IC vs the sign of gross return:

Momentum: negative rank-IC, yet positive gross Sharpe.
Mean-reversion: positive rank-IC, yet negative gross Sharpe.

How? Rank-IC weights every bar equally — it's democratic. Sharpe is dominated by the few large moves — it's tail-weighted. Momentum is wrong on many small bars but catches the fat-tailed trends. Mean-reversion is right on many small bars and gets destroyed by those same tails.
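A toy example shows how the two signs can split. The seven bars below are invented to mimic the momentum case: the signal is wrong on six small bars and right on one large one. The per-bar, unannualized Sharpe and the unit long/short position are simplifying assumptions.

```python
from statistics import mean, stdev

def ranks(values):
    """1-based ranks; assumes no ties, which holds for the toy data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for position, index in enumerate(order, start=1):
        out[index] = position
    return out

def rank_ic(signal, forward_return):
    """Spearman rank correlation via Pearson correlation of ranks."""
    rs, rf = ranks(signal), ranks(forward_return)
    ms, mf = mean(rs), mean(rf)
    cov = sum((a - ms) * (b - mf) for a, b in zip(rs, rf))
    var_s = sum((a - ms) ** 2 for a in rs)
    var_f = sum((b - mf) ** 2 for b in rf)
    return cov / (var_s * var_f) ** 0.5

def gross_sharpe(signal, forward_return):
    """Per-bar Sharpe of a unit long/short position, before costs."""
    pnl = [(1 if s > 0 else -1) * r for s, r in zip(signal, forward_return)]
    return mean(pnl) / stdev(pnl)

# Hypothetical bars: six small losers, one fat-tailed winner.
signal = [-0.3, -0.2, -0.1, 0.05, 0.1, 0.2, 0.3]
forward_return = [0.003, 0.002, 0.001, 0.050, -0.001, -0.002, -0.003]

ic = rank_ic(signal, forward_return)
sharpe = gross_sharpe(signal, forward_return)
print(ic < 0, sharpe > 0)  # True True
```

The rank IC comes out at about -0.79 because six of seven bars are ranked the wrong way round, yet the single 5% bar outweighs the six small losses, so mean PnL and gross Sharpe are positive.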

The implication is brutal for the whole design: the mechanism's fitness function is IC. When IC and PnL disagree in sign, maximizing IC steers the committee toward the factor that is gross-unprofitable. The "smart" thing was optimizing the wrong objective. (And no, switching the fitness to realized PnL didn't save it — the rolling PnL estimate is even noisier, and equal-weight still beat it. The problem isn't which metric; it's that on thin alpha, any per-component fitness is too noisy to weight on.)

Finding 3: across ~94 factors, exactly one survives a strict holdout

Maybe the mechanism failed because I fed it junk: correlated price factors with no real edge. Fair. So I took the whole factor library (~94 cross-sectional factors: momentum, vol, range, funding, OI, order flow, residual/beta, technicals, news sentiment) and ran each through a proper gate.

"Proper" is the operative word, and here's where I caught myself cheating. My first holdout learned each factor's profitable sign on one set of coins and tested on another set over the same time period. Because a factor's sign is consistent across coins, the sign transferred trivially, and the test effectively reported |in-sample Sharpe|. It "found" 16 independent edges. The tell: factors with negative full-sample Sharpe were showing positive "holdout" Sharpe, and the positive-rate was 1.00 across the board. For factors with no real edge, an honest out-of-sample test should put that rate near 0.5.

I tightened it to learn the sign on an early time window and validate on a locked later window. Now it "found" 13. Better, but still dominated by the low-vol/range/beta cluster — which I already knew, from prior work, are beta proxies that collapse under a strict coin-held-out test.
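The difference between the leaky gate and the time-locked one fits in a few lines. This is a sketch with invented returns for a single factor that has no lasting edge; `learn_sign` and `apply_sign` are hypothetical helper names.

```python
from statistics import mean

def learn_sign(returns):
    """Direction to trade a factor: +1 if its mean return is positive, else -1."""
    return 1.0 if mean(returns) > 0 else -1.0

def apply_sign(sign, returns):
    """Returns of trading the factor in the learned direction."""
    return [sign * r for r in returns]

# Hypothetical long/short factor returns per period.
group_a_early = [-0.020, -0.010, -0.030]  # coins used to learn the sign
group_b_early = [-0.015, -0.020, -0.010]  # other coins, SAME period
late_window = [0.010, 0.020, -0.005]      # later, locked period

sign = learn_sign(group_a_early)

# Leaky gate: different coins but the same period, so the sign carries over.
leaky = apply_sign(sign, group_b_early)

# Time-locked gate: the sign has to survive into a period it never saw.
honest = apply_sign(sign, late_window)

print(mean(leaky) > 0, mean(honest) > 0)  # True False
```

In the leaky version the flipped sign turns a negative factor into a positive "holdout" result by construction, which is the |in-sample Sharpe| effect described above. Only the time-locked version can actually fail.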

So I stopped hand-rolling and ran the candidates through the platform's gold-standard gate: select a decorrelated factor set on group-A coins over the train window, freeze it, then validate on group-B coins that never touched selection, over a time-locked window, with a bootstrap CI on the locked Sharpe. Four configurations. All four failed. In-sample Sharpe of 1.4–1.7 collapsed to −0.4 on held-out coins, with the CI comfortably spanning zero.
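For the last step of that gate, here is a minimal sketch of a percentile bootstrap confidence interval on a Sharpe ratio. It assumes a plain i.i.d. resample of per-bar returns, which ignores autocorrelation; a block bootstrap may suit real return series better, and the post does not say which variant the platform uses. The function names and the input series are illustrative.

```python
import random
from statistics import mean, stdev

def sharpe(returns):
    """Per-bar, unannualized Sharpe; zero if the sample has no variance."""
    s = stdev(returns)
    return mean(returns) / s if s > 0 else 0.0

def bootstrap_sharpe_ci(returns, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample bars with replacement, recompute Sharpe."""
    rng = random.Random(seed)
    n = len(returns)
    stats = sorted(
        sharpe([returns[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical locked-window returns with exactly zero mean.
locked_returns = [0.01, -0.01] * 50
lo, hi = bootstrap_sharpe_ci(locked_returns)
spans_zero = lo < 0 < hi
```

A factor passes this kind of check only if the whole interval sits above zero. An interval that straddles zero, as in this toy series, means the locked Sharpe is indistinguishable from no edge.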

The progression is the whole point: 16 → 13 → 0. The number of "edges" went to zero exactly as the holdout got honest. Of everything in the library, only one factor — funding crowding-reversal, traded slowly — has ever survived this gate, and it's a modest ~0.7 Sharpe that barely clears the bar.

What it all means

Put together:

The evolutionary weighting/breeding — the "intelligence" of the system — is net-harmful on thin, noisy signals, regardless of whether the signal is junk or a genuine edge. Equal weight is the robust default.
The reason generalizes: an adaptive weighting scheme only adds value when each component's skill can be estimated with low variance relative to the spread of true skill. On thin alpha that condition fails, and a non-adaptive baseline dominates.
And separately: in this data (price/volume + funding + derivatives), there is almost no diverse, real, cross-sectional edge to begin with. One lonely survivor.
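The condition on adaptive weighting can be put in rough numbers. When the true IC is near zero, the standard error of a rank correlation over n independent observations is approximately 1/sqrt(n - 1). The sketch below compares that noise with an assumed spread of true skill between components. The 0.02 spread and the window lengths are hypothetical, chosen only to show the scale of the problem.

```python
import math

def ic_standard_error(n_obs):
    """Approximate standard error of a rank correlation when the true IC is near zero.

    Assumes n_obs independent observations. Overlapping forward returns
    shrink the effective sample size, so this figure is optimistic.
    """
    return 1.0 / math.sqrt(n_obs - 1)

def noise_to_spread(n_obs, true_ic_spread):
    """Estimation noise divided by the spread of true skill across components."""
    return ic_standard_error(n_obs) / true_ic_spread

# Hypothetical: components whose true ICs differ by about 0.02.
for n in (101, 501, 2501, 10001):
    print(n, round(ic_standard_error(n), 4), round(noise_to_spread(n, 0.02), 2))
```

Under these assumptions the noise only shrinks to the size of the spread at about 2,500 independent observations. On shorter rolling windows the ranking of components is mostly sampling error, which is the regime where a non-adaptive baseline tends to win.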

So the bottleneck was never the mechanism. It's the lack of diverse, genuine edges to feed it. A smarter committee can't manufacture alpha that isn't in the inputs. You need new information — on-chain flows, liquidations, term structure, things outside the price tape — before the orchestration layer has anything worth orchestrating.

And there's a blunt constraint on every result above: it ran only on the data I'd already wired in — price, volume, funding, basic derivatives. The genuinely high-value feeds — liquidation heatmaps, on-chain whale flows, term structure, order-book depth — simply aren't in that set yet. So these findings are as much a verdict on my current data subscriptions as on the factors themselves; I haven't even shown most of the interesting inputs to the system. The architecture is deliberately open about this: you can add any whitelisted external data source, turn its metrics into factors, and push them through the exact same honest gates. That's how the search for a real edge is meant to widen — far past the handful of feeds I happened to start with.

The reframe that makes this a product, not a graveyard

Here's the thing every test above shares: it ran on the bare system. None of them injected a human's actual edge — a real mechanical rule, a private factor, a thesis. By design, the platform is bring-your-own-edge: it's a transparent harness — independent agents, no-lookahead evaluation, an honest scoreboard — for you to plug your own rules and factors into and see if they survive the same gates that killed everything above.

Read that way, the negative results aren't "AI can't trade." They're "the empty harness has no alpha, exactly as it should." The harness's job isn't to be smart; it's to be honest — to tell you, ruthlessly, whether the thing you believe in actually holds up out of sample. Most things won't. That's not a bug; that's the most useful thing a trading tool can do for you.

That's also why I'm writing this instead of a launch post. The interesting asset here isn't a money printer. It's a discipline: walk-forward, coin-and-time double holdout, multiple-testing awareness, and a willingness to publish the runs that said "no" — including the one where my own broken gate told me I'd found 16 edges, and I almost believed it.

And here's why none of this leaves me discouraged: every negative result above is a property of the empty harness — not a verdict on what a real edge can do inside it. The hard, unglamorous part — an evaluation that refuses to lie to you — is built. The part with no ceiling — what happens when someone plugs in a mechanical rule or a private factor that actually holds up out of sample — hasn't even started. That's not a dead end. It's a blank canvas with an honest referee already standing on it.

If you want to poke holes in the evaluation, or bring a real edge and watch it get stress-tested, it's a transparent, paper-only, non-custodial arena. And if you've found a way to gate against overfitting that's stricter than coin-and-time double holdout, I genuinely want to hear it.

Open the arena →

Stack, for the curious: FastAPI + vanilla JS, DeepSeek / NVIDIA NIM for the agents, pandas for the evaluation. Everything above is paper-trading and research — not investment advice.
