My AI trading committee is up 15.7% in 8 days. Here's why I still don't trust it.

The follow-up to "I stress-tested a self-evolving AI crypto trader" — this time the number is green, and the discipline matters more, not less.

The number

One of my arenas (a committee of LLM agents trading BTC perpetuals on 30-minute bars, paper-only) is up 15.7% in 8 days. Over the same window, BTC itself is down about 4%. The mechanical baseline (same entry/exit shell, same risk rules, but no committee judgment) is down 17.5%.

arena +15.7% · BTC −4% · judgment-free baseline −17.5% · 29 closed · 55% win rate · 8 days

So the committee didn't just ride the market. It was short through most of a drawdown, on a thesis it stated in plain language every cycle ("trend"), while the judgment-free version of itself was getting chopped to pieces. A 33-point spread over its own baseline in 8 days.

Every "AI trading" site on the internet would put that curve on a landing page. This post is about why I won't — and what the same dashboard says when you look at the parts that don't go up and to the right.

What actually produced the gain

Three honest observations before anyone gets excited:

1. It was mostly one call. The committee opened a short and held it for days through a falling market. The win rate across 29 closed trades is 55% — barely above a coin flip. The equity curve wasn't built by being right often; it was built by being right once, in size, for a long time. That's exactly the tail-weighted PnL structure I wrote about in the last post: profit concentrates in a few big moves. Which also means: one regime, one thesis, essentially one big data point.

2. Eight days is nothing. Twenty-nine closed trades is nothing. My own platform stamps this arena's record accordingly — the scoreboard literally shows an "immature" warning next to the return figure. I built that label to protect users from exactly the excitement I'm feeling right now.

3. The market handed it a trend. A trend-following committee in a trending week looks like a genius. The real test is what it does in the chop that killed its baseline twin — and eight days doesn't contain enough chop to know.
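
The first observation, a mediocre win rate next to a curve carried by one trade, is easy to quantify. Here is a minimal sketch, assuming all you have is a list of per-trade PnLs. The function names and the example numbers are made up for illustration; they are not this arena's trades or the platform's code.

```python
def win_rate(pnls):
    """Fraction of closed trades with positive PnL."""
    return sum(1 for p in pnls if p > 0) / len(pnls)


def top_k_share(pnls, k=1):
    """Share of gross profit (sum of winning trades) contributed by the k largest winners."""
    wins = sorted((p for p in pnls if p > 0), reverse=True)
    gross = sum(wins)
    if gross == 0:
        return 0.0
    return sum(wins[:k]) / gross


# Hypothetical PnLs in percent of equity, NOT the arena's real trades.
example = [12.0, 1.0, 0.5, 0.5, -1.0, -1.0, -0.5, 1.0, -2.0, 1.0]

print(f"win rate: {win_rate(example):.0%}")
print(f"share of gross profit from the single best trade: {top_k_share(example, 1):.0%}")
```

When the top-1 share is high and the win rate sits near 50%, the record is effectively one observation, however many trades the counter shows.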

The part that surprised me

Here's the detail I find genuinely interesting, and it's not the headline number.

Running inside every arena is a live A/B experiment: the "evolved" committee (champion variants, dynamic IC weighting) against a plain equal-weight version of the same signals, scored walk-forward with no lookahead. In all my earlier mechanism research — and in the demo arena right now — equal weight wins. That was the central negative result of the last post: the "smart" layer usually overfits.
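
To make "scored walk-forward with no lookahead" concrete, here is a rough sketch of how such an A/B comparison can be set up. It assumes IC means the Pearson correlation between a forecast and the return that follows, and it uses a simple positive-IC weighting rule that I picked for illustration. The function names, the `warmup` parameter and the weighting rule are all assumptions, not the arena's actual evolution layer.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def walk_forward_ab(signals, returns, warmup=20):
    """Compare equal-weight and IC-weighted forecasts out of sample.

    signals[t] holds one score per agent, known at bar t.
    returns[t] is the return realized after bar t.
    Returns (ic_equal_weight, ic_weighted).
    """
    n_agents = len(signals[0])
    eq_fc, wt_fc, realized = [], [], []
    for t in range(warmup, len(signals)):
        # Weights come from bars strictly before t, so there is no lookahead.
        past_returns = returns[:t]
        ics = [
            pearson([signals[s][a] for s in range(t)], past_returns)
            for a in range(n_agents)
        ]
        positive = [max(ic, 0.0) for ic in ics]
        total = sum(positive)
        if total > 0:
            weights = [p / total for p in positive]
        else:
            weights = [1.0 / n_agents] * n_agents  # fall back to equal weight
        eq_fc.append(sum(signals[t]) / n_agents)
        wt_fc.append(sum(w * s for w, s in zip(weights, signals[t])))
        realized.append(returns[t])
    return pearson(eq_fc, realized), pearson(wt_fc, realized)
```

Both arms see the same signals at the same bars and are scored on the same realized returns, so the only difference between them is the weighting layer.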

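On this arena, as of today, the information coefficient (IC, the correlation between a signal and the return that follows) is +0.036 for the evolved committee and −0.026 for equal weight, on 146 matched samples. The smart layer is ahead, for the first time I've seen it.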

Do I believe it? Not yet. An IC of 0.036 on 146 samples is within noise. It could easily flip next week. But this is exactly what an honest harness is for: I published "the evolution layer doesn't help" when the data said so, and I get to watch — in public, with no lookahead — whether this arena becomes the counterexample or regresses to the mean like everything else. Either outcome teaches something. That's the deal.
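
A back-of-the-envelope check on "within noise", using the Fisher z approximation for a correlation coefficient. This is a sketch under a strong assumption: the 146 samples are treated as independent, which consecutive 30-minute bars probably are not, so the real uncertainty is likely wider. The function names are mine.

```python
import math


def fisher_z_score(ic, n):
    """z-score of a sample correlation against a true correlation of zero."""
    if n <= 3:
        raise ValueError("need more than 3 samples")
    return math.atanh(ic) * math.sqrt(n - 3)


def ic_noise_band(n, z_crit=1.96):
    """Largest |IC| still consistent with zero true correlation at roughly the 95% level."""
    if n <= 3:
        raise ValueError("need more than 3 samples")
    return math.tanh(z_crit / math.sqrt(n - 3))


def samples_needed(ic, z_crit=1.96):
    """Independent samples needed before an IC of this size clears the band."""
    return math.ceil((z_crit / math.atanh(abs(ic))) ** 2 + 3)


n = 146
print(f"noise band at n={n}: +/-{ic_noise_band(n):.3f}")
print(f"z for evolved IC +0.036: {fisher_z_score(0.036, n):.2f}")
print(f"z for equal-weight IC -0.026: {fisher_z_score(-0.026, n):.2f}")
print(f"samples needed for an IC of 0.036: {samples_needed(0.036)}")
```

At 146 samples the band is roughly ±0.16, so +0.036 and −0.026 both sit well inside it. Under the same assumptions, an IC of 0.036 would need on the order of 3,000 independent samples before it stood out from zero. Because the two arms are scored on matched samples, testing the difference between them properly needs a paired method, which this sketch does not attempt.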

What would actually change my mind

I wrote down, before this run, what evidence would make me trust an arena. Nothing about that list changes because the curve is green:

Volume: 100+ closed trades, not 29.
Regimes: profits in trend and chop, not one directional week.
The A/B holding: the evolved layer still beating equal weight when the sample is 5× larger.
Costs: the edge surviving realistic fees and slippage (paper fills are kind).
No survivorship: this is one arena of several I run. Some of the others are flat or down. Picking the winner after the fact and telling its story is the oldest trick in the industry — consider this paragraph the disclosure.

Until then, +15.7% is a good week, not a track record.
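
The measurable parts of that list can be written down as an explicit gate. This is a sketch, not the platform's code: the field names and criterion labels are hypothetical, and anything I have not measured yet is left as `None` and treated as unmet rather than guessed. The survivorship item is a disclosure about which arenas get reported, so it is not a per-arena check.

```python
from dataclasses import dataclass
from typing import Optional

BASELINE_AB_SAMPLES = 146  # A/B sample count quoted in this post


@dataclass
class ArenaRecord:
    closed_trades: int
    ab_samples: int
    evolved_ic: float
    equal_weight_ic: float
    trend_pnl_pct: Optional[float] = None  # None means "not measured yet"
    chop_pnl_pct: Optional[float] = None
    net_return_after_costs_pct: Optional[float] = None


def unmet_criteria(r):
    """Return the labels of the trust criteria this record does not yet satisfy."""
    unmet = []
    if r.closed_trades < 100:
        unmet.append("volume")
    regimes = (r.trend_pnl_pct, r.chop_pnl_pct)
    if any(p is None or p <= 0 for p in regimes):
        unmet.append("regimes")
    if r.ab_samples < 5 * BASELINE_AB_SAMPLES or r.evolved_ic <= r.equal_weight_ic:
        unmet.append("ab_holding")
    if r.net_return_after_costs_pct is None or r.net_return_after_costs_pct <= 0:
        unmet.append("costs")
    return unmet


# Only the figures stated in this post are filled in; the rest stay None.
today = ArenaRecord(closed_trades=29, ab_samples=146,
                    evolved_ic=0.036, equal_weight_ic=-0.026)
print(unmet_criteria(today))
```

The point of fixing the thresholds in advance is that a green curve cannot quietly lower them.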

It's live on the leaderboard — come watch it

This arena is public. Right now it sits at #1 on the leaderboard by copy-trading return — which I'm obliged to immediately deflate: the public leaderboard currently has four arenas on it, so "first place" is a modest podium. Still, everything above is verifiable there in real time: the trades, the committee's stated reasoning each cycle, the A/B experiment, the immature label — none of it is a screenshot I curated for this post.

If you want more than watching, it's open for copy-subscription ($9.90/mo). Every open and close paper-copies to your own independent ledger. If you really want to, you can route it to your own exchange keys: it's non-custodial and defaults to dry-run (no orders). I'd say what I'd say to a friend: if you flip it to real, do it with an amount you'd be comfortable losing entirely, because eight days of green is not evidence. The subscription buys you a front-row seat to an experiment, not a yield.

Why publish it at all, then?

Because the alternative is worse. If I only publish negative results, that's its own bias — a kind of performative pessimism. The honest position is to publish the process: here's the number, here's the baseline it beat, here's the sample size, here's the label my own system puts on it, here's what would falsify my caution. The scoreboard is public and updates live; you don't have to take my word for where it goes from here.

You can watch this arena — judgment, trades, A/B experiment and all. If it falls apart in the chop, you'll see that too. That's the product.

Watch it live →

Paper-trading and research. Past performance — especially eight days of it — is not indicative of anything. Not investment advice.
