Karpathy’s autonomous experimentation loop — applied to Google Tag Manager configs instead of neural nets. Modify container → deploy to staging → measure → keep or revert → repeat.
| Phase | Trigger | Model | Purpose |
|---|---|---|---|
| Exploration | score < 0.92 | Claude Sonnet | Broad mutation coverage at low cost |
| Escalation | first time score ≥ 0.92 | (switch event) | Reset plateau counter, announce swap |
| Refinement | score ≥ 0.92 thereafter | Claude Opus 4.6 | Deeper reasoning to squeeze last points |
The plateau stop fires after 3 consecutive non-improving rounds following escalation. Hitting 0.92 on Sonnet does not stop the run — it escalates.

Safety limits:

- `MAX_ROUNDS = 30`
- `MAX_REGRESSIONS = 3` reverts
- `MAX_JSON_FAILURES = 5` invalid mutations from the active model

Each round writes a `RoundRecord` to KV; a full run writes a `RunManifest`. `program.md` is the only human input — clients, strategy, and constraints are all declarative there.

Karpathy’s autoresearch treats ML training as an optimization problem an agent can iterate on. GTM AutoResearch borrows the exact structure — swap `train.py` for a GTM container JSON, swap `val_bpb` for a signal quality score, and let a two-tier Sonnet→Opus 4.6 model stack do the mutating.
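The escalation and stop rules above can be sketched as a small state machine. `LoopState`, `nextModel`, and `shouldStop` are illustrative names, not the project's actual code:

```typescript
// Sketch of the two-tier model gate and stop conditions (names are illustrative).

type Model = "claude-sonnet" | "claude-opus-4.6";

interface LoopState {
  round: number;
  bestScore: number;
  escalated: boolean;    // one-way: flips true the first time score >= 0.92
  plateauRounds: number; // consecutive non-improving rounds after escalation
  regressions: number;   // reverted patches
  jsonFailures: number;  // invalid mutations from the active model
}

const THRESHOLD = 0.92;
const MAX_ROUNDS = 30;
const MAX_REGRESSIONS = 3;
const MAX_JSON_FAILURES = 5;
const PLATEAU_LIMIT = 3;

function nextModel(state: LoopState): Model {
  // Crossing the threshold escalates; the switch never reverses.
  return state.escalated ? "claude-opus-4.6" : "claude-sonnet";
}

function applyScore(state: LoopState, score: number): LoopState {
  const improved = score > state.bestScore;
  const crossed = !state.escalated && score >= THRESHOLD;
  return {
    ...state,
    round: state.round + 1,
    bestScore: improved ? score : state.bestScore,
    escalated: state.escalated || crossed,
    // Plateau counter resets on the escalation event and on any improvement.
    plateauRounds:
      crossed || improved ? 0 : state.escalated ? state.plateauRounds + 1 : 0,
  };
}

function shouldStop(state: LoopState): boolean {
  return (
    state.round >= MAX_ROUNDS ||
    state.regressions >= MAX_REGRESSIONS ||
    state.jsonFailures >= MAX_JSON_FAILURES ||
    // Plateau stop only fires after escalation, i.e. on Opus.
    (state.escalated && state.plateauRounds >= PLATEAU_LIMIT)
  );
}
```

Note the asymmetry: a Sonnet plateau keeps exploring, because only the post-escalation counter feeds `shouldStop`.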
| Karpathy autoresearch | GTM AutoResearch |
|---|---|
| train.py | container JSON |
| val_bpb (bits per byte) | signal quality score |
| model architecture mutation | tag / trigger / variable mutation |
| fixed 5-min training budget | fixed 5-min training budget |
| validation split | 24-hr signal window |
| program.md → skill file | program.md → SKILL.md |
`program.md` encodes domain rules into `SKILL.md`, which the agent reads each round.

The scorer is a structural evaluator that reduces a GTM container to a single number. Nine weighted dimensions — each answers a question a senior tracking engineer would ask in a code review.
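One way a nine-dimension evaluator can collapse to a single number is a weight-normalized mean. The dimension names and unit weights below are assumptions; only the double-weighted Meta Ads dimension comes from the tuning notes elsewhere in this guide:

```typescript
// Illustrative scorer: nine weighted dimensions collapsed to one number in [0, 1].
// Dimension names and weights are assumptions, not the real rubric.

type Dimension =
  | "namingConsistency" | "triggerHygiene" | "variableReuse"
  | "consentCoverage" | "dedupCorrectness" | "dataLayerSchema"
  | "pausedTagDebt" | "folderStructure" | "metaAdsAlignment";

const WEIGHTS: Record<Dimension, number> = {
  namingConsistency: 1, triggerHygiene: 1, variableReuse: 1,
  consentCoverage: 1, dedupCorrectness: 1, dataLayerSchema: 1,
  pausedTagDebt: 1, folderStructure: 1,
  metaAdsAlignment: 2, // weighted x2, per the tuning notes
};

function scoreContainer(dims: Record<Dimension, number>): number {
  // Each dimension is a 0..1 answer to one reviewer question; the container
  // score is the weight-normalized mean, so it also lands in [0, 1].
  let total = 0;
  let weightSum = 0;
  for (const [dim, w] of Object.entries(WEIGHTS) as [Dimension, number][]) {
    total += w * dims[dim];
    weightSum += w;
  }
  return total / weightSum;
}
```

With this shape, a perfect container scores 1.0 and a failed double-weighted dimension costs twice as much as any other.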
Three headline commands drive the system: run the loop, evaluate a template, or hydrate a template with client-specific values. All via npx tsx — no build step.
| Script | Purpose |
|---|---|
| npm run gtm-loop | run the GTM experimentation loop |
| npm run eval:gtm | score a single container JSON |
| npm run discover | discover Claude Code session logs |
| npm run extract | extract tool / MCP / package signals |
| npm run enrich | enrich signals with context |
| npm run actor | fire the Apify plugin watcher |
| npm run analyze | adjacency gap analysis |
| npm run generate | generate Obsidian experiment notes |
| npm run loop | the full self-improvement engine loop |
| npm run watch | fswatch-driven autorun on log changes |
| npm run typecheck | tsc --noEmit |
`gtm-loop` is the headline. The other scripts power the broader auto-research engine that watches Claude Code sessions and feeds it experiment ideas — see the Pipeline tab.

Beyond the core loop, experiment outputs feed a full fine-tune pipeline that produces client-specialized LLMs. Each night’s winning configs become training data for a brain that knows your tracking stack.
- `ExperimentRecord` — one row per round
- `INSERT OR IGNORE` for idempotency
- `AccountState` blob the agent conditions on each round

| | Track A | Track B |
|---|---|---|
| Where | OpenAI cloud | Local Ollama, M3 Ultra |
| Good for | best quality | privacy + zero cost per token |
| Models | gpt-4o-mini / 4o | llama3 / qwen2.5-coder |
| Registry | shared model registry with versioned tags | shared model registry with versioned tags |
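A dependency-free sketch of the round persistence described above. The real project validates records with zod; the table name, column names, and `toParams` helper here are invented for illustration:

```typescript
// Sketch of the experiment row and its idempotent insert.
// Table and column names are assumptions.

interface ExperimentRecord {
  runId: string;
  round: number;
  patchSummary: string;
  score: number;
  kept: boolean;
}

// INSERT OR IGNORE makes re-runs idempotent: replaying a round that was
// already recorded (assuming run_id + round is the primary key) is a no-op.
const INSERT_SQL = `
INSERT OR IGNORE INTO experiments (run_id, round, patch_summary, score, kept)
VALUES (?, ?, ?, ?, ?)
`.trim();

function toParams(r: ExperimentRecord): [string, number, string, number, number] {
  // SQLite has no boolean type, so `kept` is stored as 0/1.
  return [r.runId, r.round, r.patchSummary, r.score, r.kept ? 1 : 0];
}
```

The idempotency matters because a crashed overnight run can simply be restarted; already-recorded rounds insert as no-ops.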
Listens on `:18789` and routes GTM-related prompts to the client-specialized brain.

Run it overnight. Morning deliverables: a staging workspace ready to publish, a versioned config in R2, and a full experiment log. Expect ~100 experiments over a weekend — the loop never publishes live.
| Deliverable | What you get |
|---|---|
| Staging workspace | GTM workspace with winning config — one-click publish when you’re ready |
| Versioned JSON | winning-config.json stored in R2 — rollback to any previous night’s best |
| Experiment log | every patch tested, scored, kept or reverted — full audit trail with diffs |
| Playwright QA | each experiment validated in staging preview — tag firing, params, dedup all checked |
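The dedup portion of the Playwright QA step can reduce to a pure check like this, run over the events captured from the staging preview. The `FiredEvent` shape is an assumption:

```typescript
// Sketch of one QA assertion: after driving the staging preview, verify that
// no event_id fires twice (e.g. browser pixel + server-side duplicate).
// The captured-event shape is an assumption.

interface FiredEvent {
  name: string;
  eventId: string;
}

function findDuplicateEventIds(events: FiredEvent[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const e of events) {
    if (seen.has(e.eventId)) dupes.add(e.eventId);
    seen.add(e.eventId);
  }
  return [...dupes];
}
```

Keeping the check pure means the browser-driving code only has to collect events; the pass/fail logic stays trivially testable.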
TypeScript-first, minimal dependencies, no build step. The loop is small enough to read in one sitting — that’s a feature.
| Package | Purpose |
|---|---|
| zod ^3.22 | ExperimentRecord + AccountState schemas |
| dotenv ^16.4 | .env loader |
| chalk ^5.3 | terminal colors for loop output |
| ora ^8.0 | spinners for long-running rounds |
| tsx ^4.11 (dev) | run .ts directly, no build |
| typescript ^5.4 (dev) | typecheck only (tsc --noEmit) |
Claude Sonnet explores while score < 0.92. First cross of the threshold escalates one-way to Claude Opus 4.6 for refinement. Plateau stop only fires on Opus.
Read / write staging workspace, apply container JSON, never publish to live.
Versioned storage for winning-config.json — one object per night.
Webhook receiver for Apify actor completions — feeds the auto-research engine.
Plugin marketplace watcher. REST only, no SDK pulled in.
Routes prompts to the fine-tuned client brain once Phase 5 lands.
Conventions:

- Every script runs as `npx tsx scripts/<name>.ts`
- Errors are written to `data/errors/{timestamp}.log`; scripts never crash silently
- Log lines are prefixed `[Phase0]`, `[Phase1]`, etc.
- Run history lives in `data/signals/run-history.json`
- `scripts/run-all.sh` chains the full pipeline in order

Clone, configure, and let it loop overnight. This guide itself is a single-file HTML deploy — the pattern scales down to docs and up to the loop runner.
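The logging conventions can be sketched as two tiny helpers. Both function names and the exact timestamp format are illustrative, not the project's real utilities:

```typescript
// Sketch of the logging conventions: [PhaseN]-prefixed lines, and a
// filename-safe path under data/errors/ for error logs.
// Names and timestamp format are assumptions.

function phaseLog(phase: number, message: string): string {
  return `[Phase${phase}] ${message}`;
}

function errorLogPath(now: Date = new Date()): string {
  // ISO timestamps contain ':' and '.', which are awkward in filenames,
  // so swap them for '-' and drop the millisecond suffix.
  const stamp = now
    .toISOString()
    .replace(/[:.]/g, "-")
    .replace(/-\d{3}Z$/, "");
  return `data/errors/${stamp}.log`;
}
```

Every script would wrap its entry point in a try/catch that writes the stack to `errorLogPath()` before exiting nonzero, satisfying the "never crash silently" rule.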
Listens on `:18789` and is the request-routing entry point.

One `index.html`, one Wrangler deploy. Same ethos as the loop itself — small, legible, cheap to run.

The workspace ID should always be a staging name. Publishing stays manual.
Retain N nights so regressions can be bisected. Cheap storage, expensive lessons.
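The retention tip can be implemented as a small pruning helper over R2 object keys. The `night/winning-config.json` key layout is an assumption:

```typescript
// Sketch of N-night retention: given object keys like
// "2024-06-01/winning-config.json", keep the newest N nights and return
// the rest for deletion. The key layout is an assumption.

function keysToPrune(keys: string[], retainNights: number): string[] {
  // ISO date prefixes sort lexicographically in chronological order.
  const nights = [...new Set(keys.map((k) => k.split("/")[0]))].sort();
  const keep = new Set(nights.slice(-retainNights));
  return keys.filter((k) => !keep.has(k.split("/")[0]));
}
```

Run it against a key listing before each night's upload; anything it returns is safe to delete without losing the bisection window.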
Meta Ads alignment is weighted ×2. Adjust if your revenue mix differs.
~$0.60/client on Sonnet, plus a short Opus 4.6 tail after escalation. Set a monthly cap on your Anthropic key.
5 minutes of human review catches anything the 9-dim scorer doesn’t see.
Clone shopify-ecom-web.json as a starting point for new verticals.