AUTONOMOUS RESEARCH // TRACKING OPTIMIZATION

GTM AutoResearch

Karpathy’s autonomous experimentation loop — applied to Google Tag Manager configs instead of neural nets. Modify container → deploy to staging → measure → keep or revert → repeat.

Two-tier model strategy: Claude Sonnet drives exploration while the score is below 0.92. The first time the score reaches 0.92, the loop escalates one-way to Claude Opus 4.6 for the remaining rounds. Plateau detection only begins after escalation.
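A minimal sketch of that routing rule, assuming a `routeModel` helper — the names here are illustrative, not the repo's actual API:

```typescript
// One-way two-tier routing: Sonnet explores, Opus refines.
// Hypothetical helper — the real logic lives in scripts/run-gtm-loop.ts.
type ModelTier = "sonnet" | "opus";

const ESCALATION_THRESHOLD = 0.92;

function routeModel(current: ModelTier, score: number): ModelTier {
  // Once escalated, never drop back to Sonnet (one-way per run).
  if (current === "opus") return "opus";
  // First cross of the threshold escalates for all remaining rounds.
  return score >= ESCALATION_THRESHOLD ? "opus" : "sonnet";
}
```

The one-way check comes first so a later score dip on Opus cannot de-escalate the run.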
Sonnet explores
Opus 4.6 refines
9-dim scorer
Never live-publishes
~$0.60 / client (Sonnet) + Opus tail
Playwright QA
feature/finetune-pipeline

System Diagram

OPERATOR (you)
  npx tsx scripts/run-gtm-loop.ts
        │ reads
        ▼
program.md (human input)
  DOCUMENTATION/loops/gtm-autoresearch/program.md
  • clients[]       → id, template, meta snapshot, eval path
  • strategyOrder[] → mutation priorities
  • constraints[]   → never-break rules
        │ parsed into ProgramConfig
        ▼
RUN-GTM-LOOP (orchestrator) — scripts/run-gtm-loop.ts
  for each client:
    ROUND LOOP (max 30 rounds)
      SCORE (eval) ──▶ PROMPT (build) ──▶ MUTATE (Claude) ──▶ VALIDATE + keep/revert
            ▲                                  │                        │
            │                                  ▼                        │
            │                           MODEL ROUTER                    │
            │                             score < 0.92         → Claude Sonnet
            │                             first cross of 0.92  → ESCALATE
            │                             score ≥ 0.92         → Claude Opus 4.6
            └──────────────────── next round ◀──────────────────────────┘
    Stop when (after escalation):
      score ≥ 0.92 for 3× on Opus │ 30 rounds │ 3 regressions

        │ invokes           │ loads            │ loads               │ spawns              │ writes
        ▼                   ▼                  ▼                     ▼                     ▼
  EVAL                TEMPLATE (JSON)    META ADS SNAPSHOT     TWO-TIER MODEL        KV / R2 store
  evals/eval_gtm_     content/clients/   (JSON)                (subprocess)          scripts/lib/
  signal_quality.ts   <id>/shopify-      data/clients/<id>/    < 0.92: Sonnet        kv-store.ts
                      ecom-web.json      meta-ads-             ≥ 0.92: Opus 4.6      • seed
                                         snapshot.json         CLAUDE_PATH=          • rounds
                                                               ~/.local/bin/claude   • manifest
                                                               mutates 1 file:
                                                               container JSON

  EVAL returns GtmSignalQualityResult { score, dimensions{9}, issues[] }
        ▼
  9-DIMENSION SCORER                        KV / R2 ──▶ CF R2: winning-config.json
    1. Tag coverage                                         ▼
    2. Param completeness                   MORNING DELIVERABLES
    3. Deduplication                          • staging workspace
    4. Consent Mode v2                        • versioned JSON in R2
    5. Naming conventions                     • experiment log
    6. Variable hygiene                       • Playwright QA report
    7. Trigger quality
    8. Folder organization
    9. Meta Ads alignment (weighted by $)

Per-round Data Flow

container.json ──▶ [EVAL] ──▶ score + issues
        │
        ▼
strategyOrder + constraints + issues
        │
        ▼
[BUILD PROMPT] (mutation budget: 3 edits)
        │
        ▼
[MODEL ROUTER]
  currentModel === "sonnet" && score < 0.92
    └──▶ Claude Sonnet (exploration)
  score ≥ 0.92 on first cross
    └──▶ ESCALATE — switch to Opus 4.6
  currentModel === "opus"
    └──▶ Claude Opus 4.6 (escalation)
        │
        ▼
[model subprocess] ── returns JSON Patch
        │
        ▼
[RE-EVAL on mutated container]
        │
   ┌────┴──────────┐
   ▼               ▼
score improved   score dropped
   │               │
   ▼               ▼
KEEP patch       REVERT to prior
write round      increment
record           regression count
   └───────┬───────┘
           ▼
   putRound(KV) → next round

Model Escalation — Quick Reference

Phase        Trigger                    Model            Purpose
Exploration  score < 0.92               Claude Sonnet    Broad mutation coverage at low cost
Escalation   first time score ≥ 0.92    (switch event)   Reset plateau counter, announce swap
Refinement   score ≥ 0.92 thereafter    Claude Opus 4.6  Deeper reasoning to squeeze last points

Stop Conditions

  • Plateau (Opus only): score ≥ 0.92 for 3 consecutive rounds after escalation. Hitting 0.92 on Sonnet does not stop the run — it escalates.
  • Budget: MAX_ROUNDS = 30
  • Instability: MAX_REGRESSIONS = 3 reverts
  • Parse failures: MAX_JSON_FAILURES = 5 invalid mutations from the active model
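The four stop conditions can be sketched as a single check run after each round — the `LoopState` shape and function name are illustrative, not the orchestrator's real types:

```typescript
// Stop-condition check, built from the documented limits.
interface LoopState {
  model: "sonnet" | "opus";
  round: number;        // rounds completed
  regressions: number;  // reverted patches this run
  jsonFailures: number; // invalid mutations from the active model
  plateau: number;      // consecutive rounds ≥ 0.92 on Opus
}

const MAX_ROUNDS = 30;
const MAX_REGRESSIONS = 3;
const MAX_JSON_FAILURES = 5;

function shouldStop(s: LoopState): string | null {
  // Plateau only counts after escalation — 0.92 on Sonnet escalates instead.
  if (s.model === "opus" && s.plateau >= 3) return "plateau";
  if (s.round >= MAX_ROUNDS) return "budget";
  if (s.regressions >= MAX_REGRESSIONS) return "instability";
  if (s.jsonFailures >= MAX_JSON_FAILURES) return "parse-failures";
  return null; // keep looping
}
```

Ordering the plateau check behind the `model === "opus"` guard is what encodes "hitting 0.92 on Sonnet does not stop the run."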

Invariants

  • Loop never publishes to live GTM — staging workspace only.
  • Every mutation is idempotent: re-running doesn’t duplicate tags/triggers.
  • Each round writes a RoundRecord to KV; full run writes a RunManifest.
  • program.md is the only human input — clients, strategy, and constraints are all declarative there.
  • Escalation is one-way per run — once on Opus, stays on Opus.

Explore the Guide

Loop
The five-step experimentation cycle. Fixed windows, single metric, agent mutates one file.
karpathy · train.py
Scorer
The 9 structural dimensions that turn container quality into one number.
signal quality · eval
Usage
Three commands: run the loop, run the eval, hydrate a template.
tsx · npm
Pipeline
Six phases from experiment log to client-specialized fine-tuned brain.
fine-tune · flywheel
Results
Overnight deliverables: staging workspace, versioned JSON in R2, full audit log.
staging · R2
Stack
TypeScript, Zod, SQLite, Chroma, Playwright — with Sonnet → Opus 4.6 as the mutator.
typescript · zod
Deploy
Self-host the loop — env setup, OpenClaw integration on :18789, R2 versioning.
cloudflare · openclaw

Quick Links

Repository: github.com/Organized-AI
Karpathy’s autoresearch: same loop, different domain
Pipeline docs: 6-phase deep dive
organizedai.vip: ecosystem home
// CORE IDEA

The Experimentation Loop

Karpathy’s autoresearch treats ML training as an optimization problem an agent can iterate on. GTM AutoResearch borrows the exact structure — swap train.py for a GTM container JSON, swap val_bpb for a signal quality score, and let a two-tier Sonnet → Opus 4.6 model stack do the mutating.

The Five Steps

1. Modify config: Sonnet (or Opus 4.6 post-escalation) proposes up to 3 edits
2. Validate: JSON Patch applied to staging container — never live
3. Re-score: 9-dim structural eval on the mutated JSON
4. Keep / revert: if score improves, keep the diff; otherwise roll back + regression++
5. Repeat / escalate: first cross of 0.92 swaps the model to Opus 4.6

Why It’s the Same Loop

Karpathy autoresearch          GTM AutoResearch
train.py                       container JSON
val_bpb (bits per byte)        signal quality score
model architecture mutation    tag / trigger / variable mutation
fixed 5-min training budget    fixed 5-min training budget
validation split               24-hr signal window
program.md → skill file        program.md → SKILL.md
Why it works: standardized measurement windows mean every variation is directly comparable. One number, one file of truth, no operator guessing which experiment “felt” better.

The Agent’s Contract

  • One file to modify — the container JSON. Nothing else.
  • One metric to beat — signal quality score. No vanity metrics.
  • One directive — program.md encodes domain rules into SKILL.md, which the agent reads each round.
  • No live traffic — all changes land in a staging workspace. A human publishes when ready.

The Round

// one round, simplified — two-tier model routing
const before = await score(config)
const model = currentModel // "sonnet" | "opus"
const patch = await claude(model, promptFrom(config, skill, issues))
const after = await score(apply(config, patch))

if (after > before) { keep(patch); log("kept", after) }
else { revert(); log("revert", before) }

// one-way escalation: Sonnet → Opus 4.6
if (model === "sonnet" && after >= 0.92) {
  currentModel = "opus"
  plateauCount = 0 // reset; plateau only counts on Opus
}
// SIGNAL QUALITY

The 9-Dimension Scorer

A structural evaluator that reduces a GTM container to a single number. Nine weighted dimensions — each answers a question a senior tracking engineer would ask in a code review.

Why structural: live signal comparisons take days. Structural scoring runs in seconds on the JSON alone — fast enough for 100 experiments over a weekend.

The Nine Dimensions

1. Tag coverage: ecom events + infra tags present
2. Parameter completeness: required params populated on every tag
3. Deduplication: event ID generator configured and wired
4. Consent Mode v2: GCS/GCD signals plumbed correctly
5. Naming conventions: tags / triggers / variables follow house rules
6. Variable hygiene: no orphans, no duplicates, typed cleanly
7. Trigger quality: specific, non-overlapping, correctly scoped
8. Folder organization: tags grouped by purpose, not dumped flat
9. Meta Ads alignment: weighted by conversion value — biggest lever
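One plausible way the weighting collapses nine dimensions into one number — only the x2 weight on Meta Ads alignment is documented here; equal weights on the other eight are an assumption:

```typescript
// Hypothetical aggregation for the 9-dimension scorer.
// Only the x2 Meta Ads weight is from the docs; the rest are assumed 1.
const WEIGHTS: Record<string, number> = {
  tag_coverage: 1, param_completeness: 1, deduplication: 1,
  consent_mode_v2: 1, naming: 1, variable_hygiene: 1,
  trigger_quality: 1, folder_org: 1, meta_ads_alignment: 2,
};

function overallScore(dims: Record<string, number>): number {
  let weighted = 0, total = 0;
  for (const [dim, w] of Object.entries(WEIGHTS)) {
    weighted += (dims[dim] ?? 0) * w; // missing dimension scores as 0
    total += w;
  }
  return weighted / total; // normalized to 0..1
}
```

Under these assumed weights, the sample standalone-eval output in this guide lands at ≈0.80, in line with its reported 0.79 overall.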

Running the Eval Standalone

# score a single template without running the full loop
npx tsx evals/eval_gtm_signal_quality.ts \
  content/gtm-templates/shopify-ecom-web.json

# output
tag_coverage        0.88
param_completeness  0.71
deduplication       1.00
consent_mode_v2     0.60
naming              0.92
variable_hygiene    0.83
trigger_quality     0.77
folder_org          0.95
meta_ads_alignment  0.65  // weighted x2

overall: 0.79

How the Score Drives the Agent

  • Each round, the lowest-scoring dimension is surfaced to the mutation prompt as a target.
  • The active model (Sonnet while score < 0.92, Opus 4.6 after escalation) proposes a minimal patch aimed at that target — never sweeping rewrites.
  • If the patch lifts the overall score, it’s kept. If it regresses a different dimension, it’s reverted.
  • The agent sees the full 9-tuple after each round — the feedback loop is tight and interpretable.
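Picking the round's target from the 9-tuple is a one-liner in spirit — a sketch, with `weakestDimension` as a hypothetical name for whatever the prompt builder actually calls:

```typescript
// Surface the lowest-scoring dimension as the mutation target.
// Illustrative helper; the real prompt builder lives in the orchestrator.
function weakestDimension(dims: Record<string, number>): [string, number] {
  // Reduce over entries, keeping whichever dimension scores lowest.
  return Object.entries(dims).reduce((worst, cur) =>
    cur[1] < worst[1] ? cur : worst
  );
}
```

Targeting only the weakest dimension is what keeps patches minimal: the prompt asks for a lift in one place rather than a sweeping rewrite.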
// COMMANDS

Usage

Three headline commands drive the system: run the loop, evaluate a template, or hydrate a template with client-specific values. All via npx tsx — no build step.

Run the Loop

# the overnight run
npx tsx scripts/run-gtm-loop.ts

# or via the npm script alias
npm run gtm-loop

Run the Eval Standalone

# score one container without mutating it
npx tsx evals/eval_gtm_signal_quality.ts \
  content/gtm-templates/shopify-ecom-web.json

Hydrate a Template

# inject client values into a template scaffold
npx tsx scripts/hydrate-gtm-template.ts client-config.json

All npm Scripts

Script             Purpose
npm run gtm-loop   run the GTM experimentation loop
npm run eval:gtm   score a single container JSON
npm run discover   discover Claude Code session logs
npm run extract    extract tool / MCP / package signals
npm run enrich     enrich signals with context
npm run actor      fire the Apify plugin watcher
npm run analyze    adjacency gap analysis
npm run generate   generate Obsidian experiment notes
npm run loop       the full self-improvement engine loop
npm run watch      fswatch-driven autorun on log changes
npm run typecheck  tsc --noEmit
Note: gtm-loop is the headline. The other scripts power the broader auto-research engine that watches Claude Code sessions and feeds it experiment ideas — see the Pipeline tab.

Cost Calibration

  • Up to 30 rounds per client, mutation budget ~3 edits / round
  • Claude Sonnet drives exploration while score < 0.92: ~$0.60 per full client run
  • Claude Opus 4.6 runs only after escalation — typically the last few rounds to refine past the threshold
  • Escalation is one-way per run; plateau stop (≥ 0.92 × 3) only fires on Opus
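A back-of-envelope cost model, for budgeting only — the ~$0.60/client Sonnet figure is from this doc; the per-round Opus price below is a made-up placeholder you should replace with your own observed spend:

```typescript
// Rough per-client cost estimate for the two-tier loop.
const SONNET_RUN_COST = 0.60;                 // documented: ~full Sonnet exploration run
const SONNET_PER_ROUND = SONNET_RUN_COST / 30; // assumes the full 30-round budget
const OPUS_PER_ROUND = 0.15;                   // ASSUMPTION — placeholder, not a real price

function estimateClientCost(sonnetRounds: number, opusRounds: number): number {
  return sonnetRounds * SONNET_PER_ROUND + opusRounds * OPUS_PER_ROUND;
}
```

The shape matters more than the constants: total spend is dominated by how early the run escalates, since every Opus round costs several Sonnet rounds.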
// FINE-TUNE PIPELINE

Six-Phase Pipeline

Beyond the core loop, experiment outputs feed a full fine-tune pipeline that produces client-specialized LLMs. Each night’s winning configs become training data for a brain that knows your tracking stack.

The compounding move: loop outputs become training data. Training data becomes a fine-tuned client brain. The brain makes next week’s experiments smarter. This is the flywheel.

The Six Phases

Phase 1 — Experiment Logger: Zod + SQLite WAL + idempotent writes
Phase 2 — Account State Collector: full AccountState via MCP tool calls
Phase 3 — JSONL Training Data: score filter + Chroma dedup + quality gates
Phase 4 — Fine-Tune Runner: Track A (OpenAI cloud) vs Track B (Ollama local)
Phase 5 — OpenClaw Client Brain: request routing through OpenClaw :18789
Phase 6 — The Flywheel: watcher events + drift detection + auto-rollback

Phase 1: Experiment Logger

  • Schema: Zod-validated ExperimentRecord — one row per round
  • Storage: SQLite with WAL mode, INSERT OR IGNORE for idempotency
  • CLI: export / count / import for moving experiments between environments
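The idempotency guarantee can be sketched without SQLite at all — here an in-memory map stands in for `INSERT OR IGNORE` on an assumed `(runId, round)` primary key; the record shape is illustrative:

```typescript
// Idempotent round logging, sketched with a Map standing in for SQLite's
// INSERT OR IGNORE on a (runId, round) primary key.
interface ExperimentRecord { runId: string; round: number; score: number }

const rows = new Map<string, ExperimentRecord>();

function insertOrIgnore(rec: ExperimentRecord): boolean {
  const key = `${rec.runId}:${rec.round}`;
  if (rows.has(key)) return false; // re-run: write silently ignored, no duplicate
  rows.set(key, rec);
  return true; // first write for this (runId, round) lands
}
```

Keying on `(runId, round)` is what makes crash-and-restart safe: replaying a run's rounds cannot double-count experiments.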

Phase 2: Account State Collector

  • Inputs: GTM containers, Google Ads accounts, Meta (Pipeboard) ad accounts
  • Method: MCP tool call map rendered into a system prompt
  • Output: a single AccountState blob the agent conditions on each round

Phase 3: JSONL Training Data

TRAINING DATA BUILDER

experiments.sqlite [Phase 1]
        │
        ▼
┌──────────────┐
│ score filter │  keep only rounds above threshold
└──────┬───────┘
       ▼
┌──────────────┐
│ Chroma dedup │  drop near-duplicate patches
└──────┬───────┘
       ▼
training.jsonl → Phase 4
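The two quality gates can be sketched as a pure function — a token-level Jaccard similarity stands in for the real Chroma embedding dedup, and the thresholds and names are illustrative:

```typescript
// Phase 3 sketch: score filter, then near-duplicate drop.
interface Round { patch: string; score: number }

function jaccard(a: string, b: string): number {
  // Token-set overlap — a cheap stand-in for embedding similarity.
  const A = new Set(a.split(/\s+/)), B = new Set(b.split(/\s+/));
  const inter = [...A].filter(t => B.has(t)).length;
  return inter / (A.size + B.size - inter);
}

function buildTrainingSet(rounds: Round[], minScore = 0.8, maxSim = 0.9): Round[] {
  const kept: Round[] = [];
  for (const r of rounds.filter(x => x.score >= minScore)) {
    const isDup = kept.some(k => jaccard(k.patch, r.patch) > maxSim);
    if (!isDup) kept.push(r); // keep only novel, high-scoring patches
  }
  return kept;
}
```

Filtering before deduplicating keeps the pairwise comparisons cheap: only rounds that survive the score gate are ever embedded.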

Phase 4: Fine-Tune Runner — Dual Track

          Track A                Track B
Where     OpenAI cloud           Local Ollama, M3 Ultra
Good for  best quality           privacy + zero cost per token
Models    gpt-4o-mini / 4o       llama3 / qwen2.5-coder
Registry  shared model registry  shared model registry
          with versioned tags    with versioned tags

Phase 5: OpenClaw Client Brain

  • OpenClaw listens on :18789 and routes GTM-related prompts to the client-specialized brain
  • Middleware stack handles auth, logging, and fallback to a generalist model if the fine-tune is cold
  • Same request shape as a normal LLM call — the brain swap is invisible to callers
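A routing sketch under stated assumptions — the `cold` / `driftFlagged` fields, the GTM-prompt heuristic, and the backend names are all illustrative, not OpenClaw's actual API:

```typescript
// Fallback routing sketch for the Phase 5 brain swap.
interface Brain { name: string; cold: boolean; driftFlagged: boolean }

function pickBackend(prompt: string, clientBrain: Brain): string {
  // ASSUMED heuristic for "GTM-related"; the real router may classify differently.
  const gtmRelated = /gtm|tag manager|container/i.test(prompt);
  if (!gtmRelated) return "generalist";
  // Middleware falls back if the fine-tune is cold or drift-flagged.
  if (clientBrain.cold || clientBrain.driftFlagged) return "generalist";
  return clientBrain.name; // brain swap is invisible to the caller
}
```

Because the function only picks a backend and never changes the request shape, callers cannot tell whether the generalist or the client brain answered.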

Phase 6: The Flywheel

  • Watcher events: loop completions trigger a lightweight rebuild check
  • Drift detection: if the live client brain starts scoring below a new generalist baseline, auto-rollback
  • Compounding: every night’s winners improve next week’s starting prompts
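The drift check reduces to a comparison against the generalist baseline — the averaging window, margin, and function name here are assumptions, not the watcher's actual logic:

```typescript
// Drift detection sketch: roll back the client brain when its recent
// scores fall below the generalist baseline by more than a margin.
function shouldRollback(
  recentBrainScores: number[],
  generalistBaseline: number,
  margin = 0.02, // ASSUMED tolerance to avoid flapping on noise
): boolean {
  if (recentBrainScores.length === 0) return false; // no data, no rollback
  const avg = recentBrainScores.reduce((a, b) => a + b, 0) / recentBrainScores.length;
  return avg < generalistBaseline - margin;
}
```

The margin exists so a single noisy night does not bounce the router between brains; rollback should mean sustained underperformance.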
// MORNING DELIVERABLES

What You Wake Up To

Run it overnight. Morning deliverables: a staging workspace ready to publish, a versioned config in R2, and a full experiment log. ~100 experiments over a weekend — never publishes live.

The operator contract: the loop never touches production. A human reviews the winner and clicks publish. Safety is architectural, not a flag you could forget.

The Four Deliverables

Deliverable        What you get
Staging workspace  GTM workspace with winning config — one-click publish when you’re ready
Versioned JSON     winning-config.json stored in R2 — rollback to any previous night’s best
Experiment log     every patch tested, scored, kept or reverted — full audit trail with diffs
Playwright QA      each experiment validated in staging preview — tag firing, params, dedup all checked

Morning Ritual

// morning deliverables
const workspace = await gtm.getWorkspace("autoresearch-nightly")

// review what changed
console.log(workspace.changelog)
// → 14 tags modified
// → score: 0.72 → 0.91
// → 47 experiments run

// happy? one-click publish
await gtm.publishWorkspace(workspace)

// or grab the JSON for review
const config = await r2.get("winning-config.json")
await gtm.importContainer(config)

Typical Weekend Outcome

Rounds run     ~100
Rounds kept    ~25 (quality gate rejects ~75%)
Score lift     0.72 → 0.91 typical
Tags modified  10–20 across consent, dedup, params
Total cost     ~$0.60 / client on Sonnet + Opus 4.6 tail after escalation
Operator time  5 minutes: review changelog, publish

Full Audit Trail

  • Every round’s before / after container JSONs stored alongside the score delta
  • Every rejected patch logged with its regression dimension
  • Diffs viewable per round — nothing disappears silently
  • R2 keeps N nights of winners so you can bisect regressions across weeks
// IMPLEMENTATION

Stack & Conventions

TypeScript-first, minimal dependencies, no build step. The loop is small enough to read in one sitting — that’s a feature.

Runtime

TypeScript
tsx runner
Node 18+
no compile step
fswatch

Dependencies

Package                Purpose
zod ^3.22              ExperimentRecord + AccountState schemas
dotenv ^16.4           .env loader
chalk ^5.3             terminal colors for loop output
ora ^8.0               spinners for long-running rounds
tsx ^4.11 (dev)        run .ts directly, no build
typescript ^5.4 (dev)  typecheck only (tsc --noEmit)
Why so few deps: SQLite is stdlib-adjacent, MCP tool calls go via Claude Code, Cloudflare bits come from Wrangler. Less surface area means faster overnight reliability.

External Services

Anthropic API — Two-Tier

Claude Sonnet explores while score < 0.92. First cross of the threshold escalates one-way to Claude Opus 4.6 for refinement. Plateau stop only fires on Opus.

Google Tag Manager API

Read / write staging workspace, apply container JSON, never publish to live.

Cloudflare R2

Versioned storage for winning-config.json — one object per night.

Cloudflare Worker

Webhook receiver for Apify actor completions — feeds the auto-research engine.

Apify

Plugin marketplace watcher. REST only, no SDK pulled in.

OpenClaw (:18789)

Routes prompts to the fine-tuned client brain once Phase 5 lands.

Conventions

  • All scripts run via npx tsx scripts/<name>.ts
  • All outputs are idempotent — re-running never duplicates
  • Errors logged to data/errors/{timestamp}.log, never crash silently
  • Console logs use phase prefix: [Phase0], [Phase1], etc.
  • Every run writes a manifest to data/signals/run-history.json
  • scripts/run-all.sh chains the full pipeline in order

Repo Layout

# top-level
content/        # GTM templates (shopify-ecom-web.json, …)
data/           # SQLite, signals, error logs
evals/          # eval_gtm_signal_quality.ts + others
scripts/        # run-gtm-loop.ts, hydrate-gtm-template.ts, loops/
DOCUMENTATION/  # phase deep dives
CLAUDE.md       # agent conventions
.env.example    # required vars
// SELF-HOST

Deploy & Run

Clone, configure, and let it loop overnight. This guide itself is a single-file HTML deploy — the pattern scales down to docs and up to the loop runner.

1. Clone & Install

git clone https://github.com/Organized-AI/gtm-autoresearch
cd gtm-autoresearch
git checkout feature/finetune-pipeline
npm install

2. Configure .env

cp .env.example .env

# fill in:
ANTHROPIC_API_KEY=sk-ant-…
GTM_ACCOUNT_ID=
GTM_CONTAINER_ID=
GTM_WORKSPACE_ID=autoresearch-nightly
R2_BUCKET=gtm-winners
OBSIDIAN_VAULT_PATH=/path/to/vault

3. Typecheck & Dry Run

npm run typecheck

# one eval to sanity-check the scorer
npm run eval:gtm -- content/gtm-templates/shopify-ecom-web.json

4. Launch the Loop

# headless overnight run
npm run gtm-loop

# or the full auto-research pipeline (watches Claude Code logs too)
npm run watch

Integration with OpenClaw

  • OpenClaw runs on port :18789 and is the request-routing entry point
  • Point Phase 5’s brain endpoint at OpenClaw so fine-tuned outputs serve GTM prompts automatically
  • Fallback middleware sends to a generalist model if the client brain is cold or drift-flagged

Deploying These Docs

# the pattern that deployed this page
wrangler pages project create gtm-autoresearch-guide --production-branch=main
wrangler pages deploy gtm-autoresearch \
  --project-name=gtm-autoresearch-guide \
  --branch=main \
  --commit-dirty=true
Single-file docs: no build tooling, no framework. One index.html, one Wrangler deploy. Same ethos as the loop itself — small, legible, cheap to run.

Operational Checklist

Never publish live

The workspace ID should always be a staging name. Publishing stays manual.

Keep R2 history

Retain N nights so regressions can be bisected. Cheap storage, expensive lessons.

Watch the scorer weights

Meta Ads alignment is weighted x2. Adjust if your revenue mix differs.

Budget the two-tier spend

~$0.60/client on Sonnet, plus a short Opus 4.6 tail after escalation. Set a monthly cap on your Anthropic key.

Review changelog daily

5 minutes of human review catches anything the 9-dim scorer doesn’t see.

Fork the template

Clone shopify-ecom-web.json as a starting point for new verticals.
