AUTONOMOUS RESEARCH // TRACKING OPTIMIZATION

GTM AutoResearch

Karpathy’s autonomous experimentation loop — applied to Google Tag Manager configs instead of neural nets. Modify container → deploy to staging → measure → keep or revert → repeat.

Two-tier model strategy: Claude Sonnet drives exploration while the score is below 0.92. The first time the score reaches 0.92, the loop escalates one-way to Claude Opus 4.6 for the remaining rounds. Plateau detection only begins after escalation.
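A minimal sketch of that routing rule, assuming a `routeModel` helper — the names here are illustrative, not the repo's actual API:

```typescript
// One-way two-tier routing: Sonnet explores, Opus refines.
// Hypothetical helper — the real logic lives in scripts/run-gtm-loop.ts.
type ModelTier = "sonnet" | "opus";

const ESCALATION_THRESHOLD = 0.92;

function routeModel(current: ModelTier, score: number): ModelTier {
  // Once escalated, never drop back to Sonnet (one-way per run).
  if (current === "opus") return "opus";
  // First cross of the threshold escalates for all remaining rounds.
  return score >= ESCALATION_THRESHOLD ? "opus" : "sonnet";
}
```

The one-way check comes first so a later score dip on Opus cannot de-escalate the run.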
Sonnet explores
Opus 4.6 refines
9-dim scorer
Never live-publishes
~$0.60 / client (Sonnet) + Opus tail
Playwright QA
feature/finetune-pipeline

System Diagram

OPERATOR (you)
  npx tsx scripts/run-gtm-loop.ts
        │ reads
        ▼
program.md (human input)
  DOCUMENTATION/loops/gtm-autoresearch/program.md
  • clients[]       → id, template, meta snapshot, eval path
  • strategyOrder[] → mutation priorities
  • constraints[]   → never-break rules
        │ parsed into ProgramConfig
        ▼
RUN-GTM-LOOP (orchestrator) — scripts/run-gtm-loop.ts
  for each client:
    ROUND LOOP (max 30 rounds)
      SCORE (eval) ──▶ PROMPT (build) ──▶ MUTATE (Claude) ──▶ VALIDATE + keep/revert
            ▲                                  │                        │
            │                                  ▼                        │
            │                           MODEL ROUTER                    │
            │                             score < 0.92         → Claude Sonnet
            │                             first cross of 0.92  → ESCALATE
            │                             score ≥ 0.92         → Claude Opus 4.6
            └──────────────────── next round ◀──────────────────────────┘
    Stop when (after escalation):
      score ≥ 0.92 for 3× on Opus │ 30 rounds │ 3 regressions

        │ invokes           │ loads            │ loads               │ spawns              │ writes
        ▼                   ▼                  ▼                     ▼                     ▼
  EVAL                TEMPLATE (JSON)    META ADS SNAPSHOT     TWO-TIER MODEL        KV / R2 store
  evals/eval_gtm_     content/clients/   (JSON)                (subprocess)          scripts/lib/
  signal_quality.ts   <id>/shopify-      data/clients/<id>/    < 0.92: Sonnet        kv-store.ts
                      ecom-web.json      meta-ads-             ≥ 0.92: Opus 4.6      • seed
                                         snapshot.json         CLAUDE_PATH=          • rounds
                                                               ~/.local/bin/claude   • manifest
                                                               mutates 1 file:
                                                               container JSON

  EVAL returns GtmSignalQualityResult { score, dimensions{9}, issues[] }
        ▼
  9-DIMENSION SCORER                        KV / R2 ──▶ CF R2: winning-config.json
    1. Tag coverage                                         ▼
    2. Param completeness                   MORNING DELIVERABLES
    3. Deduplication                          • staging workspace
    4. Consent Mode v2                        • versioned JSON in R2
    5. Naming conventions                     • experiment log
    6. Variable hygiene                       • Playwright QA report
    7. Trigger quality
    8. Folder organization
    9. Meta Ads alignment (weighted by $)

Per-round Data Flow

container.json ──▶ [EVAL] ──▶ score + issues
        │
        ▼
strategyOrder + constraints + issues
        │
        ▼
[BUILD PROMPT] (mutation budget: 3 edits)
        │
        ▼
[MODEL ROUTER]
  currentModel === "sonnet" && score < 0.92
    └──▶ Claude Sonnet (exploration)
  score ≥ 0.92 on first cross
    └──▶ ESCALATE — switch to Opus 4.6
  currentModel === "opus"
    └──▶ Claude Opus 4.6 (escalation)
        │
        ▼
[model subprocess] ── returns JSON Patch
        │
        ▼
[RE-EVAL on mutated container]
        │
   ┌────┴──────────┐
   ▼               ▼
score improved   score dropped
   │               │
   ▼               ▼
KEEP patch       REVERT to prior
write round      increment
record           regression count
   └───────┬───────┘
           ▼
   putRound(KV) → next round

Model Escalation — Quick Reference

Phase        Trigger                    Model            Purpose
Exploration  score < 0.92               Claude Sonnet    Broad mutation coverage at low cost
Escalation   first time score ≥ 0.92    (switch event)   Reset plateau counter, announce swap
Refinement   score ≥ 0.92 thereafter    Claude Opus 4.6  Deeper reasoning to squeeze last points

Stop Conditions

  • Plateau (Opus only): score ≥ 0.92 for 3 consecutive rounds after escalation. Hitting 0.92 on Sonnet does not stop the run — it escalates.
  • Budget: MAX_ROUNDS = 30
  • Instability: MAX_REGRESSIONS = 3 reverts
  • Parse failures: MAX_JSON_FAILURES = 5 invalid mutations from the active model
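The four stop conditions can be sketched as a single check run after each round — the `LoopState` shape and function name are illustrative, not the orchestrator's real types:

```typescript
// Stop-condition check, built from the documented limits.
interface LoopState {
  model: "sonnet" | "opus";
  round: number;        // rounds completed
  regressions: number;  // reverted patches this run
  jsonFailures: number; // invalid mutations from the active model
  plateau: number;      // consecutive rounds ≥ 0.92 on Opus
}

const MAX_ROUNDS = 30;
const MAX_REGRESSIONS = 3;
const MAX_JSON_FAILURES = 5;

function shouldStop(s: LoopState): string | null {
  // Plateau only counts after escalation — 0.92 on Sonnet escalates instead.
  if (s.model === "opus" && s.plateau >= 3) return "plateau";
  if (s.round >= MAX_ROUNDS) return "budget";
  if (s.regressions >= MAX_REGRESSIONS) return "instability";
  if (s.jsonFailures >= MAX_JSON_FAILURES) return "parse-failures";
  return null; // keep looping
}
```

Ordering the plateau check behind the `model === "opus"` guard is what encodes "hitting 0.92 on Sonnet does not stop the run."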

Invariants

  • Loop never publishes to live GTM — staging workspace only.
  • Every mutation is idempotent: re-running doesn’t duplicate tags/triggers.
  • Each round writes a RoundRecord to KV; full run writes a RunManifest.
  • program.md is the only human input — clients, strategy, and constraints are all declarative there.
  • Escalation is one-way per run — once on Opus, stays on Opus.

Explore the Guide

Loop
The five-step experimentation cycle. Fixed windows, single metric, agent mutates one file.
karpathy · train.py
Scorer
The 9 structural dimensions that turn container quality into one number.
signal quality · eval
Usage
Three commands: run the loop, run the eval, hydrate a template.
tsx · npm
Pipeline
Six phases from experiment log to client-specialized fine-tuned brain.
fine-tune · flywheel
Results
Overnight deliverables: staging workspace, versioned JSON in R2, full audit log.
staging · R2
Stack
TypeScript, Zod, SQLite, Chroma, Playwright — with Sonnet → Opus 4.6 as the mutator.
typescript · zod
Deploy
Self-host the loop — env setup, OpenClaw integration on :18789, R2 versioning.
cloudflare · openclaw

Quick Links

Repository: github.com/Organized-AI
Karpathy’s autoresearch: same loop, different domain
Pipeline docs: 6-phase deep dive
organizedai.vip: ecosystem home
// CORE IDEA

The Experimentation Loop

Karpathy’s autoresearch treats ML training as an optimization problem an agent can iterate on. GTM AutoResearch borrows the exact structure — swap train.py for a GTM container JSON, swap val_bpb for a signal quality score, and let a two-tier Sonnet → Opus 4.6 model stack do the mutating.

The Five Steps

1. Modify config: Sonnet (or Opus 4.6 post-escalation) proposes up to 3 edits
2. Validate: JSON Patch applied to staging container — never live
3. Re-score: 9-dim structural eval on the mutated JSON
4. Keep / revert: if score improves, keep the diff; otherwise roll back + regression++
5. Repeat / escalate: first cross of 0.92 swaps the model to Opus 4.6

Why It’s the Same Loop

Karpathy autoresearch          GTM AutoResearch
train.py                       container JSON
val_bpb (bits per byte)        signal quality score
model architecture mutation    tag / trigger / variable mutation
fixed 5-min training budget    fixed 5-min training budget
validation split               24-hr signal window
program.md → skill file        program.md → SKILL.md
Why it works: standardized measurement windows mean every variation is directly comparable. One number, one file of truth, no operator guessing which experiment “felt” better.

The Agent’s Contract

  • One file to modify — the container JSON. Nothing else.
  • One metric to beat — signal quality score. No vanity metrics.
  • One directive — program.md encodes domain rules into SKILL.md, which the agent reads each round.
  • No live traffic — all changes land in a staging workspace. A human publishes when ready.

The Round

// one round, simplified — two-tier model routing
const before = await score(config)
const model = currentModel // "sonnet" | "opus"
const patch = await claude(model, promptFrom(config, skill, issues))
const after = await score(apply(config, patch))

if (after > before) { keep(patch); log("kept", after) }
else { revert(); log("revert", before) }

// one-way escalation: Sonnet → Opus 4.6
if (model === "sonnet" && after >= 0.92) {
  currentModel = "opus"
  plateauCount = 0 // reset; plateau only counts on Opus
}
// SIGNAL QUALITY

The 9-Dimension Scorer

A structural evaluator that reduces a GTM container to a single number. Nine weighted dimensions — each answers a question a senior tracking engineer would ask in a code review.

Why structural: live signal comparisons take days. Structural scoring runs in seconds on the JSON alone — fast enough for 100 experiments over a weekend.

The Nine Dimensions

1. Tag coverage: ecom events + infra tags present
2. Parameter completeness: required params populated on every tag
3. Deduplication: event ID generator configured and wired
4. Consent Mode v2: GCS/GCD signals plumbed correctly
5. Naming conventions: tags / triggers / variables follow house rules
6. Variable hygiene: no orphans, no duplicates, typed cleanly
7. Trigger quality: specific, non-overlapping, correctly scoped
8. Folder organization: tags grouped by purpose, not dumped flat
9. Meta Ads alignment: weighted by conversion value — biggest lever
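One plausible way the weighting collapses nine dimensions into one number — only the x2 weight on Meta Ads alignment is documented here; equal weights on the other eight are an assumption:

```typescript
// Hypothetical aggregation for the 9-dimension scorer.
// Only the x2 Meta Ads weight is from the docs; the rest are assumed 1.
const WEIGHTS: Record<string, number> = {
  tag_coverage: 1, param_completeness: 1, deduplication: 1,
  consent_mode_v2: 1, naming: 1, variable_hygiene: 1,
  trigger_quality: 1, folder_org: 1, meta_ads_alignment: 2,
};

function overallScore(dims: Record<string, number>): number {
  let weighted = 0, total = 0;
  for (const [dim, w] of Object.entries(WEIGHTS)) {
    weighted += (dims[dim] ?? 0) * w; // missing dimension scores as 0
    total += w;
  }
  return weighted / total; // normalized to 0..1
}
```

Under these assumed weights, the sample standalone-eval output in this guide lands at ≈0.80, in line with its reported 0.79 overall.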

Running the Eval Standalone

# score a single template without running the full loop
npx tsx evals/eval_gtm_signal_quality.ts \
  content/gtm-templates/shopify-ecom-web.json

# output
tag_coverage        0.88
param_completeness  0.71
deduplication       1.00
consent_mode_v2     0.60
naming              0.92
variable_hygiene    0.83
trigger_quality     0.77
folder_org          0.95
meta_ads_alignment  0.65  // weighted x2

overall: 0.79

How the Score Drives the Agent

  • Each round, the lowest-scoring dimension is surfaced to the mutation prompt as a target.
  • The active model (Sonnet while score < 0.92, Opus 4.6 after escalation) proposes a minimal patch aimed at that target — never sweeping rewrites.
  • If the patch lifts the overall score, it’s kept. If it regresses a different dimension, it’s reverted.
  • The agent sees the full 9-tuple after each round — the feedback loop is tight and interpretable.
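Picking the round's target from the 9-tuple is a one-liner in spirit — a sketch, with `weakestDimension` as a hypothetical name for whatever the prompt builder actually calls:

```typescript
// Surface the lowest-scoring dimension as the mutation target.
// Illustrative helper; the real prompt builder lives in the orchestrator.
function weakestDimension(dims: Record<string, number>): [string, number] {
  // Reduce over entries, keeping whichever dimension scores lowest.
  return Object.entries(dims).reduce((worst, cur) =>
    cur[1] < worst[1] ? cur : worst
  );
}
```

Targeting only the weakest dimension is what keeps patches minimal: the prompt asks for a lift in one place rather than a sweeping rewrite.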
// COMMANDS

Usage

Three headline commands drive the system: run the loop, evaluate a template, or hydrate a template with client-specific values. All via npx tsx — no build step.

Run the Loop

# the overnight run
npx tsx scripts/run-gtm-loop.ts

# or via the npm script alias
npm run gtm-loop

Run the Eval Standalone

# score one container without mutating it
npx tsx evals/eval_gtm_signal_quality.ts \
  content/gtm-templates/shopify-ecom-web.json

Hydrate a Template

# inject client values into a template scaffold
npx tsx scripts/hydrate-gtm-template.ts client-config.json

All npm Scripts

Script             Purpose
npm run gtm-loop   run the GTM experimentation loop
npm run eval:gtm   score a single container JSON
npm run discover   discover Claude Code session logs
npm run extract    extract tool / MCP / package signals
npm run enrich     enrich signals with context
npm run actor      fire the Apify plugin watcher
npm run analyze    adjacency gap analysis
npm run generate   generate Obsidian experiment notes
npm run loop       the full self-improvement engine loop
npm run watch      fswatch-driven autorun on log changes
npm run typecheck  tsc --noEmit
Note: gtm-loop is the headline. The other scripts power the broader auto-research engine that watches Claude Code sessions and feeds it experiment ideas — see the Pipeline tab.

Cost Calibration

  • Up to 30 rounds per client, mutation budget ~3 edits / round
  • Claude Sonnet drives exploration while score < 0.92: ~$0.60 per full client run
  • Claude Opus 4.6 runs only after escalation — typically the last few rounds to refine past the threshold
  • Escalation is one-way per run; plateau stop (≥ 0.92 × 3) only fires on Opus
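A back-of-envelope cost model, for budgeting only — the ~$0.60/client Sonnet figure is from this doc; the per-round Opus price below is a made-up placeholder you should replace with your own observed spend:

```typescript
// Rough per-client cost estimate for the two-tier loop.
const SONNET_RUN_COST = 0.60;                 // documented: ~full Sonnet exploration run
const SONNET_PER_ROUND = SONNET_RUN_COST / 30; // assumes the full 30-round budget
const OPUS_PER_ROUND = 0.15;                   // ASSUMPTION — placeholder, not a real price

function estimateClientCost(sonnetRounds: number, opusRounds: number): number {
  return sonnetRounds * SONNET_PER_ROUND + opusRounds * OPUS_PER_ROUND;
}
```

The shape matters more than the constants: total spend is dominated by how early the run escalates, since every Opus round costs several Sonnet rounds.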
// FINE-TUNE PIPELINE

Six-Phase Pipeline

Beyond the core loop, experiment outputs feed a full fine-tune pipeline that produces client-specialized LLMs. Each night’s winning configs become training data for a brain that knows your tracking stack.

The compounding move: loop outputs become training data. Training data becomes a fine-tuned client brain. The brain makes next week’s experiments smarter. This is the flywheel.

The Six Phases

Phase 1 — Experiment Logger: Zod + SQLite WAL + idempotent writes
Phase 2 — Account State Collector: full AccountState via MCP tool calls
Phase 3 — JSONL Training Data: score filter + Chroma dedup + quality gates
Phase 4 — Fine-Tune Runner: Track A (OpenAI cloud) vs Track B (Ollama local)
Phase 5 — OpenClaw Client Brain: request routing through OpenClaw :18789
Phase 6 — The Flywheel: watcher events + drift detection + auto-rollback

Phase 1: Experiment Logger

  • Schema: Zod-validated ExperimentRecord — one row per round
  • Storage: SQLite with WAL mode, INSERT OR IGNORE for idempotency
  • CLI: export / count / import for moving experiments between environments
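The idempotency guarantee can be sketched without SQLite at all — here an in-memory map stands in for `INSERT OR IGNORE` on an assumed `(runId, round)` primary key; the record shape is illustrative:

```typescript
// Idempotent round logging, sketched with a Map standing in for SQLite's
// INSERT OR IGNORE on a (runId, round) primary key.
interface ExperimentRecord { runId: string; round: number; score: number }

const rows = new Map<string, ExperimentRecord>();

function insertOrIgnore(rec: ExperimentRecord): boolean {
  const key = `${rec.runId}:${rec.round}`;
  if (rows.has(key)) return false; // re-run: write silently ignored, no duplicate
  rows.set(key, rec);
  return true; // first write for this (runId, round) lands
}
```

Keying on `(runId, round)` is what makes crash-and-restart safe: replaying a run's rounds cannot double-count experiments.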

Phase 2: Account State Collector

  • Inputs: GTM containers, Google Ads accounts, Meta (Pipeboard) ad accounts
  • Method: MCP tool call map rendered into a system prompt
  • Output: a single AccountState blob the agent conditions on each round

Phase 3: JSONL Training Data

TRAINING DATA BUILDER

experiments.sqlite [Phase 1]
        │
        ▼
┌──────────────┐
│ score filter │  keep only rounds above threshold
└──────┬───────┘
       ▼
┌──────────────┐
│ Chroma dedup │  drop near-duplicate patches
└──────┬───────┘
       ▼
training.jsonl → Phase 4
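The two quality gates can be sketched as a pure function — a token-level Jaccard similarity stands in for the real Chroma embedding dedup, and the thresholds and names are illustrative:

```typescript
// Phase 3 sketch: score filter, then near-duplicate drop.
interface Round { patch: string; score: number }

function jaccard(a: string, b: string): number {
  // Token-set overlap — a cheap stand-in for embedding similarity.
  const A = new Set(a.split(/\s+/)), B = new Set(b.split(/\s+/));
  const inter = [...A].filter(t => B.has(t)).length;
  return inter / (A.size + B.size - inter);
}

function buildTrainingSet(rounds: Round[], minScore = 0.8, maxSim = 0.9): Round[] {
  const kept: Round[] = [];
  for (const r of rounds.filter(x => x.score >= minScore)) {
    const isDup = kept.some(k => jaccard(k.patch, r.patch) > maxSim);
    if (!isDup) kept.push(r); // keep only novel, high-scoring patches
  }
  return kept;
}
```

Filtering before deduplicating keeps the pairwise comparisons cheap: only rounds that survive the score gate are ever embedded.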

Phase 4: Fine-Tune Runner — Dual Track

          Track A                Track B
Where     OpenAI cloud           Local Ollama, M3 Ultra
Good for  best quality           privacy + zero cost per token
Models    gpt-4o-mini / 4o       llama3 / qwen2.5-coder
Registry  shared model registry  shared model registry
          with versioned tags    with versioned tags

Phase 5: OpenClaw Client Brain

  • OpenClaw listens on :18789 and routes GTM-related prompts to the client-specialized brain
  • Middleware stack handles auth, logging, and fallback to a generalist model if the fine-tune is cold
  • Same request shape as a normal LLM call — the brain swap is invisible to callers
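A routing sketch under stated assumptions — the `cold` / `driftFlagged` fields, the GTM-prompt heuristic, and the backend names are all illustrative, not OpenClaw's actual API:

```typescript
// Fallback routing sketch for the Phase 5 brain swap.
interface Brain { name: string; cold: boolean; driftFlagged: boolean }

function pickBackend(prompt: string, clientBrain: Brain): string {
  // ASSUMED heuristic for "GTM-related"; the real router may classify differently.
  const gtmRelated = /gtm|tag manager|container/i.test(prompt);
  if (!gtmRelated) return "generalist";
  // Middleware falls back if the fine-tune is cold or drift-flagged.
  if (clientBrain.cold || clientBrain.driftFlagged) return "generalist";
  return clientBrain.name; // brain swap is invisible to the caller
}
```

Because the function only picks a backend and never changes the request shape, callers cannot tell whether the generalist or the client brain answered.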

Phase 6: The Flywheel

  • Watcher events: loop completions trigger a lightweight rebuild check
  • Drift detection: if the live client brain starts scoring below a new generalist baseline, auto-rollback
  • Compounding: every night’s winners improve next week’s starting prompts
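The drift check reduces to a comparison against the generalist baseline — the averaging window, margin, and function name here are assumptions, not the watcher's actual logic:

```typescript
// Drift detection sketch: roll back the client brain when its recent
// scores fall below the generalist baseline by more than a margin.
function shouldRollback(
  recentBrainScores: number[],
  generalistBaseline: number,
  margin = 0.02, // ASSUMED tolerance to avoid flapping on noise
): boolean {
  if (recentBrainScores.length === 0) return false; // no data, no rollback
  const avg = recentBrainScores.reduce((a, b) => a + b, 0) / recentBrainScores.length;
  return avg < generalistBaseline - margin;
}
```

The margin exists so a single noisy night does not bounce the router between brains; rollback should mean sustained underperformance.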
// MORNING DELIVERABLES

What You Wake Up To

Run it overnight. Morning deliverables: a staging workspace ready to publish, a versioned config in R2, and a full experiment log. ~100 experiments over a weekend — never publishes live.

The operator contract: the loop never touches production. A human reviews the winner and clicks publish. Safety is architectural, not a flag you could forget.

The Four Deliverables

Deliverable        What you get
Staging workspace  GTM workspace with winning config — one-click publish when you’re ready
Versioned JSON     winning-config.json stored in R2 — rollback to any previous night’s best
Experiment log     every patch tested, scored, kept or reverted — full audit trail with diffs
Playwright QA      each experiment validated in staging preview — tag firing, params, dedup all checked

Morning Ritual

// morning deliverables
const workspace = await gtm.getWorkspace("autoresearch-nightly")

// review what changed
console.log(workspace.changelog)
// → 14 tags modified
// → score: 0.72 → 0.91
// → 47 experiments run

// happy? one-click publish
await gtm.publishWorkspace(workspace)

// or grab the JSON for review
const config = await r2.get("winning-config.json")
await gtm.importContainer(config)

Typical Weekend Outcome

Rounds run     ~100
Rounds kept    ~25 (quality gate rejects ~75%)
Score lift     0.72 → 0.91 typical
Tags modified  10–20 across consent, dedup, params
Total cost     ~$0.60 / client on Sonnet + Opus 4.6 tail after escalation
Operator time  5 minutes: review changelog, publish

Full Audit Trail

  • Every round’s before / after container JSONs stored alongside the score delta
  • Every rejected patch logged with its regression dimension
  • Diffs viewable per round — nothing disappears silently
  • R2 keeps N nights of winners so you can bisect regressions across weeks
// IMPLEMENTATION

Stack & Conventions

TypeScript-first, minimal dependencies, no build step. The loop is small enough to read in one sitting — that’s a feature.

Runtime

TypeScript
tsx runner
Node 18+
no compile step
fswatch

Dependencies

Package                Purpose
zod ^3.22              ExperimentRecord + AccountState schemas
dotenv ^16.4           .env loader
chalk ^5.3             terminal colors for loop output
ora ^8.0               spinners for long-running rounds
tsx ^4.11 (dev)        run .ts directly, no build
typescript ^5.4 (dev)  typecheck only (tsc --noEmit)
Why so few deps: SQLite is stdlib-adjacent, MCP tool calls go via Claude Code, Cloudflare bits come from Wrangler. Less surface area means faster overnight reliability.

External Services

Anthropic API — Two-Tier

Claude Sonnet explores while score < 0.92. First cross of the threshold escalates one-way to Claude Opus 4.6 for refinement. Plateau stop only fires on Opus.

Google Tag Manager API

Read / write staging workspace, apply container JSON, never publish to live.

Cloudflare R2

Versioned storage for winning-config.json — one object per night.

Cloudflare Worker

Webhook receiver for Apify actor completions — feeds the auto-research engine.

Apify

Plugin marketplace watcher. REST only, no SDK pulled in.

OpenClaw (:18789)

Routes prompts to the fine-tuned client brain once Phase 5 lands.

Conventions

  • All scripts run via npx tsx scripts/<name>.ts
  • All outputs are idempotent — re-running never duplicates
  • Errors logged to data/errors/{timestamp}.log, never crash silently
  • Console logs use phase prefix: [Phase0], [Phase1], etc.
  • Every run writes a manifest to data/signals/run-history.json
  • scripts/run-all.sh chains the full pipeline in order

Repo Layout

# top-level
content/        # GTM templates (shopify-ecom-web.json, …)
data/           # SQLite, signals, error logs
evals/          # eval_gtm_signal_quality.ts + others
scripts/        # run-gtm-loop.ts, hydrate-gtm-template.ts, loops/
DOCUMENTATION/  # phase deep dives
CLAUDE.md       # agent conventions
.env.example    # required vars
// SELF-HOST

Deploy & Run

Clone, configure, and let it loop overnight. This guide itself is a single-file HTML deploy — the pattern scales down to docs and up to the loop runner.

1. Clone & Install

git clone https://github.com/Organized-AI/gtm-autoresearch
cd gtm-autoresearch
git checkout feature/finetune-pipeline
npm install

2. Configure .env

cp .env.example .env

# fill in:
ANTHROPIC_API_KEY=sk-ant-…
GTM_ACCOUNT_ID=
GTM_CONTAINER_ID=
GTM_WORKSPACE_ID=autoresearch-nightly
R2_BUCKET=gtm-winners
OBSIDIAN_VAULT_PATH=/path/to/vault

3. Typecheck & Dry Run

npm run typecheck

# one eval to sanity-check the scorer
npm run eval:gtm -- content/gtm-templates/shopify-ecom-web.json

4. Launch the Loop

# headless overnight run
npm run gtm-loop

# or the full auto-research pipeline (watches Claude Code logs too)
npm run watch

Integration with OpenClaw

  • OpenClaw runs on port :18789 and is the request-routing entry point
  • Point Phase 5’s brain endpoint at OpenClaw so fine-tuned outputs serve GTM prompts automatically
  • Fallback middleware sends to a generalist model if the client brain is cold or drift-flagged

Deploying These Docs

# the pattern that deployed this page
wrangler pages project create gtm-autoresearch-guide --production-branch=main
wrangler pages deploy gtm-autoresearch \
  --project-name=gtm-autoresearch-guide \
  --branch=main \
  --commit-dirty=true
Single-file docs: no build tooling, no framework. One index.html, one Wrangler deploy. Same ethos as the loop itself — small, legible, cheap to run.

Operational Checklist

Never publish live

The workspace ID should always be a staging name. Publishing stays manual.

Keep R2 history

Retain N nights so regressions can be bisected. Cheap storage, expensive lessons.

Watch the scorer weights

Meta Ads alignment is weighted x2. Adjust if your revenue mix differs.

Budget the two-tier spend

~$0.60/client on Sonnet, plus a short Opus 4.6 tail after escalation. Set a monthly cap on your Anthropic key.

Review changelog daily

5 minutes of human review catches anything the 9-dim scorer doesn’t see.

Fork the template

Clone shopify-ecom-web.json as a starting point for new verticals.
