gtm-autoresearch // feature/finetune-pipeline // Phase 3 of 6

JSONL Training Data Pipeline

03 · Filter · Dedup · Inject · Export
Client: HRE
System Overview
Phase 1 · Experiment Logger · complete
Phase 2 · Account State · complete
Phase 3 · JSONL Pipeline · building ← NOW
Phase 4 · Fine-Tune Runner · next
Phase 5 · OpenClaw Brain · planned
Phase 6 · Flywheel · planned
Step 1 — Score Filter

Score Distribution — typical autoresearch run · 100 experiments

0.9–1.0:  ~18
0.8–0.9:  ~24
0.75–0.8: ~13
(below threshold)
0.6–0.75: ~28
0.0–0.6:  ~17

✓ ~55 records kept per run · threshold configurable per client
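The score filter above can be sketched as a small pure function. The `ExperimentRecord` shape and `filterByScore` name are illustrative, not the repo's actual API; the boundary behavior (0.749 excluded, 0.75 included) matches the unit-test spec later in this doc.

```typescript
// Illustrative Step 1 score filter (record shape and names are assumptions).
interface ExperimentRecord {
  runId: string;
  score: number;    // 0.0–1.0 from the autoresearch evaluator
  solution: string;
}

// Keep records at or above the threshold, highest-scoring first.
function filterByScore(
  records: ExperimentRecord[],
  threshold = 0.75
): ExperimentRecord[] {
  return records
    .filter((r) => r.score >= threshold)
    .sort((a, b) => b.score - a.score);
}
```

In the real pipeline this would sit in front of the SQLite query (`SELECT * WHERE score >= threshold`); the pure-function form makes the boundary condition trivially unit-testable.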

Deduplication

Cosine similarity via Chroma :37777 · threshold 0.92
Incoming experiment vs existing training set:

"PMAX shows $0 conversion value — sGTM items mapping issue"
  vs "PMAX revenue zero — fix sGTM ecommerce.items array"
  → sim: 0.97 — DROP

"Add to Cart firing on page load instead of button click"
  vs "Purchase tag misfiring on order confirmation reload"
  → sim: 0.31 — KEEP
Prevents the model from overfitting to repeated variations of the same problem. New experiments are embedded and compared against the existing set before being written.
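The dedup decision reduces to a cosine-similarity comparison over embedding vectors. A minimal sketch, standing in for the Chroma `:37777` query (the `isDuplicate` helper and in-memory vector list are assumptions; the real check queries the `training_{client_id}` collection):

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSim(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const DEDUP_THRESHOLD = 0.92; // matches the pipeline's configured threshold

// True when the incoming embedding is a near-duplicate of any existing one.
function isDuplicate(incoming: number[], existing: number[][]): boolean {
  return existing.some((e) => cosineSim(incoming, e) >= DEDUP_THRESHOLD);
}
```

Parallel vectors score 1.0 (drop), orthogonal ones 0.0 (keep); 0.92 sits close enough to 1.0 that only rewordings of the same problem trip it.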
Step 2 — System Prompt Injection
system_prompt field anatomy — HRE account · AccountState v1.2.0 · target: ≤800 tokens

GTM Context
  GTM: GTM-XXXXXXX | ws_4 | 31 tags, 18 triggers, 22 variables
  Key tags: GA4 Config, GA4 - Purchase, AW - All Purchases, sGTM Bridge, Custom HTML - DataLayer Push
  dataLayer: page_view, view_item, add_to_cart, purchase, generate_lead
  DLV: ecommerce.value, ecommerce.items[].item_id, ecommerce.items[].price

Google Ads
  Google Ads: 123-456-7890 | USD
  Campaigns: PMAX - Core, Brand - HRE (SEARCH), Retargeting (DISPLAY)
  Conversions: All Purchases (value: ecommerce.value), Lead Submit (count)
  Enhanced Conversions: ENABLED

Meta / CAPI
  Meta Pixel: HRE-pixel-id | act_12345
  Events (browser+CAPI): PageView, ViewContent, AddToCart, Purchase
  CAPI match rate: ~68% | Stape: cnt_abc123

Memory
  Cart abandonment tracking fixed 2025-11
  dataLayer conflict resolved 2025-09
  PMAX value fix deployed 2026-01
  Lead form dedup logic added 2026-03

Known Issues
  → PMAX $0 value: sGTM reads ecommerce.value but HRE pushes revenue inside items[].price * quantity
  → Lead dedup: generate_lead fires on page load + form submit

Tokens: ~576 / 800 ✓ within budget
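The budget check itself is simple once a token count exists. The real pipeline counts with tiktoken (`cl100k_base`); this sketch substitutes a rough ~4-characters-per-token estimate so it stays dependency-free — the helper names and the heuristic are assumptions, not the pipeline's implementation.

```typescript
const TOKEN_CEILING = 800;

// Rough approximation only; the actual pipeline uses tiktoken cl100k_base.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Reject any rendered system_prompt that blows the token budget.
function withinBudget(systemPrompt: string, ceiling = TOKEN_CEILING): boolean {
  return estimateTokens(systemPrompt) <= ceiling;
}
```

Records failing this guard are rejected rather than truncated, so a bloated AccountState surfaces as a pipeline error instead of silently degrading training data.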
Step 3 — JSONL Record Schema
training/{client_id}/v{N}.jsonl — one record per line · OpenAI fine-tune format

messages[0].role (string): Always "system" — contains rendered AccountState.system_prompt (≤800 tokens)
messages[1].role (string): Always "user" — the problem/question from the autoresearch experiment input
messages[2].role (string): Always "assistant" — the high-scoring solution output (score ≥ 0.75)
metadata.client_id (string): e.g. "hre" — used to filter and route to the correct fine-tuned model
metadata.score (float): Original experiment score 0.0–1.0. Stored for post-training analysis and threshold tuning
metadata.run_id (string): Experiment run identifier. Links back to ExperimentRecord in SQLite for traceability
metadata.account_state_version (semver): AccountState version snapshot at time of experiment. Enables version-gated retraining after major account changes
metadata.sources (string[]): Which data sources were injected: ["gtm", "google_ads", "meta", "memory"]
metadata.chroma_embedding_id (string): Chroma vector ID for this record — enables dedup lookup on future runs
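The schema above maps directly onto a typed record plus one-line serialization. The interface mirrors the fields listed here; the `toJsonlLine` helper name is illustrative. `JSON.stringify` with no spacing guarantees the one-record-per-line invariant that JSONL requires.

```typescript
// One training record in OpenAI chat fine-tune shape, per the schema above.
interface TrainingRecord {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  metadata: {
    client_id: string;
    score: number;
    run_id: string;
    account_state_version: string; // semver snapshot, e.g. "1.2.0"
    sources: string[];             // e.g. ["gtm", "google_ads", "meta", "memory"]
    chroma_embedding_id: string;
  };
}

// Serialize one record to a single JSONL line (no pretty-printing, no newlines).
function toJsonlLine(rec: TrainingRecord): string {
  return JSON.stringify(rec);
}
```

Validity per line is then cheap to test: every emitted line must survive a `JSON.parse` round-trip and contain no embedded newline.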
Step 4 — Quality Gates Before Export
⚖️ Min Examples (50): won't export JSONL until minimum batch size reached
📏 Token Guard (800): rejects records where system prompt exceeds token ceiling
🔁 Dedup Rate (40%): flags run if >40% of new experiments are duplicates
🧪 Holdout Split (10%): 10% withheld per version for post-fine-tune eval scoring
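The first three gates can be expressed as one pure check that returns the list of failures (empty means export may proceed). The `GateInput` shape and `checkGates` name are assumptions for illustration; thresholds are the ones from the panel above.

```typescript
// Aggregate metrics computed over a run, fed into the gate check.
interface GateInput {
  recordCount: number;     // records remaining after dedup
  dedupRate: number;       // duplicates / incoming, 0.0–1.0
  maxPromptTokens: number; // largest system prompt in the batch
}

// Returns a list of gate failures; an empty array means all gates pass.
function checkGates(
  g: GateInput,
  minExamples = 50,
  tokenCeiling = 800,
  maxDedupRate = 0.4
): string[] {
  const failures: string[] = [];
  if (g.recordCount < minExamples)
    failures.push(`min examples: ${g.recordCount} < ${minExamples}`);
  if (g.maxPromptTokens > tokenCeiling)
    failures.push(`token guard: ${g.maxPromptTokens} > ${tokenCeiling}`);
  if (g.dedupRate > maxDedupRate)
    failures.push(`dedup rate: ${(g.dedupRate * 100).toFixed(1)}% > ${maxDedupRate * 100}%`);
  return failures;
}
```

Returning failures rather than throwing lets the CLI print every violated gate before exiting 1, instead of stopping at the first one.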
CLI Output — pnpm pipeline export --client hre
$ pnpm pipeline export --client hre --threshold 0.75

─────────────────────────────────────────────
  JSONL Training Pipeline — HRE
─────────────────────────────────────────────
  Loading experiments from SQLite...          ✓ 247 records found
  Applying score filter (≥0.75)...            ✓ 138 records pass
  Loading AccountState v1.2.0...              ✓ system_prompt: 576 tokens
  Injecting system prompts...                 ✓ 138 records injected
  Running dedup check via Chroma...           ⚠  14 near-duplicates removed (sim ≥0.92)
  Remaining after dedup:                      ✓ 124 records
  Checking quality gates:
    Min examples (50):                        ✓ 124 ≥ 50
    Token guard (800):                        ✓ all records within budget
    Dedup rate (<40%):                        ✓ 10.1% dedup rate
    Holdout split (10%):                      ✓ 12 records withheld → eval set

─────────────────────────────────────────────
  Output:  data/clients/hre/v3.jsonl         112 training records
  Eval:    data/clients/hre/v3_eval.jsonl    12 eval records
  Version: v3  (prev: v2 · delta: +112 new records)
─────────────────────────────────────────────
  ✓ Ready for Phase 4 fine-tune runner
  Next: pnpm fine-tune submit --client hre --version 3
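The `v3 (prev: v2)` line implies version auto-increment by scanning existing export files. A sketch of that logic as a pure function over a directory listing (the `nextVersion` helper is an assumption; the real exporter also reads the filesystem, omitted here):

```typescript
// Pick the next v{N} by scanning existing export filenames.
// Eval files like "v2_eval.jsonl" deliberately don't match the pattern.
function nextVersion(existingFiles: string[]): number {
  const versions = existingFiles
    .map((f) => /^v(\d+)\.jsonl$/.exec(f))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => parseInt(m[1], 10));
  return versions.length > 0 ? Math.max(...versions) + 1 : 1;
}
```

Keeping this pure (listing in, number out) makes the versioning rule testable without touching `data/clients/`.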
    
Claude Code Prompt
claude --dangerously-skip-permissions

# Phase 3: JSONL Training Data Pipeline
# Branch: feature/finetune-pipeline
# Repo: github.com/Organized-AI/gtm-autoresearch

## Context
Phase 1 (ExperimentLogger) and Phase 2 (AccountStateCollector) are complete. Phase 3 builds the pipeline that transforms stored experiments into clean, fine-tune-ready JSONL files. Primary client: HRE.

## Task
Read AGENT-HANDOFF/ and PLANNING/ first. Then:

1. Create packages/training-pipeline/

   ScoreFilter:
   - Query SQLite experiments table: SELECT * WHERE score >= threshold
   - Default threshold: 0.75 (env: SCORE_THRESHOLD)
   - Return ExperimentRecord[] sorted by score DESC

   DedupChecker (Chroma :37777):
   - Embed each experiment's solution text via Chroma embedding
   - Query: find existing records with cosine_sim >= 0.92
   - If match found: skip record, log as duplicate
   - If new: write embedding to Chroma collection "training_{client_id}"

   SystemPromptInjector:
   - Load AccountState from data/clients/{client_id}/account_state_v{N}.json
   - For each passing record: build messages[0] = {role:"system", content: account_state.system_prompt}
   - Token guard: count tokens (use tiktoken cl100k_base), reject if > 800
   - Build full message array: [system, user, assistant]

   QualityGates (assert before export):
   - MIN_EXAMPLES = 50 (configurable)
   - TOKEN_CEILING = 800
   - MAX_DEDUP_RATE = 0.40 (warn + halt if exceeded)
   - HOLDOUT_SPLIT = 0.10

   JSONLExporter:
   - Write to: data/clients/{client_id}/v{N}.jsonl
   - Holdout: data/clients/{client_id}/v{N}_eval.jsonl
   - Each line: {messages:[...], metadata:{client_id, score, run_id, account_state_version, sources, chroma_embedding_id}}
   - Versioning: auto-increment v{N} by checking existing files
   - Delta mode: --delta flag to only export records since last version

2. CLI: pnpm pipeline export --client hre [--threshold 0.75] [--delta]
   - Print progress as shown in spec
   - Exit 0 on success, 1 on gate failure

3. Unit tests:
   - Score filter boundary conditions (0.749 excluded, 0.75 included)
   - Token guard rejects > 800 token system prompts
   - Dedup correctly identifies near-duplicate solutions
   - QualityGate halts export when MIN_EXAMPLES not met
   - JSONL output is valid JSON per line (json-lines format)

4. Update CLAUDE.md with training-pipeline package docs

## Env vars
SCORE_THRESHOLD=0.75
CHROMA_URL=http://localhost:37777
CLIENT_DATA_DIR=./data/clients
MIN_TRAINING_EXAMPLES=50
TOKEN_CEILING=800
HOLDOUT_SPLIT=0.10

## Do NOT build Phase 4 (fine-tune runner) yet.