gtm-autoresearch // feature/finetune-pipeline // Phase 3 of 6

JSONL Training Data Pipeline

03 · Filter · Dedup · Inject · Export
Client: HRE
System Overview
Phase 1 · Experiment Logger · complete
Phase 2 · Account State · complete
Phase 3 · JSONL Pipeline · building ← NOW
Phase 4 · Fine-Tune Runner · next
Phase 5 · OpenClaw Brain · planned
Phase 6 · Flywheel · planned
Step 1 — Score Filter

Score Distribution — typical autoresearch run · 100 experiments

0.9–1.0:  ~18
0.8–0.9:  ~24
0.75–0.8: ~13
(below threshold)
0.6–0.75: ~28
0.0–0.6:  ~17

✓ ~55 records kept per run · threshold configurable per client
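The score filter above can be sketched as a small pure function. The `ExperimentRecord` shape and `filterByScore` name are illustrative, not the repo's actual API; the boundary behavior (0.749 excluded, 0.75 included) matches the unit-test spec later in this doc.

```typescript
// Illustrative Step 1 score filter (record shape and names are assumptions).
interface ExperimentRecord {
  runId: string;
  score: number;    // 0.0–1.0 from the autoresearch evaluator
  solution: string;
}

// Keep records at or above the threshold, highest-scoring first.
function filterByScore(
  records: ExperimentRecord[],
  threshold = 0.75
): ExperimentRecord[] {
  return records
    .filter((r) => r.score >= threshold)
    .sort((a, b) => b.score - a.score);
}
```

In the real pipeline this would sit in front of the SQLite query (`SELECT * WHERE score >= threshold`); the pure-function form makes the boundary condition trivially unit-testable.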

Deduplication

Cosine similarity via Chroma :37777 · threshold 0.92
Incoming experiment vs existing training set:

"PMAX shows $0 conversion value — sGTM items mapping issue"
  vs "PMAX revenue zero — fix sGTM ecommerce.items array"
  → sim: 0.97 — DROP

"Add to Cart firing on page load instead of button click"
  vs "Purchase tag misfiring on order confirmation reload"
  → sim: 0.31 — KEEP
Prevents the model from overfitting to repeated variations of the same problem. New experiments are embedded and compared against the existing set before being written.
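The dedup decision reduces to a cosine-similarity comparison over embedding vectors. A minimal sketch, standing in for the Chroma `:37777` query (the `isDuplicate` helper and in-memory vector list are assumptions; the real check queries the `training_{client_id}` collection):

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSim(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const DEDUP_THRESHOLD = 0.92; // matches the pipeline's configured threshold

// True when the incoming embedding is a near-duplicate of any existing one.
function isDuplicate(incoming: number[], existing: number[][]): boolean {
  return existing.some((e) => cosineSim(incoming, e) >= DEDUP_THRESHOLD);
}
```

Parallel vectors score 1.0 (drop), orthogonal ones 0.0 (keep); 0.92 sits close enough to 1.0 that only rewordings of the same problem trip it.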
Step 2 — System Prompt Injection
system_prompt field anatomy — HRE account · AccountState v1.2.0 · target: ≤800 tokens

GTM Context
  GTM: GTM-XXXXXXX | ws_4 | 31 tags, 18 triggers, 22 variables
  Key tags: GA4 Config, GA4 - Purchase, AW - All Purchases, sGTM Bridge, Custom HTML - DataLayer Push
  dataLayer: page_view, view_item, add_to_cart, purchase, generate_lead
  DLV: ecommerce.value, ecommerce.items[].item_id, ecommerce.items[].price

Google Ads
  Google Ads: 123-456-7890 | USD
  Campaigns: PMAX - Core, Brand - HRE (SEARCH), Retargeting (DISPLAY)
  Conversions: All Purchases (value: ecommerce.value), Lead Submit (count)
  Enhanced Conversions: ENABLED

Meta / CAPI
  Meta Pixel: HRE-pixel-id | act_12345
  Events (browser+CAPI): PageView, ViewContent, AddToCart, Purchase
  CAPI match rate: ~68% | Stape: cnt_abc123

Memory
  Cart abandonment tracking fixed 2025-11
  dataLayer conflict resolved 2025-09
  PMAX value fix deployed 2026-01
  Lead form dedup logic added 2026-03

Known Issues
  → PMAX $0 value: sGTM reads ecommerce.value but HRE pushes revenue inside items[].price * quantity
  → Lead dedup: generate_lead fires on page load + form submit

Tokens: ~576 / 800 ✓ within budget
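The budget check itself is simple once a token count exists. The real pipeline counts with tiktoken (`cl100k_base`); this sketch substitutes a rough ~4-characters-per-token estimate so it stays dependency-free — the helper names and the heuristic are assumptions, not the pipeline's implementation.

```typescript
const TOKEN_CEILING = 800;

// Rough approximation only; the actual pipeline uses tiktoken cl100k_base.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Reject any rendered system_prompt that blows the token budget.
function withinBudget(systemPrompt: string, ceiling = TOKEN_CEILING): boolean {
  return estimateTokens(systemPrompt) <= ceiling;
}
```

Records failing this guard are rejected rather than truncated, so a bloated AccountState surfaces as a pipeline error instead of silently degrading training data.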
Step 3 — JSONL Record Schema
training/{client_id}/v{N}.jsonl — one record per line · OpenAI fine-tune format

messages[0].role (string): Always "system" — contains rendered AccountState.system_prompt (≤800 tokens)
messages[1].role (string): Always "user" — the problem/question from the autoresearch experiment input
messages[2].role (string): Always "assistant" — the high-scoring solution output (score ≥ 0.75)
metadata.client_id (string): e.g. "hre" — used to filter and route to the correct fine-tuned model
metadata.score (float): Original experiment score 0.0–1.0. Stored for post-training analysis and threshold tuning
metadata.run_id (string): Experiment run identifier. Links back to ExperimentRecord in SQLite for traceability
metadata.account_state_version (semver): AccountState version snapshot at time of experiment. Enables version-gated retraining after major account changes
metadata.sources (string[]): Which data sources were injected: ["gtm", "google_ads", "meta", "memory"]
metadata.chroma_embedding_id (string): Chroma vector ID for this record — enables dedup lookup on future runs
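The schema above maps directly onto a typed record plus one-line serialization. The interface mirrors the fields listed here; the `toJsonlLine` helper name is illustrative. `JSON.stringify` with no spacing guarantees the one-record-per-line invariant that JSONL requires.

```typescript
// One training record in OpenAI chat fine-tune shape, per the schema above.
interface TrainingRecord {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  metadata: {
    client_id: string;
    score: number;
    run_id: string;
    account_state_version: string; // semver snapshot, e.g. "1.2.0"
    sources: string[];             // e.g. ["gtm", "google_ads", "meta", "memory"]
    chroma_embedding_id: string;
  };
}

// Serialize one record to a single JSONL line (no pretty-printing, no newlines).
function toJsonlLine(rec: TrainingRecord): string {
  return JSON.stringify(rec);
}
```

Validity per line is then cheap to test: every emitted line must survive a `JSON.parse` round-trip and contain no embedded newline.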
Step 4 — Quality Gates Before Export
⚖️ Min Examples (50): won't export JSONL until minimum batch size reached
📏 Token Guard (800): rejects records where system prompt exceeds token ceiling
🔁 Dedup Rate (40%): flags run if >40% of new experiments are duplicates
🧪 Holdout Split (10%): 10% withheld per version for post-fine-tune eval scoring
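The first three gates can be expressed as one pure check that returns the list of failures (empty means export may proceed). The `GateInput` shape and `checkGates` name are assumptions for illustration; thresholds are the ones from the panel above.

```typescript
// Aggregate metrics computed over a run, fed into the gate check.
interface GateInput {
  recordCount: number;     // records remaining after dedup
  dedupRate: number;       // duplicates / incoming, 0.0–1.0
  maxPromptTokens: number; // largest system prompt in the batch
}

// Returns a list of gate failures; an empty array means all gates pass.
function checkGates(
  g: GateInput,
  minExamples = 50,
  tokenCeiling = 800,
  maxDedupRate = 0.4
): string[] {
  const failures: string[] = [];
  if (g.recordCount < minExamples)
    failures.push(`min examples: ${g.recordCount} < ${minExamples}`);
  if (g.maxPromptTokens > tokenCeiling)
    failures.push(`token guard: ${g.maxPromptTokens} > ${tokenCeiling}`);
  if (g.dedupRate > maxDedupRate)
    failures.push(`dedup rate: ${(g.dedupRate * 100).toFixed(1)}% > ${maxDedupRate * 100}%`);
  return failures;
}
```

Returning failures rather than throwing lets the CLI print every violated gate before exiting 1, instead of stopping at the first one.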
CLI Output — pnpm pipeline export --client hre
$ pnpm pipeline export --client hre --threshold 0.75

─────────────────────────────────────────────
  JSONL Training Pipeline — HRE
─────────────────────────────────────────────
  Loading experiments from SQLite...          ✓ 247 records found
  Applying score filter (≥0.75)...            ✓ 138 records pass
  Loading AccountState v1.2.0...              ✓ system_prompt: 576 tokens
  Injecting system prompts...                 ✓ 138 records injected
  Running dedup check via Chroma...           ⚠  14 near-duplicates removed (sim ≥0.92)
  Remaining after dedup:                      ✓ 124 records
  Checking quality gates:
    Min examples (50):                        ✓ 124 ≥ 50
    Token guard (800):                        ✓ all records within budget
    Dedup rate (<40%):                        ✓ 10.1% dedup rate
    Holdout split (10%):                      ✓ 12 records withheld → eval set

─────────────────────────────────────────────
  Output:  data/clients/hre/v3.jsonl         112 training records
  Eval:    data/clients/hre/v3_eval.jsonl    12 eval records
  Version: v3  (prev: v2 · delta: +112 new records)
─────────────────────────────────────────────
  ✓ Ready for Phase 4 fine-tune runner
  Next: pnpm fine-tune submit --client hre --version 3
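The `v3 (prev: v2)` line implies version auto-increment by scanning existing export files. A sketch of that logic as a pure function over a directory listing (the `nextVersion` helper is an assumption; the real exporter also reads the filesystem, omitted here):

```typescript
// Pick the next v{N} by scanning existing export filenames.
// Eval files like "v2_eval.jsonl" deliberately don't match the pattern.
function nextVersion(existingFiles: string[]): number {
  const versions = existingFiles
    .map((f) => /^v(\d+)\.jsonl$/.exec(f))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => parseInt(m[1], 10));
  return versions.length > 0 ? Math.max(...versions) + 1 : 1;
}
```

Keeping this pure (listing in, number out) makes the versioning rule testable without touching `data/clients/`.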
    
Claude Code Prompt
claude --dangerously-skip-permissions

# Phase 3: JSONL Training Data Pipeline
# Branch: feature/finetune-pipeline
# Repo: github.com/Organized-AI/gtm-autoresearch

## Context
Phase 1 (ExperimentLogger) and Phase 2 (AccountStateCollector) are complete. Phase 3 builds the pipeline that transforms stored experiments into clean, fine-tune-ready JSONL files. Primary client: HRE.

## Task
Read AGENT-HANDOFF/ and PLANNING/ first. Then:

1. Create packages/training-pipeline/

   ScoreFilter:
   - Query SQLite experiments table: SELECT * WHERE score >= threshold
   - Default threshold: 0.75 (env: SCORE_THRESHOLD)
   - Return ExperimentRecord[] sorted by score DESC

   DedupChecker (Chroma :37777):
   - Embed each experiment's solution text via Chroma embedding
   - Query: find existing records with cosine_sim >= 0.92
   - If match found: skip record, log as duplicate
   - If new: write embedding to Chroma collection "training_{client_id}"

   SystemPromptInjector:
   - Load AccountState from data/clients/{client_id}/account_state_v{N}.json
   - For each passing record: build messages[0] = {role:"system", content: account_state.system_prompt}
   - Token guard: count tokens (use tiktoken cl100k_base), reject if > 800
   - Build full message array: [system, user, assistant]

   QualityGates (assert before export):
   - MIN_EXAMPLES = 50 (configurable)
   - TOKEN_CEILING = 800
   - MAX_DEDUP_RATE = 0.40 (warn + halt if exceeded)
   - HOLDOUT_SPLIT = 0.10

   JSONLExporter:
   - Write to: data/clients/{client_id}/v{N}.jsonl
   - Holdout: data/clients/{client_id}/v{N}_eval.jsonl
   - Each line: {messages:[...], metadata:{client_id, score, run_id, account_state_version, sources, chroma_embedding_id}}
   - Versioning: auto-increment v{N} by checking existing files
   - Delta mode: --delta flag to only export records since last version

2. CLI: pnpm pipeline export --client hre [--threshold 0.75] [--delta]
   - Print progress as shown in spec
   - Exit 0 on success, 1 on gate failure

3. Unit tests:
   - Score filter boundary conditions (0.749 excluded, 0.75 included)
   - Token guard rejects > 800 token system prompts
   - Dedup correctly identifies near-duplicate solutions
   - QualityGate halts export when MIN_EXAMPLES not met
   - JSONL output is valid JSON per line (json-lines format)

4. Update CLAUDE.md with training-pipeline package docs

## Env vars
SCORE_THRESHOLD=0.75
CHROMA_URL=http://localhost:37777
CLIENT_DATA_DIR=./data/clients
MIN_TRAINING_EXAMPLES=50
TOKEN_CEILING=800
HOLDOUT_SPLIT=0.10

## Do NOT build Phase 4 (fine-tune runner) yet.