gtm-autoresearch // feature/finetune-pipeline // Phase 6 of 6

The Flywheel

More client work → more experiments → better training data → smarter model → more accurate answers → more client work
The Compounding Loop
                       CLIENT ENGAGEMENT
                  (GTM fixes, ad optimizations)
                               │
                    ┌──────────▼──────────┐
                    │  AUTORESEARCH LOOP  │
                    │  50-100 experiments │
                    │  per run, scored    │
                    └──────────┬──────────┘
                               │
                    ┌──────────▼──────────┐
                    │  FLYWHEEL WATCHER   │←─────────────────────────┐
                    │  counts new         │                          │
                    │  high-score records │                          │
                    └──────────┬──────────┘                          │
                               │ delta ≥ 20                         │
                    ┌──────────▼──────────┐                          │
                    │  PIPELINE EXPORT    │                          │
                    │  JSONL v{N} appended│                          │
                    │  quality gates pass │                          │
                    └──────────┬──────────┘                          │
                               │                                     │
                    ┌──────────▼──────────┐                          │
                    │  FINE-TUNE RUNNER   │                          │
                    │  Track A or B       │                          │
                    │  eval → promote     │                          │
                    └──────────┬──────────┘                          │
                               │                                     │
                    ┌──────────▼──────────┐                          │
                    │  OPENCLAW BRAIN     │                          │
                    │  new model active   │                          │
                    │  smarter responses  ├──── DRIFT CHECK ─────────┘
                    └─────────────────────┘  (telemetry → regression?)
Watcher Trigger Events
| Event | Condition | Action | Configurable |
|---|---|---|---|
| New examples gate | New high-score experiments since last export ≥ threshold | Trigger JSONL pipeline export → bump version | RETRAIN_DELTA=20 |
| Scheduled run | Cron: after each autoresearch run completes | Check new example count, trigger if gate met | CRON_SCHEDULE |
| Drift detection | Telemetry shows eval score drop > tolerance vs last version | Alert + optionally rollback to last stable version | DRIFT_TOLERANCE=0.05 |
| Account state change | AccountState minor/major version bump detected | Force retrain — old training data used stale account context | RETRAIN_ON_STATE_CHANGE=true |
| Manual trigger | pnpm flywheel run --client hre --force | Bypass delta gate, run full pipeline immediately | always available |
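The new-examples gate reduces to a small pure check over the count since the last export cursor. A minimal TypeScript sketch; the field and function names are illustrative, not the repo's actual API:

```typescript
// Delta gate: trigger a retrain only once enough new high-score
// experiments have accumulated since the last export cursor.
// Names are hypothetical; the real watcher counts rows in the
// Phase 1 SQLite store with run_id > cursor and score >= threshold.

interface GateInput {
  newHighScoreCount: number; // experiments past the cursor above the score threshold
  retrainDelta: number;      // e.g. RETRAIN_DELTA=20
  force?: boolean;           // `pnpm flywheel run --force` bypasses the gate
}

function deltaGate(input: GateInput): boolean {
  if (input.force) return true;
  // fires at exactly retrainDelta, never before
  return input.newHighScoreCount >= input.retrainDelta;
}
```

With the numbers from the transcript below, `deltaGate({ newHighScoreCount: 23, retrainDelta: 20 })` fires; at 19 it does not.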
Watcher Configuration Per Client

Retrain Delta (20): Minimum new high-score experiments before triggering a retrain. Prevents over-training on small batches.

Drift Tolerance (0.05): Maximum acceptable eval-score regression between versions. If a new version's score drops by more than this, it is rolled back to the last stable model automatically.

Max Versions Kept (5): Older model versions are pruned beyond this count to keep the registry lean. Pruned fine-tune files are deleted from OpenAI as well.
Drift Detection — Eval Score Over Versions
HRE model eval score per version. The minimum acceptable score is the previous active score minus 0.05; the regression in v4 triggers auto-rollback.

| Version | Eval score | Status |
|---|---|---|
| v1 | 0.62 | improving |
| v2 | 0.71 | improving |
| v3 | 0.84 | active, within tolerance |
| v4 | 0.72 | regression → rollback |

v4 regression: 0.72 < (0.84 - 0.05 = 0.79) → auto-rolled back to v3. Slack alert sent. v4 archived.
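The rollback rule above is a one-line comparison. A hedged sketch, assuming the guard receives the new version's eval score and the active version's score (names illustrative):

```typescript
// Regression guard: promote a new version only if its eval score does
// not fall more than drift_tolerance below the current active version.
// Sketch only; the real runner also archives the losing version.

type Verdict = "promote" | "rollback";

function regressionGuard(
  newScore: number,
  activeScore: number,
  tolerance: number, // e.g. DRIFT_TOLERANCE=0.05
): Verdict {
  const floor = activeScore - tolerance; // dashed line in the chart
  return newScore >= floor ? "promote" : "rollback";
}
```

For the v4 case: `regressionGuard(0.72, 0.84, 0.05)` yields `"rollback"`, matching the chart.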
Notification Events

Slack Notifications

✓ hre v4 promoted → active (eval: 0.91)
⚠ hre v4 regression (0.72) → rolled back to v3
◎ teleios: 50 examples reached → retrain queued
⟳ Account state v2.0.0 detected → force retrain hre
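The notification format above (emoji, client, event detail) can come from a tiny pure formatter that is easy to unit-test apart from the Slack webhook call. A sketch; the emoji mapping and message shape are assumptions based on the examples, not the repo's actual code:

```typescript
// Slack message formatter: emoji + client_id + event detail.
// Event names follow the flywheel.json "events" list; the exact
// string layout here is illustrative.

type FlywheelEvent = "promote" | "rollback" | "gate_met" | "state_change";

const EMOJI: Record<FlywheelEvent, string> = {
  promote: "✓",
  rollback: "⚠",
  gate_met: "◎",
  state_change: "⟳",
};

function formatSlackMessage(
  event: FlywheelEvent,
  clientId: string,
  detail: string,
): string {
  return `${EMOJI[event]} ${clientId}: ${detail}`;
}
```

The posting side is then a single HTTP POST of `{ "text": message }` to SLACK_WEBHOOK_URL, kept out of the formatter so tests never touch the network.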

Retrain Schedule (HRE)

Last autoresearch run: 2026-04-07
New high-score examples: 23 / 20 ✓
Retrain triggered: yes → v4 queued
Current active: v3 (0.84)
Next check: after next research run
Flywheel Watcher Config — data/clients/{id}/flywheel.json
{
  "client_id": "hre",
  "retrain_delta": 20,               // new examples needed to trigger retrain
  "drift_tolerance": 0.05,           // max acceptable eval regression
  "max_versions": 5,                 // prune older versions beyond this count
  "retrain_on_state_change": true,   // force retrain if AccountState major bumped
  "auto_promote": true,              // auto-promote if eval passes, no manual step
  "auto_rollback": true,             // auto-rollback on regression
  "notifications": {
    "slack_channel": "#organized-ai-ops",
    "events": ["promote", "rollback", "gate_met", "state_change"]
  },
  "last_export_run_id": "exp-2026-04-07-100",  // cursor for delta tracking
  "last_retrain_version": "v3"
}
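The same config expressed as a TypeScript shape with env-var fallbacks for the tunables. A sketch only; the real loader in packages/flywheel/ may differ, and `loadConfig` is a hypothetical name:

```typescript
// Mirrors data/clients/{id}/flywheel.json. Missing fields fall back to
// the documented env vars (RETRAIN_DELTA, DRIFT_TOLERANCE,
// MAX_MODEL_VERSIONS) or their defaults.

interface FlywheelConfig {
  client_id: string;
  retrain_delta: number;
  drift_tolerance: number;
  max_versions: number;
  retrain_on_state_change: boolean;
  auto_promote: boolean;
  auto_rollback: boolean;
  notifications: { slack_channel: string; events: string[] };
  last_export_run_id: string; // cursor for delta tracking
  last_retrain_version: string;
}

function loadConfig(raw: Partial<FlywheelConfig>, clientId: string): FlywheelConfig {
  return {
    client_id: clientId,
    retrain_delta: raw.retrain_delta ?? Number(process.env.RETRAIN_DELTA ?? 20),
    drift_tolerance: raw.drift_tolerance ?? Number(process.env.DRIFT_TOLERANCE ?? 0.05),
    max_versions: raw.max_versions ?? Number(process.env.MAX_MODEL_VERSIONS ?? 5),
    retrain_on_state_change: raw.retrain_on_state_change ?? true,
    auto_promote: raw.auto_promote ?? true,
    auto_rollback: raw.auto_rollback ?? true,
    notifications: raw.notifications ?? { slack_channel: "#organized-ai-ops", events: [] },
    last_export_run_id: raw.last_export_run_id ?? "",
    last_retrain_version: raw.last_retrain_version ?? "",
  };
}
```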
Complete Pipeline — All 6 Phases

gtm-autoresearch · feature/finetune-pipeline

Phase 1: Experiment Logger — SQLite ExperimentRecord schema, score normalization ✓ specced
Phase 2: Account State Collector — GTM + Google Ads + Meta MCP → AccountState JSON ✓ specced
Phase 3: JSONL Pipeline — Score filter, Chroma dedup, system prompt injection, quality gates ✓ specced
Phase 4: Fine-Tune Runner — Track A (OpenAI) + Track B (Ollama M3 Ultra), eval harness ✓ specced
Phase 5: OpenClaw Brain — ClientID middleware, ModelRouter, fallback chain, telemetry ✓ specced
Phase 6: Flywheel — Watcher, drift detection, auto-promote/rollback, Slack notifications ← building
CLI Output — pnpm flywheel start --client hre
$ pnpm flywheel start --client hre

──────────────────────────────────────────────────────
  Flywheel Watcher — HRE
──────────────────────────────────────────────────────
  Config loaded...           ✓ flywheel.json (retrain_delta: 20)
  Last export cursor:        exp-2026-04-07-100
  Checking new examples...   ✓ 23 new high-score experiments
  Delta gate (≥20):          ✓ 23 ≥ 20 — triggering pipeline

  → Running JSONL export...
    Score filter:            ✓ 23 records pass (≥0.75)
    Dedup check:             ✓ 2 removed, 21 kept
    Quality gates:           ✓ all pass
    Output:                  ✓ data/clients/hre/v4.jsonl (21 records)

  → Submitting fine-tune (Track A)...
    Upload:                  ✓ file-xyz789
    Job created:             ✓ ftjob-xyz789
    Training...              [ 15 min ]
    Succeeded:               ✓ ft:gpt-4o-mini-2024-07-18:hre-v4

  → Running eval harness...
    v4 eval score:           ✓ 0.91
    vs v3 (0.84):            ✓ +0.07 improvement
    Regression guard:        ✓ pass

  → Promoting...
    v3 → archived            
    v4 → active              
    OpenClaw reloaded:       ✓ hre → ft:gpt-4o-mini:hre-v4
    Slack notified:          ✓ #organized-ai-ops

  Cursor updated:            exp-2026-04-14-023
──────────────────────────────────────────────────────
  ⟳ Flywheel complete — HRE model v4 active (0.91)
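The transcript's control flow (export, fine-tune, eval, then promote or roll back, with the cursor advancing only on success) can be sketched with injected step functions so the sequencing itself is testable. Every name here is hypothetical; the real orchestrator shells out to the existing training-pipeline, fine-tune-runner, and openclaw packages:

```typescript
// Pipeline orchestration sketch. Steps are injected so unit tests can
// verify ordering and the cursor-advance rule without real API calls.

interface PipelineSteps {
  exportJsonl(): Promise<void>;
  submitFineTune(): Promise<void>;
  evalScore(): Promise<{ newScore: number; activeScore: number }>;
  promote(): Promise<void>;
  rollback(): Promise<void>;
  advanceCursor(): Promise<void>;
}

async function runFlywheel(
  steps: PipelineSteps,
  tolerance: number,
): Promise<"promoted" | "rolled_back"> {
  await steps.exportJsonl();
  await steps.submitFineTune();
  const { newScore, activeScore } = await steps.evalScore();
  if (newScore >= activeScore - tolerance) {
    await steps.promote();
    await steps.advanceCursor(); // cursor moves only after a successful run
    return "promoted";
  }
  await steps.rollback(); // cursor is NOT advanced on regression
  return "rolled_back";
}
```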
  
Claude Code Prompt
claude --dangerously-skip-permissions

# Phase 6: Flywheel Automation
# Branch: feature/finetune-pipeline
# Repo: github.com/Organized-AI/gtm-autoresearch

## Context
All prior phases complete. Phase 6 closes the loop — a watcher that triggers
the full pipeline automatically after each autoresearch run when enough new
high-quality examples have accumulated.

## Task
Read AGENT-HANDOFF/ and PLANNING/ first. Then:

1. Create packages/flywheel/

   FlywheelWatcher:
   - Load flywheel.json per client
   - Track cursor: last_export_run_id (last SQLite run_id processed)
   - On run: COUNT experiments WHERE run_id > cursor AND score >= threshold
   - If count >= retrain_delta: trigger full pipeline
   - Update cursor to latest run_id after trigger

   Pipeline Orchestrator (calls existing packages in sequence):
   1. training-pipeline export --client {id} --delta
   2. fine-tune-runner submit --client {id} --version {N} --track {pref}
   3. fine-tune-runner eval --client {id} --version {N}
   4. If eval passes: openclaw register + promote
   5. If eval regresses: rollback + alert

   DriftDetector:
   - After each promotion, query telemetry SQLite (from Phase 5)
   - Compare live response quality vs eval score baseline
   - If delta > drift_tolerance over 24h window: trigger alert
   - Optional: schedule periodic re-eval against holdout set

   AccountStateWatcher:
   - Watch AccountState version file for major/minor bumps
   - On major bump: force full retrain (bypass delta gate)
   - On minor bump: log + retrain at next natural cycle

   NotificationService:
   - Slack webhook to #organized-ai-ops
   - Events: promote, rollback, gate_met, regression, state_change
   - Message format: emoji + client_id + event + key metric

   VersionPruner:
   - After each promote, check version count
   - If > max_versions: delete oldest archived version
   - For Track A: call OpenAI DELETE /v1/files/{file_id}
   - For Track B: call ollama rm {client_id}-client:v{old}

2. flywheel.json schema per client (data/clients/{id}/flywheel.json):
   retrain_delta, drift_tolerance, max_versions, retrain_on_state_change,
   auto_promote, auto_rollback, notifications{slack_channel, events[]},
   last_export_run_id (cursor), last_retrain_version

3. CLI:
   pnpm flywheel start --client hre          # run once now
   pnpm flywheel watch --client hre          # watch mode (post-research hook)
   pnpm flywheel status --client hre         # show current state
   pnpm flywheel run --client hre --force    # bypass delta gate

4. Hook into autoresearch run completion:
   Add post-run hook that calls: pnpm flywheel watch --client {client_id}

5. Unit tests:
   - Delta gate: triggers at exactly retrain_delta, not before
   - Rollback: activates when eval drops below tolerance
   - Cursor: advances only after successful pipeline run
   - Pruner: removes correct version when count > max_versions
   - NotificationService: formats Slack message correctly

## Env vars:
CLIENT_DATA_DIR=./data/clients
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
RETRAIN_DELTA=20
DRIFT_TOLERANCE=0.05
MAX_MODEL_VERSIONS=5

## This is the final phase. After completion:
## - Run full integration test: all 6 phases end-to-end with HRE
## - Update README with complete pipeline diagram
## - Tag: v1.0.0-finetune-pipeline
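The VersionPruner described in the prompt can keep its selection logic pure, separate from the OpenAI file deletion and `ollama rm` side effects, which makes the "removes correct version" unit test trivial. A sketch with hypothetical types:

```typescript
// Selection step for VersionPruner: once the registry holds more than
// max_versions entries, the oldest archived versions are chosen for
// deletion. The active version is never pruned. Names illustrative.

interface ModelVersion {
  version: number; // v1 -> 1, v2 -> 2, ...
  status: "active" | "archived";
}

function versionsToPrune(versions: ModelVersion[], maxVersions: number): ModelVersion[] {
  const sorted = [...versions].sort((a, b) => a.version - b.version);
  const excess = sorted.length - maxVersions;
  if (excess <= 0) return [];
  // oldest archived versions go first; active is excluded entirely
  return sorted.filter((v) => v.status === "archived").slice(0, excess);
}
```

The caller would then map each selected version to the right deletion call per track (OpenAI file delete for Track A, `ollama rm` for Track B).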