gtm-autoresearch // feature/finetune-pipeline // Phase 4 of 6

Fine-Tune Runner — Dual Track

Two parallel execution paths — the OpenAI API for fast cloud fine-tunes, and Ollama/MLX for private local fine-tunes on the M3 Ultra. Either track is triggered automatically when a new JSONL version is ready.

Two Execution Tracks
Track A — OpenAI Fine-Tune API
Cloud · gpt-4o-mini · $5–15 per client

Target model:   gpt-4o-mini
Trigger:        JSONL v{N} ready + ≥50 examples
API flow:       Upload file → create job → poll → register model ID
Infra:          OpenAI cloud — no local GPU needed
Data egress:    Training data leaves network
Best for:       New clients while data accumulates, fast MVP
Est. cost:      ~$5–15 per training run (100 examples)
Track B — Ollama Local (NoClaw)
M3 Ultra · :11434 · zero data egress

Target model:   Llama 3.1 8B or Mistral
Trigger:        Same gate — JSONL v{N} + ≥50 examples
API flow:       JSONL → Modelfile → ollama create → register
Infra:          M3 Ultra 512GB (Tailscale 100.x.x.x) via NoClaw
Data egress:    Zero — stays on Tailscale network
Best for:       Sensitive clients (Teleios), long-term production
Est. cost:      Electricity only — ~$0 marginal cost
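Both tracks share the same trigger gate: a JSONL version with at least 50 examples. A minimal sketch of that gate, counting one record per non-empty JSONL line (the function names here are ours, not the repo's):

```typescript
// Shared trigger gate for both tracks: a JSONL version becomes eligible
// for fine-tuning once it holds at least MIN_EXAMPLES training records.
const MIN_EXAMPLES = 50;

// Count records in a JSONL payload (one JSON object per non-empty line).
function countJsonlRecords(jsonl: string): number {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0).length;
}

// Gate check: ready when the record count meets the threshold.
function readyForFineTune(jsonl: string, min: number = MIN_EXAMPLES): boolean {
  return countJsonlRecords(jsonl) >= min;
}

// Example: a 3-record file stays below the gate.
const sample = ['{"messages":[]}', '{"messages":[]}', '{"messages":[]}'].join("\n");
```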
Track Selection Matrix

Scenario                     | Track A (OpenAI)        | Track B (Ollama)        | Recommendation
New client, <200 examples    | ✓ Good baseline fast    | Possible, lower quality | Track A
Mature client, 500+ examples | ✓ Strong performance    | ✓ Strong performance    | Track B (private)
Healthcare / sensitive data  | ✗ Data egress risk      | ✓ Zero egress           | Track B only
Rapid iteration / testing    | ✓ Fast turnaround       | Slower iteration cycle  | Track A
Production serving at scale  | Per-token cost accrues  | ✓ Fixed infra cost      | Track B
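The matrix collapses into a small selection helper. A sketch under simplifying assumptions: sensitive data always forces Track B, and the ClientProfile fields are ours, not the repo's config schema:

```typescript
type Track = "a" | "b";

interface ClientProfile {
  exampleCount: number;    // training examples available so far
  sensitiveData: boolean;  // e.g. healthcare: zero-egress requirement
  rapidIteration: boolean; // testing / MVP phase
}

// Encode the selection matrix: sensitivity wins outright, then iteration
// speed, then client maturity (500+ examples favors private infra).
function selectTrack(p: ClientProfile): Track {
  if (p.sensitiveData) return "b";       // Track B only: zero egress
  if (p.rapidIteration) return "a";      // fast turnaround on OpenAI
  if (p.exampleCount >= 500) return "b"; // mature client, fixed infra cost
  return "a";                            // new client: good baseline fast
}
```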
Track A — OpenAI API Flow
01 Upload
POST files endpoint with v{N}.jsonl · purpose: fine-tune · returns file_id
02 Create
POST fine_tuning/jobs · model: gpt-4o-mini-2024-07-18 · training_file: file_id · suffix: hre-v3
03 Poll
GET fine_tuning/jobs/{job_id} every 60s · status: queued → running → succeeded · emit progress events
04 Register
On success: write fine_tuned_model ID to data/clients/hre/model_registry.json · set active: true · run eval harness
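The four steps map onto the OpenAI Files and fine-tuning endpoints. A sketch that builds the job-creation and polling requests as plain data so their shape can be inspected without network access; the JobRequest type and helper names are ours, and the multipart file upload of step 01 is omitted:

```typescript
interface JobRequest {
  method: "POST" | "GET";
  path: string;
  body?: Record<string, string>;
}

// 02 Create: POST /v1/fine_tuning/jobs with the uploaded file's ID and a
// per-client suffix so the resulting model name is self-describing.
function createJobRequest(fileId: string, client: string, version: number): JobRequest {
  return {
    method: "POST",
    path: "/v1/fine_tuning/jobs",
    body: {
      model: "gpt-4o-mini-2024-07-18",
      training_file: fileId,
      suffix: `${client}-v${version}`,
    },
  };
}

// 03 Poll: GET /v1/fine_tuning/jobs/{job_id}; the caller repeats every
// 60 s until the status reaches "succeeded" or a terminal failure.
function pollJobRequest(jobId: string): JobRequest {
  return { method: "GET", path: `/v1/fine_tuning/jobs/${jobId}` };
}
```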
Track B — Ollama Modelfile + Local Flow
FROM llama3.1:8b

# System prompt baked from AccountState.system_prompt
SYSTEM """
You are a conversion tracking expert for HRE.
GTM: GTM-XXXXXXX | ws_4 | 31 tags, 18 triggers, 22 variables
Key tags: GA4 Config, GA4 - Purchase, AW - All Purchases, sGTM Bridge
dataLayer: page_view, add_to_cart, purchase, generate_lead
DLV: ecommerce.value, ecommerce.items[].price
Google Ads: 123-456-7890 | PMAX - Core, Brand - HRE, Retargeting
Conversions: All Purchases (value: ecommerce.value), Lead Submit
Meta Pixel: HRE-pixel-id | CAPI match: ~68% | Stape: cnt_abc123
Known: PMAX $0 value — sGTM reads top-level ecommerce.value but HRE pushes revenue inside items[].price * quantity
"""

# Fine-tune parameters
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
PARAMETER stop "<|eot_id|>"

# Model metadata
LABEL client_id="hre"
LABEL version="v3"
LABEL account_state_version="1.2.0"
LABEL training_examples="112"
LABEL created="2026-04-07"
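This Modelfile is generated from AccountState rather than written by hand. A minimal TypeScript sketch of that templating step; the AccountState interface below is illustrative, not the repo's actual type, and only a subset of the PARAMETER/LABEL lines is emitted:

```typescript
interface AccountState {
  clientId: string;        // e.g. "hre"
  version: string;         // e.g. "v3"
  systemPrompt: string;    // baked into the SYSTEM block verbatim
  trainingExamples: number;
}

// Render an Ollama Modelfile from account state: base model, baked
// system prompt, decoding parameters, and provenance labels.
function renderModelfile(s: AccountState, base: string = "llama3.1:8b"): string {
  return [
    `FROM ${base}`,
    `SYSTEM """`,
    s.systemPrompt,
    `"""`,
    `PARAMETER temperature 0.2`,
    `PARAMETER num_ctx 4096`,
    `LABEL client_id="${s.clientId}"`,
    `LABEL version="${s.version}"`,
    `LABEL training_examples="${s.trainingExamples}"`,
  ].join("\n");
}
```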
Model Registry — data/clients/{id}/model_registry.json

Client  | Version | Track | Eval Score | Model ID              | Status
hre     | v3      | A     | 0.84       | ft:gpt-4o-mini:hre-v3 | ● active
hre     | v2      | A     | 0.71       | ft:gpt-4o-mini:hre-v2 | archived
teleios | v1      | B     | pending    | teleios-client:v1     | ◐ eval
rtt     | —       | —     | —          | awaiting 50 examples  | queued
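Promotion and rollback over this registry are simple state flips: exactly one version is active per client at a time. A sketch, assuming in-memory entries shaped like the table's columns (the helper names are ours):

```typescript
interface RegistryEntry {
  version: string;           // "v1", "v2", ...
  track: "A" | "B";
  model_id: string;
  eval_score: number | null; // null while eval is pending
  active: boolean;
}

// promote(): activate the given version and archive every other entry,
// so the invariant "one active version" holds by construction.
function promote(registry: RegistryEntry[], version: string): RegistryEntry[] {
  return registry.map((e) => ({ ...e, active: e.version === version }));
}

// rollback(): reactivating a prior version is the same flip in reverse.
const rollback = promote;
```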
Eval Harness — Auto-Score Before Promoting

Problem Recall

Feed the 10-question holdout set. Score: did the model identify the correct root cause?

84%

Solution Accuracy

Compare model solution to known-correct fix. Scored by cosine similarity to reference answer.

0.87 sim

Regression Guard

v{N} eval score must exceed v{N-1} by margin. Rollback triggered if regression detected.

+0.13 Δ
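A sketch of the scoring and the guard, assuming bag-of-words cosine similarity (the production harness may embed responses instead) and the 0.05 tolerance from EVAL_REGRESSION_TOLERANCE:

```typescript
// Bag-of-words cosine similarity: a stand-in for the harness's scoring.
function cosineSim(a: string, b: string): number {
  const count = (s: string) => {
    const m = new Map<string, number>();
    for (const w of s.toLowerCase().split(/\s+/).filter(Boolean)) {
      m.set(w, (m.get(w) ?? 0) + 1);
    }
    return m;
  };
  const va = count(a);
  const vb = count(b);
  let dot = 0, na = 0, nb = 0;
  for (const [w, n] of va) { dot += n * (vb.get(w) ?? 0); na += n * n; }
  for (const [, n] of vb) nb += n * n;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Regression guard: v{N} may be promoted only if its score does not fall
// more than `tolerance` below v{N-1}'s score; otherwise abort promotion.
function passesRegressionGuard(newScore: number, prevScore: number, tolerance = 0.05): boolean {
  return newScore >= prevScore - tolerance;
}
```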
CLI Output — pnpm fine-tune submit --client hre --version 3
$ pnpm fine-tune submit --client hre --version 3 --track a

──────────────────────────────────────────────────────
  Fine-Tune Runner — HRE v3 · Track A (OpenAI)
──────────────────────────────────────────────────────
  Loading JSONL...           ✓ data/clients/hre/v3.jsonl (112 records)
  Uploading to OpenAI...     ✓ file_id: file-abc123xyz
  Creating fine-tune job...  ✓ job_id: ftjob-abc123
  Model suffix:              hre-v3

  Polling status...
    [00:00] queued
    [02:14] running — step 12/112
    [08:31] running — step 67/112
    [14:02] running — step 112/112
    [15:44] succeeded

  Fine-tuned model:          ft:gpt-4o-mini-2024-07-18:hre-v3

──────────────────────────────────────────────────────
  Running eval harness (holdout: v3_eval.jsonl)...
    Problem recall:          ✓ 84%
    Solution accuracy:       ✓ 0.87 cosine sim
    Regression vs v2:        ✓ +0.13 improvement

──────────────────────────────────────────────────────
  Registering model...       ✓ model_registry.json updated
  Promoting to active...     ✓ hre → ft:gpt-4o-mini:hre-v3
  Archiving v2...            ✓ archived
──────────────────────────────────────────────────────
  ✓ HRE v3 live — ready for Phase 5 OpenClaw routing
  Next: pnpm openclaw register --client hre --model ft:gpt-4o-mini:hre-v3
    
Claude Code Prompt
claude --dangerously-skip-permissions

# Phase 4: Fine-Tune Runner (Dual Track)
# Branch: feature/finetune-pipeline
# Repo: github.com/Organized-AI/gtm-autoresearch

## Context

Phases 1–3 complete. Phase 4 builds the runner that takes a versioned JSONL file and submits it to either:

Track A: OpenAI fine-tuning API (cloud, gpt-4o-mini)
Track B: Ollama on M3 Ultra via NoClaw :11434 (local, Llama 3.1 8B)

## Task

Read AGENT-HANDOFF/ and PLANNING/ first. Then:

1. Create packages/fine-tune-runner/

   FineTuneRunner interface:
   - submit(client_id, version, track): Promise
   - poll(job_id): Promise
   - evaluate(client_id, version): Promise
   - register(client_id, version, model_id, track): void
   - promote(client_id, version): void (sets active: true, archives prev)

   TrackA (OpenAI):
   - Upload JSONL: POST /v1/files (purpose: fine-tune)
   - Create job: POST /v1/fine_tuning/jobs
       model: "gpt-4o-mini-2024-07-18"
       suffix: "{client_id}-v{N}"
   - Poll: GET /v1/fine_tuning/jobs/{id} every 60s
   - On succeeded: extract fine_tuned_model string

   TrackB (Ollama via NoClaw):
   - Generate Modelfile from AccountState.system_prompt + base model
   - SSH/Tailscale to NoClaw host (100.86.248.8 or M3 Ultra)
   - Run: ollama create {client_id}-client:v{N} -f ./Modelfile
   - Register: ollama list to confirm creation
   - Model name pattern: "{client_id}-client:v{N}"

   EvalHarness:
   - Load data/clients/{client_id}/v{N}_eval.jsonl (holdout set)
   - For each eval record: call the new model with the user message
   - Score: cosine_sim(model_response, expected_assistant_response)
   - Aggregate: problem_recall (exact match %), solution_accuracy (avg sim)
   - Regression guard: new eval_score must be ≥ prev_version score - 0.05
   - On regression: abort promotion, alert

   ModelRegistry:
   - File: data/clients/{client_id}/model_registry.json
   - Schema: [{version, track, model_id, eval_score, active, created_at}]
   - promote(): set active=true on new, active=false on all prev versions
   - rollback(version): reactivate a prior version

2. CLI:
   pnpm fine-tune submit --client hre --version 3 --track a
   pnpm fine-tune submit --client teleios --version 1 --track b
   pnpm fine-tune eval --client hre --version 3
   pnpm fine-tune rollback --client hre --version 2

3. Unit tests:
   - TrackA: mock OpenAI API, assert correct file upload + job creation
   - TrackB: mock Ollama CLI output, assert Modelfile generation
   - EvalHarness: assert regression guard triggers correctly
   - ModelRegistry: promote/rollback state transitions

## Env vars:

OPENAI_API_KEY=sk-...
OLLAMA_HOST=http://100.86.248.8:11434
CLIENT_DATA_DIR=./data/clients
EVAL_REGRESSION_TOLERANCE=0.05

## Track selection per client (data/clients/{id}/config.json):

hre: track_preference: "a" (MVP phase)
teleios: track_preference: "b" (sensitive data, no egress)

## Do NOT build Phase 5 (OpenClaw integration) yet.
← Phase 3: JSONL Pipeline · gtm-autoresearch-docs.pages.dev · All Docs →