gtm-autoresearch // feature/finetune-pipeline // Phase 4 of 6

Fine-Tune Runner — Dual Track

Two parallel execution paths — the OpenAI API for fast cloud fine-tunes, and Ollama/MLX for private local fine-tunes on the M3 Ultra. Either track is triggered automatically when a new JSONL version is ready.

Two Execution Tracks
Track A — OpenAI Fine-Tune API
Cloud · gpt-4o-mini · $5–15 per client

Target model:   gpt-4o-mini
Trigger:        JSONL v{N} ready + ≥50 examples
API flow:       Upload file → create job → poll → register model ID
Infra:          OpenAI cloud — no local GPU needed
Data egress:    Training data leaves network
Best for:       New clients while data accumulates, fast MVP
Est. cost:      ~$5–15 per training run (100 examples)
Track B — Ollama Local (NoClaw)
M3 Ultra · :11434 · zero data egress

Target model:   Llama 3.1 8B or Mistral
Trigger:        Same gate — JSONL v{N} + ≥50 examples
API flow:       JSONL → Modelfile → ollama create → register
Infra:          M3 Ultra 512GB (Tailscale 100.x.x.x) via NoClaw
Data egress:    Zero — stays on Tailscale network
Best for:       Sensitive clients (Teleios), long-term production
Est. cost:      Electricity only — ~$0 marginal cost
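Both tracks share the same trigger gate: a JSONL version with at least 50 examples. A minimal sketch of that gate, counting one record per non-empty JSONL line (the function names here are ours, not the repo's):

```typescript
// Shared trigger gate for both tracks: a JSONL version becomes eligible
// for fine-tuning once it holds at least MIN_EXAMPLES training records.
const MIN_EXAMPLES = 50;

// Count records in a JSONL payload (one JSON object per non-empty line).
function countJsonlRecords(jsonl: string): number {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0).length;
}

// Gate check: ready when the record count meets the threshold.
function readyForFineTune(jsonl: string, min: number = MIN_EXAMPLES): boolean {
  return countJsonlRecords(jsonl) >= min;
}

// Example: a 3-record file stays below the gate.
const sample = ['{"messages":[]}', '{"messages":[]}', '{"messages":[]}'].join("\n");
```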
Track Selection Matrix

Scenario                     | Track A (OpenAI)        | Track B (Ollama)        | Recommendation
New client, <200 examples    | ✓ Good baseline fast    | Possible, lower quality | Track A
Mature client, 500+ examples | ✓ Strong performance    | ✓ Strong performance    | Track B (private)
Healthcare / sensitive data  | ✗ Data egress risk      | ✓ Zero egress           | Track B only
Rapid iteration / testing    | ✓ Fast turnaround       | Slower iteration cycle  | Track A
Production serving at scale  | Per-token cost accrues  | ✓ Fixed infra cost      | Track B
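The matrix collapses into a small selection helper. A sketch under simplifying assumptions: sensitive data always forces Track B, and the ClientProfile fields are ours, not the repo's config schema:

```typescript
type Track = "a" | "b";

interface ClientProfile {
  exampleCount: number;    // training examples available so far
  sensitiveData: boolean;  // e.g. healthcare: zero-egress requirement
  rapidIteration: boolean; // testing / MVP phase
}

// Encode the selection matrix: sensitivity wins outright, then iteration
// speed, then client maturity (500+ examples favors private infra).
function selectTrack(p: ClientProfile): Track {
  if (p.sensitiveData) return "b";       // Track B only: zero egress
  if (p.rapidIteration) return "a";      // fast turnaround on OpenAI
  if (p.exampleCount >= 500) return "b"; // mature client, fixed infra cost
  return "a";                            // new client: good baseline fast
}
```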
Track A — OpenAI API Flow
01 Upload
POST files endpoint with v{N}.jsonl · purpose: fine-tune · returns file_id
02 Create
POST fine_tuning/jobs · model: gpt-4o-mini-2024-07-18 · training_file: file_id · suffix: hre-v3
03 Poll
GET fine_tuning/jobs/{job_id} every 60s · status: queued → running → succeeded · emit progress events
04 Register
On success: write fine_tuned_model ID to data/clients/hre/model_registry.json · set active: true · run eval harness
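The four steps map onto the OpenAI Files and fine-tuning endpoints. A sketch that builds the job-creation and polling requests as plain data so their shape can be inspected without network access; the JobRequest type and helper names are ours, and the multipart file upload of step 01 is omitted:

```typescript
interface JobRequest {
  method: "POST" | "GET";
  path: string;
  body?: Record<string, string>;
}

// 02 Create: POST /v1/fine_tuning/jobs with the uploaded file's ID and a
// per-client suffix so the resulting model name is self-describing.
function createJobRequest(fileId: string, client: string, version: number): JobRequest {
  return {
    method: "POST",
    path: "/v1/fine_tuning/jobs",
    body: {
      model: "gpt-4o-mini-2024-07-18",
      training_file: fileId,
      suffix: `${client}-v${version}`,
    },
  };
}

// 03 Poll: GET /v1/fine_tuning/jobs/{job_id}; the caller repeats every
// 60 s until the status reaches "succeeded" or a terminal failure.
function pollJobRequest(jobId: string): JobRequest {
  return { method: "GET", path: `/v1/fine_tuning/jobs/${jobId}` };
}
```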
Track B — Ollama Modelfile + Local Flow
FROM llama3.1:8b

# System prompt baked from AccountState.system_prompt
SYSTEM """
You are a conversion tracking expert for HRE.
GTM: GTM-XXXXXXX | ws_4 | 31 tags, 18 triggers, 22 variables
Key tags: GA4 Config, GA4 - Purchase, AW - All Purchases, sGTM Bridge
dataLayer: page_view, add_to_cart, purchase, generate_lead
DLV: ecommerce.value, ecommerce.items[].price
Google Ads: 123-456-7890 | PMAX - Core, Brand - HRE, Retargeting
Conversions: All Purchases (value: ecommerce.value), Lead Submit
Meta Pixel: HRE-pixel-id | CAPI match: ~68% | Stape: cnt_abc123
Known: PMAX $0 value — sGTM reads top-level ecommerce.value but HRE pushes revenue inside items[].price * quantity
"""

# Fine-tune parameters
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
PARAMETER stop "<|eot_id|>"

# Model metadata
LABEL client_id="hre"
LABEL version="v3"
LABEL account_state_version="1.2.0"
LABEL training_examples="112"
LABEL created="2026-04-07"
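This Modelfile is generated from AccountState rather than written by hand. A minimal TypeScript sketch of that templating step; the AccountState interface below is illustrative, not the repo's actual type, and only a subset of the PARAMETER/LABEL lines is emitted:

```typescript
interface AccountState {
  clientId: string;        // e.g. "hre"
  version: string;         // e.g. "v3"
  systemPrompt: string;    // baked into the SYSTEM block verbatim
  trainingExamples: number;
}

// Render an Ollama Modelfile from account state: base model, baked
// system prompt, decoding parameters, and provenance labels.
function renderModelfile(s: AccountState, base: string = "llama3.1:8b"): string {
  return [
    `FROM ${base}`,
    `SYSTEM """`,
    s.systemPrompt,
    `"""`,
    `PARAMETER temperature 0.2`,
    `PARAMETER num_ctx 4096`,
    `LABEL client_id="${s.clientId}"`,
    `LABEL version="${s.version}"`,
    `LABEL training_examples="${s.trainingExamples}"`,
  ].join("\n");
}
```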
Model Registry — data/clients/{id}/model_registry.json

Client  | Version | Track | Eval Score | Model ID              | Status
hre     | v3      | A     | 0.84       | ft:gpt-4o-mini:hre-v3 | ● active
hre     | v2      | A     | 0.71       | ft:gpt-4o-mini:hre-v2 | archived
teleios | v1      | B     | pending    | teleios-client:v1     | ◐ eval
rtt     | —       | —     | —          | awaiting 50 examples  | queued
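Promotion and rollback over this registry are simple state flips: exactly one version is active per client at a time. A sketch, assuming in-memory entries shaped like the table's columns (the helper names are ours):

```typescript
interface RegistryEntry {
  version: string;           // "v1", "v2", ...
  track: "A" | "B";
  model_id: string;
  eval_score: number | null; // null while eval is pending
  active: boolean;
}

// promote(): activate the given version and archive every other entry,
// so the invariant "one active version" holds by construction.
function promote(registry: RegistryEntry[], version: string): RegistryEntry[] {
  return registry.map((e) => ({ ...e, active: e.version === version }));
}

// rollback(): reactivating a prior version is the same flip in reverse.
const rollback = promote;
```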
Eval Harness — Auto-Score Before Promoting

Problem Recall

Feed the 10-question holdout set. Score: did the model identify the correct root cause?

84%

Solution Accuracy

Compare model solution to known-correct fix. Scored by cosine similarity to reference answer.

0.87 sim

Regression Guard

v{N} eval score must exceed v{N-1} by margin. Rollback triggered if regression detected.

+0.13 Δ
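A sketch of the scoring and the guard, assuming bag-of-words cosine similarity (the production harness may embed responses instead) and the 0.05 tolerance from EVAL_REGRESSION_TOLERANCE:

```typescript
// Bag-of-words cosine similarity: a stand-in for the harness's scoring.
function cosineSim(a: string, b: string): number {
  const count = (s: string) => {
    const m = new Map<string, number>();
    for (const w of s.toLowerCase().split(/\s+/).filter(Boolean)) {
      m.set(w, (m.get(w) ?? 0) + 1);
    }
    return m;
  };
  const va = count(a);
  const vb = count(b);
  let dot = 0, na = 0, nb = 0;
  for (const [w, n] of va) { dot += n * (vb.get(w) ?? 0); na += n * n; }
  for (const [, n] of vb) nb += n * n;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Regression guard: v{N} may be promoted only if its score does not fall
// more than `tolerance` below v{N-1}'s score; otherwise abort promotion.
function passesRegressionGuard(newScore: number, prevScore: number, tolerance = 0.05): boolean {
  return newScore >= prevScore - tolerance;
}
```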
CLI Output — pnpm fine-tune submit --client hre --version 3
$ pnpm fine-tune submit --client hre --version 3 --track a

──────────────────────────────────────────────────────
  Fine-Tune Runner — HRE v3 · Track A (OpenAI)
──────────────────────────────────────────────────────
  Loading JSONL...           ✓ data/clients/hre/v3.jsonl (112 records)
  Uploading to OpenAI...     ✓ file_id: file-abc123xyz
  Creating fine-tune job...  ✓ job_id: ftjob-abc123
  Model suffix:              hre-v3

  Polling status...
    [00:00] queued
    [02:14] running — step 12/112
    [08:31] running — step 67/112
    [14:02] running — step 112/112
    [15:44] succeeded

  Fine-tuned model:          ft:gpt-4o-mini-2024-07-18:hre-v3

──────────────────────────────────────────────────────
  Running eval harness (holdout: v3_eval.jsonl)...
    Problem recall:          ✓ 84%
    Solution accuracy:       ✓ 0.87 cosine sim
    Regression vs v2:        ✓ +0.13 improvement

──────────────────────────────────────────────────────
  Registering model...       ✓ model_registry.json updated
  Promoting to active...     ✓ hre → ft:gpt-4o-mini:hre-v3
  Archiving v2...            ✓ archived
──────────────────────────────────────────────────────
  ✓ HRE v3 live — ready for Phase 5 OpenClaw routing
  Next: pnpm openclaw register --client hre --model ft:gpt-4o-mini:hre-v3
    
Claude Code Prompt
claude --dangerously-skip-permissions

# Phase 4: Fine-Tune Runner (Dual Track)
# Branch: feature/finetune-pipeline
# Repo: github.com/Organized-AI/gtm-autoresearch

## Context

Phases 1–3 complete. Phase 4 builds the runner that takes a versioned JSONL file and submits it to either:

Track A: OpenAI fine-tuning API (cloud, gpt-4o-mini)
Track B: Ollama on M3 Ultra via NoClaw :11434 (local, Llama 3.1 8B)

## Task

Read AGENT-HANDOFF/ and PLANNING/ first. Then:

1. Create packages/fine-tune-runner/

   FineTuneRunner interface:
   - submit(client_id, version, track): Promise
   - poll(job_id): Promise
   - evaluate(client_id, version): Promise
   - register(client_id, version, model_id, track): void
   - promote(client_id, version): void (sets active: true, archives prev)

   TrackA (OpenAI):
   - Upload JSONL: POST /v1/files (purpose: fine-tune)
   - Create job: POST /v1/fine_tuning/jobs
       model: "gpt-4o-mini-2024-07-18"
       suffix: "{client_id}-v{N}"
   - Poll: GET /v1/fine_tuning/jobs/{id} every 60s
   - On succeeded: extract fine_tuned_model string

   TrackB (Ollama via NoClaw):
   - Generate Modelfile from AccountState.system_prompt + base model
   - SSH/Tailscale to NoClaw host (100.86.248.8 or M3 Ultra)
   - Run: ollama create {client_id}-client:v{N} -f ./Modelfile
   - Register: ollama list to confirm creation
   - Model name pattern: "{client_id}-client:v{N}"

   EvalHarness:
   - Load data/clients/{client_id}/v{N}_eval.jsonl (holdout set)
   - For each eval record: call the new model with the user message
   - Score: cosine_sim(model_response, expected_assistant_response)
   - Aggregate: problem_recall (exact match %), solution_accuracy (avg sim)
   - Regression guard: new eval_score must be ≥ prev_version score - 0.05
   - On regression: abort promotion, alert

   ModelRegistry:
   - File: data/clients/{client_id}/model_registry.json
   - Schema: [{version, track, model_id, eval_score, active, created_at}]
   - promote(): set active=true on new, active=false on all prev versions
   - rollback(version): reactivate a prior version

2. CLI:
   pnpm fine-tune submit --client hre --version 3 --track a
   pnpm fine-tune submit --client teleios --version 1 --track b
   pnpm fine-tune eval --client hre --version 3
   pnpm fine-tune rollback --client hre --version 2

3. Unit tests:
   - TrackA: mock OpenAI API, assert correct file upload + job creation
   - TrackB: mock Ollama CLI output, assert Modelfile generation
   - EvalHarness: assert regression guard triggers correctly
   - ModelRegistry: promote/rollback state transitions

## Env vars:

OPENAI_API_KEY=sk-...
OLLAMA_HOST=http://100.86.248.8:11434
CLIENT_DATA_DIR=./data/clients
EVAL_REGRESSION_TOLERANCE=0.05

## Track selection per client (data/clients/{id}/config.json):

hre: track_preference: "a" (MVP phase)
teleios: track_preference: "b" (sensitive data, no egress)

## Do NOT build Phase 5 (OpenClaw integration) yet.
← Phase 3: JSONL Pipeline · gtm-autoresearch-docs.pages.dev · All Docs →