Two parallel execution paths: the OpenAI API for fast cloud fine-tunes, and Ollama/MLX for private local fine-tunes on the M3 Ultra. The runner is triggered automatically when a versioned JSONL file is ready.
Track A pipeline (OpenAI):
1. Upload: POST to the files endpoint with v{N}.jsonl · purpose: fine-tune · returns file_id
2. Create job: POST fine_tuning/jobs · model: gpt-4o-mini-2024-07-18 · training_file: file_id · suffix: hre-v3
3. Poll: GET fine_tuning/jobs/{job_id} every 60s · status: queued → running → succeeded · emit progress events
4. Register: write the fine_tuned_model ID to data/clients/hre/model_registry.json · set active: true · run eval harness

Track B Modelfile (Ollama):
FROM llama3.1:8b
# System prompt baked from AccountState.system_prompt
SYSTEM """
You are a conversion tracking expert for HRE.
GTM: GTM-XXXXXXX | ws_4 | 31 tags, 18 triggers, 22 variables
Key tags: GA4 Config, GA4 - Purchase, AW - All Purchases, sGTM Bridge
dataLayer: page_view, add_to_cart, purchase, generate_lead
DLV: ecommerce.value, ecommerce.items[].price
Google Ads: 123-456-7890 | PMAX - Core, Brand - HRE, Retargeting
Conversions: All Purchases (value: ecommerce.value), Lead Submit
Meta Pixel: HRE-pixel-id | CAPI match: ~68% | Stape: cnt_abc123
Known: PMAX $0 value — sGTM reads top-level ecommerce.value
but HRE pushes revenue inside items[].price * quantity
"""
# Fine-tune parameters
PARAMETER temperature 0.2
PARAMETER num_ctx 4096
PARAMETER stop "<|eot_id|>"
# Model metadata
LABEL client_id="hre"
LABEL version="v3"
LABEL account_state_version="1.2.0"
LABEL training_examples="112"
LABEL created="2026-04-07"
Feed the model the 10-question holdout set. Score: did it identify the correct root cause?
Compare the model's solution to the known-correct fix, scored by cosine similarity to the reference answer.
The v{N} eval score must exceed the v{N-1} score by a margin; rollback is triggered if a regression is detected.
$ pnpm fine-tune submit --client hre --version 3 --track a
──────────────────────────────────────────────────────
Fine-Tune Runner — HRE v3 · Track A (OpenAI)
──────────────────────────────────────────────────────
Loading JSONL...          ✓ data/clients/hre/v3.jsonl (112 records)
Uploading to OpenAI...    ✓ file_id: file-abc123xyz
Creating fine-tune job... ✓ job_id: ftjob-abc123
Model suffix: gpt-4o-mini-2024-07-18:hre-v3
Polling status...
  [00:00] queued
  [02:14] running — step 12/112
  [08:31] running — step 67/112
  [14:02] running — step 112/112
  [15:44] succeeded
Fine-tuned model: ft:gpt-4o-mini-2024-07-18:hre-v3
──────────────────────────────────────────────────────
Running eval harness (holdout: v3_eval.jsonl)...
  Problem recall:     ✓ 84%
  Solution accuracy:  ✓ 0.87 cosine sim
  Regression vs v2:   ✓ +0.13 improvement
──────────────────────────────────────────────────────
Registering model...   ✓ model_registry.json updated
Promoting to active... ✓ hre → ft:gpt-4o-mini:hre-v3
Archiving v2...        ✓ archived
──────────────────────────────────────────────────────
✓ HRE v3 live — ready for Phase 5 OpenClaw routing
Next: pnpm openclaw register --client hre --model ft:gpt-4o-mini:hre-v3
claude --dangerously-skip-permissions
# Phase 4: Fine-Tune Runner (Dual Track)
# Branch: feature/finetune-pipeline
# Repo: github.com/Organized-AI/gtm-autoresearch
## Context
Phases 1–3 complete. Phase 4 builds the runner that takes a
versioned JSONL file and submits it to either:
- Track A: OpenAI fine-tuning API (cloud, gpt-4o-mini)
- Track B: Ollama on M3 Ultra via NoClaw :11434 (local, Llama 3.1 8B)
## Task
Read AGENT-HANDOFF/ and PLANNING/ first. Then:
1. Create packages/fine-tune-runner/
FineTuneRunner interface:
- submit(client_id, version, track): Promise
- poll(job_id): Promise
- evaluate(client_id, version): Promise
- register(client_id, version, model_id, track): void
- promote(client_id, version): void (sets active: true, archives prev)
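The contract above, sketched in TypeScript. The concrete result types (`JobHandle`, `JobStatus`, `EvalReport`) are assumptions, since the spec leaves the Promise payloads unspecified:

```typescript
// Illustrative sketch of the FineTuneRunner contract; type names are assumed.
type Track = "a" | "b";

interface JobHandle { jobId: string; track: Track; }
interface JobStatus { state: "queued" | "running" | "succeeded" | "failed"; step?: number; }
interface EvalReport { problemRecall: number; solutionAccuracy: number; }

interface FineTuneRunner {
  submit(clientId: string, version: number, track: Track): Promise<JobHandle>;
  poll(jobId: string): Promise<JobStatus>;
  evaluate(clientId: string, version: number): Promise<EvalReport>;
  register(clientId: string, version: number, modelId: string, track: Track): void;
  // Sets active: true on the new version and archives the previous one.
  promote(clientId: string, version: number): void;
}
```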
TrackA (OpenAI):
- Upload JSONL: POST /v1/files (purpose: fine-tune)
- Create job: POST /v1/fine_tuning/jobs
model: "gpt-4o-mini-2024-07-18"
suffix: "{client_id}-v{N}"
- Poll: GET /v1/fine_tuning/jobs/{id} every 60s
- On succeeded: extract fine_tuned_model string
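A minimal sketch of the Track A job payload as a pure builder function (`buildJobPayload` is a hypothetical helper, not part of the spec); the actual HTTP call would wrap this with `fetch` and an Authorization header built from `OPENAI_API_KEY`:

```typescript
// Build the fine-tune job body for POST /v1/fine_tuning/jobs.
// buildJobPayload is an illustrative helper name.
interface JobPayload {
  model: string;
  training_file: string;
  suffix: string;
}

function buildJobPayload(clientId: string, version: number, fileId: string): JobPayload {
  return {
    model: "gpt-4o-mini-2024-07-18",
    training_file: fileId,             // file_id returned by POST /v1/files
    suffix: `${clientId}-v${version}`, // e.g. "hre-v3"
  };
}
```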
TrackB (Ollama via NoClaw):
- Generate Modelfile from AccountState.system_prompt + base model
- SSH/Tailscale to NoClaw host (100.86.248.8 or M3 Ultra)
- Run: ollama create {client_id}-client:v{N} -f ./Modelfile
- Register: ollama list to confirm creation
- Model name pattern: "{client_id}-client:v{N}"
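The Track B invocation can be sketched as string composition (`ollamaModelName` and `ollamaCreateCmd` are illustrative helpers; the SSH/Tailscale transport is left out):

```typescript
// Model name pattern "{client_id}-client:v{N}" from the spec above.
function ollamaModelName(clientId: string, version: number): string {
  return `${clientId}-client:v${version}`; // e.g. "teleios-client:v1"
}

// Compose the ollama create command to run on the NoClaw host.
function ollamaCreateCmd(clientId: string, version: number, modelfile = "./Modelfile"): string {
  return `ollama create ${ollamaModelName(clientId, version)} -f ${modelfile}`;
}
```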
EvalHarness:
- Load data/clients/{client_id}/v{N}_eval.jsonl (holdout set)
- For each eval record: call the new model with user message
- Score: cosine_sim(model_response, expected_assistant_response)
- Aggregate: problem_recall (exact match %), solution_accuracy (avg sim)
- Regression guard: new eval_score must be ≥ prev_version score - 0.05
- On regression: abort promotion, alert
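A sketch of the two scoring primitives, assuming responses are embedded into vectors first (the embedding model is not specified here); the guard mirrors the 0.05 tolerance from `EVAL_REGRESSION_TOLERANCE`:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Regression guard: new score must be >= previous score minus the tolerance.
function passesRegressionGuard(newScore: number, prevScore: number, tolerance = 0.05): boolean {
  return newScore >= prevScore - tolerance;
}
```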
ModelRegistry:
- File: data/clients/{client_id}/model_registry.json
- Schema: [{version, track, model_id, eval_score, active, created_at}]
- promote(): set active=true on new, active=false on all prev versions
- rollback(version): reactivate a prior version
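The registry transitions can be sketched as pure functions over the schema above; note that rollback is the same state transition as promote, just aimed at a prior version:

```typescript
// One entry per fine-tuned version, matching the registry schema above.
interface RegistryEntry {
  version: number;
  track: "a" | "b";
  model_id: string;
  eval_score: number;
  active: boolean;
  created_at: string;
}

// promote: activate the given version, deactivate all others.
function promote(registry: RegistryEntry[], version: number): RegistryEntry[] {
  return registry.map(e => ({ ...e, active: e.version === version }));
}

// rollback: reactivate a prior version — identical transition.
const rollback = promote;
```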
2. CLI:
pnpm fine-tune submit --client hre --version 3 --track a
pnpm fine-tune submit --client teleios --version 1 --track b
pnpm fine-tune eval --client hre --version 3
pnpm fine-tune rollback --client hre --version 2
3. Unit tests:
- TrackA: mock OpenAI API, assert correct file upload + job creation
- TrackB: mock Ollama CLI output, assert Modelfile generation
- EvalHarness: assert regression guard triggers correctly
- ModelRegistry: promote/rollback state transitions
## Env vars:
OPENAI_API_KEY=sk-...
OLLAMA_HOST=http://100.86.248.8:11434
CLIENT_DATA_DIR=./data/clients
EVAL_REGRESSION_TOLERANCE=0.05
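A small sketch of reading the tolerance with its documented default (`regressionTolerance` is an illustrative helper):

```typescript
// Parse EVAL_REGRESSION_TOLERANCE, falling back to the documented 0.05 default.
function regressionTolerance(env: Record<string, string | undefined>): number {
  const raw = env["EVAL_REGRESSION_TOLERANCE"];
  const n = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(n) ? n : 0.05;
}
```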
## Track selection per client (data/clients/{id}/config.json):
hre: track_preference: "a" (MVP phase)
teleios: track_preference: "b" (sensitive data, no egress)
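Track selection per client can be sketched as (`selectTrack` is an illustrative helper; defaulting to Track A when the preference is absent is an assumption, not stated in the spec):

```typescript
// Pick the fine-tune track from data/clients/{id}/config.json.
// Assumption: clients without a preference default to Track A.
function selectTrack(config: { track_preference?: "a" | "b" }): "a" | "b" {
  return config.track_preference ?? "a";
}
```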
## Do NOT build Phase 5 (OpenClaw integration) yet.