gtm-autoresearch // feature/finetune-pipeline

Phase 2: Account State Collector

Pulls live client account data from GTM, Google Ads MCP, and Pipeboard Meta MCP, then normalizes it into a versioned AccountState JSON object. The rendered AccountState is injected as system prompt context into every training record.

// AccountState Schema
AccountState {
  client_id:        string                // "bioptimizers" | "teleios" | "rtt"
  snapshot_at:      ISO 8601 timestamp    // when this state was captured
  version:          semver string         // "1.0.0" → bump on meaningful change

  gtm: {
    container_id:    string                // "GTM-XXXXXXX"
    workspace_id:    string
    tags: [{
      name, type, firing_triggers[], paused, notes
    }]
    triggers: [{
      name, type, filter_conditions[], custom_event_filter
    }]
    variables: [{
      name, type, parameter (dataLayer key, JS var, etc.)
    }]
    datalayer_schema: string[]             // inferred event names from triggers
    published_vs_draft_diff: string       // summary of unpublished changes
  }

  google_ads: {
    account_id:      string                // "123-456-7890"
    campaigns: [{
      id, name, type (PMAX|SEARCH|DISPLAY|VIDEO), status, budget
    }]
    conversion_actions: [{
      id, name, tag_snippet, category, value_settings, enhanced_conversions
    }]
    linked_gtm_containers: string[]
    performance_flags: string[]            // known issues like "$0 value on PMAX"
  }

  meta: {
    pixel_id:        string
    ad_account_id:   string
    events_fired: [{
      event_name, source (pixel|capi|both), last_received
    }]
    capi_config: {
      enabled, access_token_set, stape_container_id, match_rate
    }
    custom_audiences: [{
      id, name, size_range, source
    }]
  }

  memory_summary:   string                // top 5 past fixes from claude-mem Chroma
  known_issues:     string[]             // manually flagged or auto-detected
  system_prompt:    string                // pre-rendered, ready to inject into JSONL
}
    
// Data Sources + Field Mapping

GTM MCP (Stape)

gtm-mcp.stape.ai/mcp
Container JSON export + workspace diff
Field                              Type
Container export JSON              JSON
Tag list (name, type, triggers)    array
Trigger filter conditions          array
Variable definitions               array
Workspace vs published diff        string
Custom event names (inferred)      string[]

Google Ads MCP

mcp.gaql.app
GAQL queries via TrueClicks MCP
Field                              Type
Campaign IDs, names, types         array
Conversion action IDs + names      array
Enhanced Conversions flag          bool
Linked GTM container IDs           string[]
Value settings per conversion      object
Account-level currency + timezone  string

Pipeboard Meta MCP

mcp.pipeboard.co/meta-ads-mcp
Pixel diagnostics + CAPI config
Field                              Type
Pixel ID + dataset ID              string
Events fired (name, source)        array
CAPI match rate estimate           float
Stape container ID                 string
Ad account structure               object
Custom audience list               array

// MCP Tool Calls Per Source
Source     | MCP Tool | Output Field | Notes
-----------|----------|--------------|------
GTM        | export_container | gtm.tags, triggers, variables | Parse JSON → normalize tag schema
GTM        | list_workspaces | gtm.workspace_id | Always grab latest workspace
Google Ads | GAQL: SELECT campaign.id, campaign.name, campaign.advertising_channel_type | google_ads.campaigns | Filter: status != REMOVED
Google Ads | GAQL: SELECT conversion_action.id, conversion_action.name, conversion_action.tag_snippets | google_ads.conversion_actions | Include value_settings object
Meta       | get_pixels | meta.pixel_id, events_fired | Pull last 7d event diagnostics
Meta       | get_custom_audiences | meta.custom_audiences | Size + source only (no PII)
claude-mem | Chroma similarity query | memory_summary | Top 5 by relevance to client_id
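
The two GAQL rows above list only the SELECT fields. A complete GAQL statement also needs a FROM clause, and the "status != REMOVED" note maps to a WHERE clause. The sketch below shows one way the two queries could be assembled. `buildGaql` is a hypothetical helper, the `value_settings` sub-fields are one possible selection, and how the TrueClicks MCP expects the query string to be passed is not shown in this document.

```typescript
// Hypothetical helper: assembles a GAQL statement from its parts.
function buildGaql(resource: string, fields: string[], where?: string): string {
  const base = `SELECT ${fields.join(", ")} FROM ${resource}`;
  return where ? `${base} WHERE ${where}` : base;
}

// Campaign query with the "status != REMOVED" filter from the table.
// campaign.status is added to the SELECT list so it can fill the schema's status field.
const CAMPAIGNS_QUERY = buildGaql(
  "campaign",
  [
    "campaign.id",
    "campaign.name",
    "campaign.advertising_channel_type",
    "campaign.status",
  ],
  "campaign.status != 'REMOVED'",
);

// Conversion action query; the value_settings fields chosen here are an assumption.
const CONVERSION_ACTIONS_QUERY = buildGaql("conversion_action", [
  "conversion_action.id",
  "conversion_action.name",
  "conversion_action.tag_snippets",
  "conversion_action.category",
  "conversion_action.value_settings.default_value",
  "conversion_action.value_settings.always_use_default_value",
]);
```

Keeping the field lists as arrays makes it easy to diff a query change against the AccountState fields it feeds.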
// GTM Container JSON → Normalized Schema Transform
01
Export Raw Container

Call GTM MCP export_container(containerId, workspaceId). The raw export is a deeply nested Google format that identifies tags, triggers, and variables by short type codes and numeric IDs. Save it as raw_gtm_{timestamp}.json for versioning.

02
Tag Normalization

Map tag.type short codes → human labels (e.g. ua→"Universal Analytics", awct→"Google Ads Conversion Tracking", gclidw→"Conversion Linker", html→"Custom HTML"). Extract firing trigger names by joining each tag's firingTriggerId values against the trigger list.
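
A minimal sketch of this step, assuming trimmed-down shapes for the export (`RawTag`, `RawTrigger`) and a partial lookup table. The interface names, the `NormalizedTag` shape, and the fallback strings are illustrative, not part of the GTM export format.

```typescript
// Hypothetical, trimmed-down shapes for a GTM container export.
interface RawTag {
  name: string;
  type: string;               // short type code, e.g. "html"
  firingTriggerId?: string[]; // trigger IDs, not names
  paused?: boolean;
  notes?: string;
}

interface RawTrigger {
  triggerId: string;
  name: string;
}

interface NormalizedTag {
  name: string;
  type: string;
  firing_triggers: string[];
  paused: boolean;
  notes: string;
}

// Partial lookup table (assumed); extend it as new tag types show up.
const TAG_TYPE_LABELS: Record<string, string> = {
  ua: "Universal Analytics",
  awct: "Google Ads Conversion Tracking",
  html: "Custom HTML",
};

function normalizeTags(tags: RawTag[], triggers: RawTrigger[]): NormalizedTag[] {
  // Index triggers by ID so each tag's firingTriggerId list can be joined to names.
  const triggerNames = new Map<string, string>();
  for (const t of triggers) triggerNames.set(t.triggerId, t.name);

  return tags.map((tag) => {
    const label: string | undefined = TAG_TYPE_LABELS[tag.type];
    return {
      name: tag.name,
      // Unknown codes are kept verbatim so nothing is silently dropped.
      type: label ?? tag.type,
      firing_triggers: (tag.firingTriggerId ?? []).map(
        (id) => triggerNames.get(id) ?? `unknown trigger ${id}`,
      ),
      paused: tag.paused ?? false,
      notes: tag.notes ?? "",
    };
  });
}
```

Falling back to the raw code for unmapped types keeps the normalized output lossless while the lookup table is still incomplete.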

03
Trigger Filter Extraction

Parse filter[] conditions on each trigger. For Custom Event triggers, extract the compared value from customEventFilter[].parameter[] (the arg1 entry in GTM exports) as the dataLayer event name. These names become the datalayer_schema[] array.
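
The extraction could look roughly like the sketch below. It assumes that Custom Event triggers are exported with type "CUSTOM_EVENT" and that each customEventFilter entry holds a parameter list in which the `arg1` entry carries the event name. Both should be checked against a real export. The interface names are illustrative.

```typescript
// Hypothetical, trimmed-down shapes for trigger filters in a GTM export.
interface FilterParameter {
  key: string;
  value: string;
}

interface TriggerFilter {
  type: string; // e.g. "EQUALS", "MATCH_REGEX"
  parameter: FilterParameter[];
}

interface RawTrigger {
  name: string;
  type: string;
  customEventFilter?: TriggerFilter[];
}

function extractDatalayerSchema(triggers: RawTrigger[]): string[] {
  const events = new Set<string>();
  for (const trigger of triggers) {
    if (trigger.type !== "CUSTOM_EVENT") continue;
    for (const filter of trigger.customEventFilter ?? []) {
      // Only exact matches yield a literal event name; regex and
      // "contains" filters are skipped rather than guessed at.
      if (filter.type !== "EQUALS") continue;
      const arg1 = filter.parameter.find((p) => p.key === "arg1");
      if (arg1 && arg1.value !== "") events.add(arg1.value);
    }
  }
  // Sorted and de-duplicated so snapshots diff cleanly between runs.
  return Array.from(events).sort();
}
```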

04
Variable Parameter Mapping

For each variable: extract type (v for Data Layer Variable, jsm for Custom JavaScript, k for first-party cookie, etc.) and parameter (the actual dataLayer key path like ecommerce.value). This tells the fine-tuned model exactly where data lives in the client's dataLayer schema.
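
As a rough illustration, the mapping can be driven by a small table that says which parameter key holds the useful value for each variable type. The sketch assumes Data Layer Variables are exported with type code `v` and store their key path under the `name` parameter. The table contents and interface names are assumptions to verify against a real export.

```typescript
// Hypothetical, trimmed-down shapes for variables in a GTM export.
interface Param {
  key: string;
  value?: string;
}

interface RawVariable {
  name: string;
  type: string;
  parameter?: Param[];
}

interface NormalizedVariable {
  name: string;
  type: string;
  parameter: string; // e.g. "ecommerce.value"; empty when unknown
}

// Which parameter key carries the "where the data lives" value, per type (assumed).
const PRIMARY_PARAM_KEY: Record<string, string> = {
  v: "name",  // Data Layer Variable: dataLayer key path
  k: "name",  // first-party cookie: cookie name
  c: "value", // constant: literal value
};

function normalizeVariables(vars: RawVariable[]): NormalizedVariable[] {
  return vars.map((v) => {
    const key: string | undefined = PRIMARY_PARAM_KEY[v.type];
    const hit =
      key === undefined
        ? undefined
        : (v.parameter ?? []).find((p) => p.key === key);
    // Types without a mapping (e.g. jsm) keep an empty parameter instead of failing.
    return { name: v.name, type: v.type, parameter: hit?.value ?? "" };
  });
}
```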

05
Published vs Draft Diff

Compare live container export against current workspace. Any tags/triggers added in workspace but not yet published get flagged in published_vs_draft_diff. Prevents training on configs that aren't live yet.
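
A minimal sketch of the comparison, assuming both the live export and the workspace export have already been reduced to lists of named tags and triggers. It only detects additions by name, which is the case described above. Edits to existing tags would need a deeper comparison. All names here are illustrative.

```typescript
interface NamedEntity {
  name: string;
}

interface ContainerSnapshot {
  tags: NamedEntity[];
  triggers: NamedEntity[];
}

// Names present in the workspace draft but absent from the live version.
function unpublishedNames(live: NamedEntity[], draft: NamedEntity[]): string[] {
  const liveNames = new Set(live.map((e) => e.name));
  return draft.map((e) => e.name).filter((n) => !liveNames.has(n));
}

function summarizeDraftDiff(
  live: ContainerSnapshot,
  workspace: ContainerSnapshot,
): string {
  const tags = unpublishedNames(live.tags, workspace.tags);
  const triggers = unpublishedNames(live.triggers, workspace.triggers);
  if (tags.length === 0 && triggers.length === 0) return "No unpublished changes";

  const parts: string[] = [];
  if (tags.length > 0) parts.push(`unpublished tags: ${tags.join(", ")}`);
  if (triggers.length > 0) parts.push(`unpublished triggers: ${triggers.join(", ")}`);
  return parts.join("; ");
}
```

The returned string is what would populate published_vs_draft_diff.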

06
System Prompt Render

Flatten the full AccountState into a compact string suitable for the system field in training JSONL. Target: under 800 tokens. Use abbreviation for large arrays (e.g. "27 tags including: GA4 Config, PMAX Conversion, [+25 more]").
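
The abbreviation rule and the 800-token target can be sketched as two small helpers. The token count below is a crude characters-divided-by-four estimate, which is an assumption. A real guard would use the tokenizer of the model being fine-tuned. The function names are illustrative.

```typescript
// Collapses a long list into "N label including: a, b, [+K more]".
function abbreviate(label: string, items: string[], keep: number = 2): string {
  if (items.length <= keep) {
    return `${items.length} ${label}: ${items.join(", ")}`;
  }
  const shown = items.slice(0, keep).join(", ");
  return `${items.length} ${label} including: ${shown}, [+${items.length - keep} more]`;
}

// Rough heuristic (assumption): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Throws when the rendered prompt exceeds the budget; returns it unchanged otherwise.
function assertWithinBudget(prompt: string, maxTokens: number = 800): string {
  const n = estimateTokens(prompt);
  if (n > maxTokens) {
    throw new Error(`system_prompt is ~${n} tokens, over the ${maxTokens} budget`);
  }
  return prompt;
}
```

Failing loudly in the guard, instead of truncating, keeps a silently clipped prompt from entering the training JSONL.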

// Rendered System Prompt Output (HRE Example)
You are a conversion tracking expert for HRE.

GTM: GTM-XXXXXXX | Workspace: ws_4 | 31 tags, 18 triggers, 22 variables
Key tags: GA4 Config, GA4 - Purchase, AW - All Purchases (PMAX), sGTM Bridge - Ecommerce, Custom HTML - DataLayer Push
Key triggers: Page View, Purchase (dataLayer: 'purchase'), Lead Submit (dataLayer: 'generate_lead')
dataLayer schema: page_view, view_item, add_to_cart, begin_checkout, purchase, generate_lead
DLV mappings: ecommerce.value, ecommerce.currency, ecommerce.items[].item_id, ecommerce.items[].price

Google Ads: 123-456-7890 | Currency: USD
Campaigns: PMAX - Core (PMAX), Brand - HRE (SEARCH), Retargeting - Site Visitors (DISPLAY)
Conversions: All Purchases (AW-xxx/yyy, value: ecommerce.value), Lead Form Submit (AW-xxx/zzz, count-only)
Enhanced Conversions: ENABLED

Meta Pixel: HRE-pixel-id | Ad Account: act_12345
Events (browser+CAPI): PageView, ViewContent, AddToCart, InitiateCheckout, Purchase
CAPI match rate: ~68% | Stape container: cnt_abc123

Known issues: PMAX conversion shows $0 value. Root cause: sGTM purchase tag reads top-level ecommerce.value but HRE pushes revenue inside items[].price * quantity. Past fix: updated sGTM variable to sum items array.
Memory: Cart abandonment tracking fixed 2025-11, dataLayer conflict resolved 2025-09, PMAX value fix deployed 2026-01

Snapshot Versioning Strategy

AccountState is versioned with semver. Patch bump (1.0.x): campaign names changed, new tag added. Minor bump (1.x.0): dataLayer schema changed, conversion action added. Major bump (x.0.0): platform migrated, GTM container rebuilt. Each experiment record stores the AccountState version it was generated under, so after a major account restructure you can retrain from scratch on post-migration experiments only by filtering on version range.
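
The bump rules above can be sketched as a lookup from detected change kinds to a bump level, with the highest level winning. The `ChangeKind` labels are hypothetical names for the examples given in the paragraph. How changes are detected between two snapshots is out of scope here.

```typescript
// Hypothetical labels for the kinds of change named in the versioning rules.
type ChangeKind =
  | "platform_migrated"
  | "container_rebuilt"
  | "datalayer_schema_changed"
  | "conversion_action_added"
  | "campaign_renamed"
  | "tag_added";

type Bump = "none" | "patch" | "minor" | "major";

const BUMP_FOR_CHANGE: Record<ChangeKind, Bump> = {
  platform_migrated: "major",
  container_rebuilt: "major",
  datalayer_schema_changed: "minor",
  conversion_action_added: "minor",
  campaign_renamed: "patch",
  tag_added: "patch",
};

// Ordered from weakest to strongest so indexOf can rank bumps.
const BUMP_RANK: Bump[] = ["none", "patch", "minor", "major"];

function pickBump(changes: ChangeKind[]): Bump {
  let best: Bump = "none";
  for (const change of changes) {
    const bump = BUMP_FOR_CHANGE[change];
    if (BUMP_RANK.indexOf(bump) > BUMP_RANK.indexOf(best)) best = bump;
  }
  return best;
}

function bumpVersion(version: string, bump: Bump): string {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!m) throw new Error(`not a plain semver string: ${version}`);
  const major = Number(m[1]);
  const minor = Number(m[2]);
  const patch = Number(m[3]);
  switch (bump) {
    case "major":
      return `${major + 1}.0.0`;
    case "minor":
      return `${major}.${minor + 1}.0`;
    case "patch":
      return `${major}.${minor}.${patch + 1}`;
    default:
      return version; // no meaningful change: keep the version as-is
  }
}
```

For example, a snapshot that adds a tag and also changes the dataLayer schema takes the minor bump, since the strongest change decides.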

// Claude Code Prompt
claude --dangerously-skip-permissions

# Phase 2: Account State Collector
# Branch: feature/finetune-pipeline
# Repo: github.com/Organized-AI/gtm-autoresearch

## Context
Phase 1 (ExperimentLogger) is complete. Now building the Account State Collector that pulls live client data from GTM, Google Ads, and Meta, normalizes it into a versioned AccountState object, and renders a compact system prompt string.

## Task
Read AGENT-HANDOFF/ and PLANNING/ first. Then:

1. Create packages/account-state-collector/
   AccountState TypeScript interface (full schema):
   - client_id, snapshot_at, version (semver)
   - gtm: { container_id, workspace_id, tags[], triggers[], variables[], datalayer_schema[], draft_diff }
   - google_ads: { account_id, campaigns[], conversion_actions[], enhanced_conversions, performance_flags[] }
   - meta: { pixel_id, ad_account_id, events_fired[], capi_config{}, custom_audiences[] }
   - memory_summary (string from Chroma top-5 query)
   - known_issues (string[])
   - system_prompt (pre-rendered, target ≤800 tokens)

2. Collectors (one file per source):
   - gtm-collector.ts → calls GTM MCP (gtm-mcp.stape.ai/mcp) export_container + list_workspaces
     Normalize: map tag type codes → labels, extract datalayer_schema from customEventFilter values
   - google-ads-collector.ts → GAQL via TrueClicks MCP
     Queries: campaigns + conversion_actions (see PLANNING/)
   - meta-collector.ts → Pipeboard Meta MCP get_pixels + get_custom_audiences
   - memory-collector.ts → query Chroma :37777 by client_id, return top 5 relevant memories as summary string

3. system-prompt-renderer.ts
   Flatten AccountState → compact string ≤800 tokens
   Abbreviate large arrays: "27 tags including: X, Y, [+25 more]"
   Include known_issues and memory_summary at bottom

4. Versioning:
   - Compare new snapshot vs previous stored version
   - Semver bump logic: schema change = minor, platform migration = major, minor changes = patch
   - Store at: data/clients/{client_id}/account_state_v{N}.json

5. CLI: `pnpm account-state collect --client bioptimizers`
   Runs all collectors, renders system_prompt, saves versioned file

6. Unit tests for:
   - GTM tag normalization (type code → label)
   - datalayer_schema extraction from trigger filters
   - system_prompt token count guard (assert ≤800)
   - semver bump logic

## Env vars:
GTM_MCP_URL=https://gtm-mcp.stape.ai/mcp
GOOGLE_ADS_MCP_URL=https://mcp.gaql.app/sse/google-ads/...
META_MCP_URL=https://mcp.pipeboard.co/meta-ads-mcp
CHROMA_URL=http://localhost:37777
CLIENT_DATA_DIR=./data/clients

## Clients to test against:
- hre (GTM-XXXXXXX, primary repo client)
- teleios (GTM-WM5S3WSG)

Do NOT build Phase 3 (training data pipeline) yet.