Mycelia Present · rendered 2026-05-26T15:09:31.367Z · source: ../convivium/clients/emerson_fry/REMY_COHERENCE_CHECK_SPEC_v0.1.md

Remy Four-Layer Brand Coherence Check — Implementation Spec v0.1

Filed: 2026-05-26 by Mycelia
For: Faber (implementation-ready engineering spec)
Grounded in: BRAND_OPERATING_v0.2.md §6.3 (Remy fourth-layer coherence check) + §6.4 (Model Profile coherence-gate stage)
Status: v0.1. Faber edits in place if scope reads wrong; ship updated version after build
Priority: Post-Thu reveal. Not Thursday-blocking. This is Y1 architecture work that becomes Day-30+ in the engagement.

What it is + why

Right now Remy validates outputs against voice rules (lexical patterns + tier mappings + ban list). That's surface-layer only. A Remy output can follow every voice rule and still violate the brand's soul (saying something the painter would never say), the system (offering a sale for a "no sales ever" brand), or the story (inventing artisan-partnership details that don't exist).

The four-layer coherence check is a middleware that runs after generation and before external-surface ship. It asks four questions per output:

SOUL — does this respect the originating conviction (painter-discipline / restraint / curatorial-taste)?
SYSTEM — does this match operational reality (no-sale, made-to-order, named-artisan-partnerships, natural-fibers-only)?
STORY — does this serve a narrative the brand actually owns? (not invented, not borrowed-from-category)
SURFACE — does this read as unmistakably EF on first encounter? (voice rules, painter colors, "+", three voice tiers, natural-light register)

Output is gated by check result. Fail any layer → human review (founder approval for "e." tier, brand-approval for everything else). Pass all four → ship-eligible.

The check is the discipline. Volume amplifies whatever signal you produce; the check prevents drift-at-scale.

Where it slots in (architecture)

Current pipeline:

user_request → Remy.generate() → output → external surface (email, social, etc.)

New pipeline:

user_request → Remy.generate() → output → BrandCoherenceCheck.evaluate() → {pass | flag | hold} → human-review-gate (if needed) → external surface

Implementation surface: new middleware module lib/agent/brand-coherence-check.ts (or similar). Called by chat.ts after generation; called by lib/image/generate.ts (or wherever image generation lives) after image production.

Persistence: add brandCoherenceCheck: { passed: boolean, failedLayers: string[], reasoning: string, requiresApproval: boolean } to the existing generation record schema. Already-saved generations get the field on next access (default to null = not-yet-checked).
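A minimal TypeScript sketch of the persisted field and the default-on-access behavior. Only the four fields inside brandCoherenceCheck come from this spec; GenerationRecord and withCoherenceDefault are hypothetical names, and the real record schema may look different.

```typescript
// Shape of the persisted check, taken from the field list above.
interface BrandCoherenceCheckRecord {
  passed: boolean;
  failedLayers: string[];
  reasoning: string;
  requiresApproval: boolean;
}

// Hypothetical stand-in for the existing generation record.
interface GenerationRecord {
  id: string;
  text: string;
  // null = not yet checked; undefined = legacy record saved before the field existed.
  brandCoherenceCheck?: BrandCoherenceCheckRecord | null;
}

type CheckedGenerationRecord = GenerationRecord & {
  brandCoherenceCheck: BrandCoherenceCheckRecord | null;
};

// Backfill on read: legacy records get an explicit null, existing checks are kept.
function withCoherenceDefault(record: GenerationRecord): CheckedGenerationRecord {
  const check = record.brandCoherenceCheck === undefined ? null : record.brandCoherenceCheck;
  return { ...record, brandCoherenceCheck: check };
}
```

Backfilling on read like this would avoid a write-migration, provided the store tolerates records that lack the field.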

The four checks — operationally

Each check is an LLM-as-judge call to a fast model (Sonnet or Haiku — economics + latency favor Haiku for these short structured judgments). Single prompt per check OR one combined prompt per output (combined is cheaper; single-check-per-call is more debuggable). Recommend combined initially; split if reasoning quality drops.

SOUL check

Prompt skeleton:

Given EF's foundational soul (Emerson trained as a classical painter; the brand's discipline is "see longer, name precisely, restrain"), evaluate the following output. Does it respect this soul, or does it violate it (loud, trend-chasing, hyperbolic, status-coded)?

Reply with: PASS / FAIL + brief reasoning.

Fail signals to watch: hyperbole ("game-changing," "revolutionary"), urgency ("don't miss," "limited"), status-coding ("exclusive," "elite"), unrestraint, off-painterly metaphors.

SYSTEM check

Prompt skeleton:

EF's operational system has these structural rules: no sales ever (Memorial Day / July 4 / Labor Day / Black Friday all out); made-to-order discipline (sold-out is default state); named-artisan-partnerships (Italy / Portugal / Peru / USA / Rajasthan — these are the only countries, named relationships); natural-fibers-only (organic cotton, hemp, linen, tencel, wool, recycled — no synthetic blends in mainline copy).

Evaluate this output. Does it match these operational facts, or does it violate them?

Fail signals: sale/discount language, "back in stock" / "while supplies last" urgency, naming a country/partner not in EF's actual chain, claiming materials EF doesn't use, claiming features that don't exist operationally.

STORY check

Prompt skeleton:

EF owns these narrative truths (verified, sourced): 17-year founder voice continuity; the artisan-partnership depth (5 countries, named); the painter-discipline source; Spring Revival (archive-favorites brought back); the "great honor" verb register; the seasonal cadence (summer Love Tòmas now → fall return → Holiday → spring drop).

EF does NOT own: family-as-marketing-content (Billy hard-rule, kids stay private); inferred-as-true content (no school names without verification); status-luxury narratives; trend-cycle stories.

Evaluate this output. Does it serve a story EF actually owns, or does it invent / borrow?

Fail signals: family-story references in client-facing copy; specific school claims; cultural-positioning claims (e.g., "American Neo-Gothic") used as a tagline rather than press-attributed citation; any other inferred-as-true content.

SURFACE check

Prompt skeleton:

EF's surface markers: lowercase body copy; "+" instead of "and" in tag phrasing; "honor" / "treasured" / "heirloom" as load-bearing vocabulary; three voice tiers ("e." intimate / "Emerson" formal / "thank you Emerson Fry" brand-assured); painter color naming (Roussillon, Marigold, Sharon's Flowers, Putty, Cocoa, etc.); natural-light imagery descriptions; restraint over expression.

Evaluate this output. Does it read as unmistakably EF on first encounter, or does it read as category-generic?

Fail signals: "and" instead of "+"; uppercase in body copy; standard fashion-DTC vocabulary; missing brand-color naming when colors are mentioned; voice-tier mismatch (e.g., "e." signature on a wholesale email).
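The four prompt skeletons above could be assembled into one combined prompt from a config-driven list of layer definitions. The sketch below is illustrative only: LayerDefinition, EF_LAYERS and buildCombinedPrompt are hypothetical names, and the questions and fail signals are condensed from this section rather than being the full prompt text.

```typescript
type LayerName = "SOUL" | "SYSTEM" | "STORY" | "SURFACE";

interface LayerDefinition {
  name: LayerName;
  question: string; // the evaluation question put to the judge
  failSignals: string[]; // abbreviated from the fail-signal lists above
}

// Condensed from this section; a real config would carry the full skeleton text.
const EF_LAYERS: LayerDefinition[] = [
  {
    name: "SOUL",
    question: "Does it respect the painter-discipline soul (see longer, name precisely, restrain)?",
    failSignals: ["hyperbole", "urgency", "status-coding"],
  },
  {
    name: "SYSTEM",
    question: "Does it match operational facts (no sales, made-to-order, named artisan partnerships, natural fibers only)?",
    failSignals: ["sale/discount language", "unlisted country or partner", "materials EF doesn't use"],
  },
  {
    name: "STORY",
    question: "Does it serve a story EF actually owns, or does it invent / borrow?",
    failSignals: ["family-story references", "specific school claims", "inferred-as-true content"],
  },
  {
    name: "SURFACE",
    question: "Does it read as unmistakably EF on first encounter?",
    failSignals: ["'and' instead of '+'", "standard fashion-DTC vocabulary", "voice-tier mismatch"],
  },
];

// Joins the layer definitions, the output under test, and the target surface into one prompt.
function buildCombinedPrompt(layers: LayerDefinition[], output: string, surface: string): string {
  const layerText = layers
    .map((l) => `${l.name}: ${l.question}\nFail signals: ${l.failSignals.join("; ")}`)
    .join("\n\n");
  return [
    layerText,
    `Output to evaluate:\n${output}`,
    `Target surface: ${surface}`,
    "For each layer, return PASS or FAIL + 1-sentence reasoning. Format: JSON.",
  ].join("\n\n");
}
```

Keeping the layer text in data rather than in the function body is one way to leave room for per-tenant prompts later.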

Implementation pattern — combined check

For latency + cost, single LLM call per output covering all four layers:

async function brandCoherenceCheck(output: string, surface: string, context: BrandContext): Promise<CoherenceResult> {
  const prompt = `[system prompt with all 4 layer definitions]
Output to evaluate:
${output}

Target surface: ${surface}

For each layer (SOUL / SYSTEM / STORY / SURFACE), return PASS or FAIL + 1-sentence reasoning.
Then return overall verdict (PASS only if all 4 PASS) + required-approval-tier (NONE / BRAND / FOUNDER) + suggested fix if FAIL.

Format: JSON.`;
  
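  // `sonnet` is a placeholder for the platform's LLM client; swap in Haiku if latency or cost favors it.
  const result = await sonnet({ prompt, maxTokens: 500 });
  try {
    return JSON.parse(result) as CoherenceResult;
  } catch (err) {
    // A malformed reply must not read as a silent pass; let the caller fail safe.
    throw new Error(`Coherence check returned non-JSON output: ${String(err)}`);
  }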
}
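Because the judge's reply is untrusted model text, it may be worth validating the parsed JSON before acting on it. The sketch below assumes a reply shape with a layers object and a suggestedFix field, both hypothetical key names; verdict and requiredApprovalTier follow the names used in the chat.ts snippet in this spec. It recomputes the overall verdict from the per-layer results rather than trusting the model's own summary, and (as an assumption) escalates a FAIL that arrives with tier NONE to BRAND.

```typescript
type LayerName = "SOUL" | "SYSTEM" | "STORY" | "SURFACE";
type Verdict = "PASS" | "FAIL";
type ApprovalTier = "NONE" | "BRAND" | "FOUNDER";

interface LayerResult {
  verdict: Verdict;
  reasoning: string;
}

interface CoherenceResult {
  layers: Record<LayerName, LayerResult>; // assumed key name
  verdict: Verdict;
  requiredApprovalTier: ApprovalTier;
  suggestedFix?: string; // assumed key name
}

const LAYERS: LayerName[] = ["SOUL", "SYSTEM", "STORY", "SURFACE"];
const TIERS: ApprovalTier[] = ["NONE", "BRAND", "FOUNDER"];

function parseCoherenceResult(raw: string): CoherenceResult {
  const data = JSON.parse(raw) as Record<string, unknown> | null; // throws on non-JSON
  if (typeof data !== "object" || data === null) throw new Error("judge reply is not an object");

  const rawLayers = (data.layers || {}) as Record<string, { verdict?: unknown; reasoning?: unknown } | undefined>;
  const layers = {} as Record<LayerName, LayerResult>;
  for (const name of LAYERS) {
    const layer = rawLayers[name];
    const v = layer ? layer.verdict : undefined;
    if (v !== "PASS" && v !== "FAIL") throw new Error(`missing or invalid verdict for ${name}`);
    const reasoning = layer && layer.reasoning !== undefined ? String(layer.reasoning) : "";
    layers[name] = { verdict: v as Verdict, reasoning };
  }

  // PASS only if all four layers pass, regardless of what the model claimed overall.
  const verdict: Verdict = LAYERS.every((n) => layers[n].verdict === "PASS") ? "PASS" : "FAIL";

  const tier = data.requiredApprovalTier as ApprovalTier;
  if (TIERS.indexOf(tier) === -1) throw new Error("invalid requiredApprovalTier");
  // Assumption: a failed output never ships without at least brand approval.
  const requiredApprovalTier: ApprovalTier = verdict === "FAIL" && tier === "NONE" ? "BRAND" : tier;

  const result: CoherenceResult = { layers, verdict, requiredApprovalTier };
  if (typeof data.suggestedFix === "string") result.suggestedFix = data.suggestedFix;
  return result;
}
```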

Estimated cost: ~$0.001-0.005 per check (Sonnet) or ~$0.0005-0.002 (Haiku). At 100 outputs/day that is ~$0.10-0.50/day on Sonnet or ~$0.05-0.20/day on Haiku. Negligible.

Estimated latency: 1-3 seconds. Acceptable for non-real-time approval workflows. For real-time chat outputs, run check async + flag retroactively (don't block stream).
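One way to run the check without blocking a real-time reply is to start it after the text has gone out and flag the generation when the verdict arrives. This is a sketch under assumptions: checkInBackground, CheckFn and FlagFn are hypothetical names, and the flag callback stands in for whatever mechanism marks a saved generation.

```typescript
type CheckFn = (text: string) => Promise<{ verdict: "PASS" | "FAIL" }>;
type FlagFn = (generationId: string, reason: string) => void;

// Starts the check and returns immediately with a promise the caller may ignore.
// The reply is never delayed; a FAIL or a check error is recorded after the fact.
function checkInBackground(
  generationId: string,
  text: string,
  check: CheckFn,
  flag: FlagFn,
): Promise<void> {
  return check(text)
    .then((result) => {
      if (result.verdict === "FAIL") flag(generationId, "coherence check failed");
    })
    .catch((err) => {
      // An errored check is flagged too, so it is not mistaken for a pass.
      flag(generationId, `coherence check errored: ${String(err)}`);
    });
}
```

The chat handler would send the reply first and then call checkInBackground without awaiting it; the returned promise is mainly useful in tests.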

Risk-tiered eager-vs-lazy eval

Not all outputs need the same check rigor.

High-risk surfaces (eager-eval, BLOCK on fail):

Email sends (Klaviyo bound)
Paid social ad creative
Press pitches
Generated imagery slated for external distribution

Medium-risk surfaces (eager-eval, FLAG on fail but allow override):

Organic social posts
Pinterest pins
Bundle / collection page copy
Internal newsletters

Low-risk surfaces (lazy-eval, async-flag):

Internal staging drafts
Generation history (already-saved)
Exploratory chat in /admin

The risk tier informs default approval flow:

High-risk + fail → must go to founder ("e." tier) or brand-lead (other tiers) approval before send
Medium-risk + fail → flag, suggest fix, allow human override
Low-risk + fail → log, no blocking action
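The tiering above reduces to a small lookup plus a gate function. A minimal sketch follows; the surface keys and function names are hypothetical, and treating an unknown surface as high-risk is an assumption (fail closed), not something this spec states.

```typescript
type RiskTier = "high" | "medium" | "low";
type EvalMode = "eager" | "lazy";
type GateAction = "ship" | "block" | "flag" | "log";

// Surface keys are illustrative; the tier assignments follow the lists above.
const SURFACE_RISK: Record<string, RiskTier> = {
  email: "high",
  paid_social: "high",
  press_pitch: "high",
  external_image: "high",
  organic_social: "medium",
  pinterest: "medium",
  collection_copy: "medium",
  internal_newsletter: "medium",
  staging_draft: "low",
  generation_history: "low",
  admin_chat: "low",
};

function riskTierFor(surface: string): RiskTier {
  // Assumption: unknown surfaces fail closed and are treated as high-risk.
  return Object.prototype.hasOwnProperty.call(SURFACE_RISK, surface)
    ? SURFACE_RISK[surface]
    : "high";
}

// High and medium risk are checked before anything ships; low risk is checked async.
function evalModeFor(tier: RiskTier): EvalMode {
  return tier === "low" ? "lazy" : "eager";
}

function gateAction(tier: RiskTier, passed: boolean): GateAction {
  if (passed) return "ship";
  if (tier === "high") return "block"; // must be approved before send
  if (tier === "medium") return "flag"; // suggest fix, allow human override
  return "log"; // no blocking action
}
```

For example, gateAction(riskTierFor("email"), false) gives "block", while the same failure on "pinterest" gives "flag".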

Integration with existing systems

chat.ts integration

Add a brandCoherenceCheck() step after generation completes + before storage:

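// in chat.ts handleMessage()
const generation = await remyGenerate(...);
// Awaited inline here for clarity; for streamed chat replies, run the check after the stream and flag retroactively.
const coherenceResult = await brandCoherenceCheck(generation.text, surface, brandContext);
// Keep the check result on the generation record so it can be reviewed later.
const recordWithCheck = {
  ...generation,
  brandCoherenceCheck: coherenceResult,
};
await saveGeneration(recordWithCheck);
// Return the verdict so the UI can route failed outputs to approval.
return responseToClient({
  text: generation.text,
  brandCoherence: coherenceResult.verdict,
  requiresApproval: coherenceResult.requiredApprovalTier !== 'NONE',
});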

Image-generation integration

Same pattern, but with a vision-capable model for the image evaluation. The prompt differs: the four layers translate to visual criteria:

SOUL: natural light? painterly composition? restrained?
SYSTEM: realistic scene that EF would actually produce? no off-aesthetic settings?
STORY: matches one of EF's owned narratives (Love Tòmas summer, atelier, farm, supply-chain country)?
SURFACE: EF's actual photographic register (soft warm tones, intimate, lifestyle-editorial, real-context)?

UI integration (already partially done)

Faber shipped an approve-for-use button in d272866. The four-layer check feeds into it: the button becomes the human-review step AFTER the auto-check has run. If the auto-check fails, the button is the "approve anyway" override (logged + reasoned).

Failure modes + recovery

Failure mode 1: LLM-as-judge false positive (judge says PASS on an output that should fail).

Detection: human review samples 5-10% of approved outputs; track LLM-vs-human agreement
Recovery: tighten the prompt with adversarial examples; downgrade to "approval required" for marginal cases

Failure mode 2: LLM-as-judge false negative (judge says FAIL on an output that should pass).

Detection: human approval-override rate exceeds 30%
Recovery: loosen specific layer checks; add explicit pass-examples to prompt
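The two detection signals for failure modes 1 and 2 are simple ratios over review data. A minimal sketch, assuming human reviewers record their own verdict next to the judge's; ReviewedSample and the function names are hypothetical, and the 0.3 threshold is the 30% override rate named above.

```typescript
type Verdict = "PASS" | "FAIL";

// One human-reviewed output: what the judge said vs. what the reviewer decided.
interface ReviewedSample {
  judge: Verdict;
  human: Verdict;
}

// Fraction of sampled outputs where judge and human agree; null when there is no data.
function agreementRate(samples: ReviewedSample[]): number | null {
  if (samples.length === 0) return null;
  const agree = samples.filter((s) => s.judge === s.human).length;
  return agree / samples.length;
}

// Failure mode 1: judge passed something a human would have failed.
function missedFailCount(samples: ReviewedSample[]): number {
  return samples.filter((s) => s.judge === "PASS" && s.human === "FAIL").length;
}

const OVERRIDE_THRESHOLD = 0.3;

// Failure mode 2: too many FAIL verdicts are being overridden by humans.
function judgeTooStrict(failVerdicts: number, overridden: number): boolean {
  if (failVerdicts === 0) return false;
  return overridden / failVerdicts > OVERRIDE_THRESHOLD;
}
```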

Failure mode 3: LLM-as-judge unavailable (Sonnet/Haiku timeout).

Detection: API error
Recovery: queue for retry; for high-risk surfaces, default to requires-approval (safe failure mode); for low-risk, log + allow ship
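The recovery rule for failure mode 3 could be wrapped around the judge call like this. It is a sketch under assumptions: checkWithFallback and SafeOutcome are hypothetical names, the 5-second default timeout is arbitrary, and medium-risk surfaces are treated like high-risk ones because the rule above only names high and low.

```typescript
type RiskTier = "high" | "medium" | "low";

interface SafeOutcome {
  checked: boolean; // false when the judge never produced a verdict
  requiresApproval: boolean;
  queuedForRetry: boolean;
  reason?: string;
}

// Runs the judge with a timeout. On timeout or API error, high-risk surfaces
// fall back to requires-approval; low-risk surfaces are logged and allowed to ship.
async function checkWithFallback(
  run: () => Promise<boolean>, // resolves true when all four layers pass
  tier: RiskTier,
  timeoutMs: number = 5000,
): Promise<SafeOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error("coherence check timed out")), timeoutMs);
  });
  try {
    const passed = await Promise.race([run(), timeout]);
    return { checked: true, requiresApproval: !passed, queuedForRetry: false };
  } catch (err) {
    // Assumption: medium-risk is handled like high-risk when the judge is unavailable.
    return {
      checked: false,
      requiresApproval: tier !== "low",
      queuedForRetry: true,
      reason: String(err),
    };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```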

Failure mode 4: Brand context drift over time.

Detection: monthly review — does the four-layer prompt still match current True North findings?
Recovery: scheduled monthly Mycelia/Lumen review of the prompt; version + diff each update

Test cases (initial — extend as patterns emerge)

// SHOULD PASS
"the eyelet maxi is back in soft cerulean for early summer. + matching scrunchie set, naturally."

// SHOULD FAIL (SYSTEM — sale)
"limited time — 25% off all Love Tòmas through Memorial Day weekend"

// SHOULD FAIL (STORY — family content + invented painter school)
"our founder, Emerson — trained at NY Studio School + Grand Central Atelier — wove this with her twin daughters in mind"

// SHOULD FAIL (SOUL — trend-chasing hyperbole)  
"the must-have, viral, internet-breaking eyelet maxi everyone's obsessed with"

// SHOULD FAIL (SURFACE — voice-tier mismatch)
// Body of wholesale-partner email signed "warmly, e." — should be unsigned or "thank you Emerson Fry"

// SHOULD PASS
"new in atelier — the layering jacket in beach linen. + a fresh round of the lee in chamoisee. honored to make these for you."

Add 10-20 more across SOUL / SYSTEM / STORY / SURFACE failure modes before declaring v1 ready.
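The cases above could live in a table-driven harness so that new ones are one line each. In this sketch the judge is injected as a function, which lets the harness run against the real check or a stub; CoherenceCase, JudgeOutcome and runCases are hypothetical names. The wholesale voice-tier case is left out because the spec describes it without literal copy.

```typescript
type Layer = "SOUL" | "SYSTEM" | "STORY" | "SURFACE";

interface CoherenceCase {
  text: string;
  expectPass: boolean;
  expectFailedLayer?: Layer;
}

interface JudgeOutcome {
  passed: boolean;
  failedLayers: Layer[];
}

// Texts copied from the test cases above.
const CASES: CoherenceCase[] = [
  { text: "the eyelet maxi is back in soft cerulean for early summer. + matching scrunchie set, naturally.", expectPass: true },
  { text: "limited time — 25% off all Love Tòmas through Memorial Day weekend", expectPass: false, expectFailedLayer: "SYSTEM" },
  { text: "our founder, Emerson — trained at NY Studio School + Grand Central Atelier — wove this with her twin daughters in mind", expectPass: false, expectFailedLayer: "STORY" },
  { text: "the must-have, viral, internet-breaking eyelet maxi everyone's obsessed with", expectPass: false, expectFailedLayer: "SOUL" },
  { text: "new in atelier — the layering jacket in beach linen. + a fresh round of the lee in chamoisee. honored to make these for you.", expectPass: true },
];

// Runs every case through the judge and returns one line per mismatch.
async function runCases(
  judge: (text: string) => Promise<JudgeOutcome>,
  cases: CoherenceCase[],
): Promise<string[]> {
  const mismatches: string[] = [];
  for (const c of cases) {
    const got = await judge(c.text);
    const layerOk =
      c.expectFailedLayer === undefined || got.failedLayers.indexOf(c.expectFailedLayer) !== -1;
    if (got.passed !== c.expectPass || !layerOk) {
      const want = c.expectPass ? "PASS" : `FAIL (${c.expectFailedLayer})`;
      mismatches.push(`expected ${want}, got ${got.passed ? "PASS" : "FAIL"}: ${c.text.slice(0, 40)}`);
    }
  }
  return mismatches;
}
```

An empty result means the judge agrees with every expectation; the mismatch lines are meant to be read by a human when tuning the prompt.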

Migration plan (phased)

Phase 1 (post-Thu reveal — Week 1-2 of engagement):

Build the middleware
Wire into chat.ts for text outputs only
Eager-eval on high-risk surfaces (email + paid)
Async-eval + log for everything else
Human approval-override path live + logged

Phase 2 (Month 2):

Extend to image generation
Add vision-model check for editorial imagery
Build the human-approval UI in /admin (founder dashboard showing flagged-for-approval queue)

Phase 3 (Month 3+):

Tune prompt weekly based on agreement-rate data
Add monthly Mycelia/Lumen review cycle
Consider escalating to per-layer separate checks if combined-check accuracy degrades

Open questions for Faber

Persistence schema — is the current generation-record schema in .data/generated/profiles/* extensible without migration, or does adding brandCoherenceCheck require a write-migration script?
LLM call routing — does the platform already have a generic llm.call(prompt) interface I can use, or do I add the API integration here?
Vision-model availability — is Gemini's multimodal API the right call for image-coherence-check, or does the project already use a different vision model?
Founder-approval UI — do you want to extend the existing /admin/profiles approve-button to become the founder-approval queue, or build separate /admin/approvals route?
Caching — should identical-output checks be cached (e.g., if Remy regenerates exact same text, skip re-eval)? Cheap to implement; worth it.
Override logging — when a user overrides a FAIL verdict, capture the override reason in the record. Required for v0.2 prompt-tuning.
Per-tenant configuration — if Tapt eventually serves multiple brands, the four-layer prompt needs per-tenant context. Make config-file-driven from the start to avoid refactor pain.

Pick the ones you want to discuss; ignore the rest. Spec is provisional pending your implementation eye.
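On the caching question, a minimal in-memory sketch might look like the following. CoherenceCache is a hypothetical name, and including the prompt version in the key is an assumption so that a prompt update invalidates old verdicts; a persistent store would need the same key.

```typescript
interface CachedVerdict {
  passed: boolean;
  failedLayers: string[];
}

// Caches verdicts per (prompt version, surface, exact output text).
class CoherenceCache {
  private store = new Map<string, CachedVerdict>();
  private promptVersion: string;

  constructor(promptVersion: string) {
    this.promptVersion = promptVersion;
  }

  private key(output: string, surface: string): string {
    // Serializing the tuple avoids collisions that a plain delimiter could cause.
    return JSON.stringify([this.promptVersion, surface, output]);
  }

  // Returns the cached verdict for identical input, otherwise computes and stores it.
  async getOrCompute(
    output: string,
    surface: string,
    compute: () => Promise<CachedVerdict>,
  ): Promise<CachedVerdict> {
    const k = this.key(output, surface);
    const hit = this.store.get(k);
    if (hit !== undefined) return hit;
    const fresh = await compute();
    this.store.set(k, fresh);
    return fresh;
  }
}
```

The surface is part of the key because the same text can pass on one surface and fail on another (the voice-tier mismatch case).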

Why this matters for Tapt (and the architecture-stewardship pricing frame)

Without the coherence-check middleware, "Remy distributes content in your voice" is what we're selling. With the middleware, we're selling "Remy protects + amplifies the architecture that produces your brand premium." That's the difference between agency-retainer pricing + premium-stewardship pricing.

It's also the difference between getting an F if Ryan or Emerson catch a wrong-voice output, vs the system gating that output before it ships. The middleware is the engineering manifestation of the brand-architecture frame.

Build it post-Thu when there's calendar space + the engagement is grounded.

— Mycelia, 2026-05-26 03:45 ET