Remy Four-Layer Brand Coherence Check — Implementation Spec v0.1
Filed: 2026-05-26 by Mycelia
For: Faber — implementation-ready engineering spec
Grounded in: BRAND_OPERATING_v0.2.md §6.3 (Remy fourth-layer coherence check) + §6.4 (Model Profile coherence-gate stage)
Status: v0.1 — Faber edits in place if scope reads wrong; ship updated version after build
Priority: Post-Thu reveal. Not Thursday-blocking. This is Y1 architecture work that becomes Day-30+ in the engagement.
What it is + why
Right now Remy validates outputs against voice rules (lexical patterns + tier mappings + ban list). That's surface-layer only. A Remy output can follow every voice rule and still violate the brand's soul (saying something the painter would never say), the system (offering a sale during a "no sales ever" brand), or the story (inventing artisan-partnership details that don't exist).
The four-layer coherence check is a middleware that runs after generation and before external-surface ship. It asks four questions per output:
- SOUL — does this respect the originating conviction (painter-discipline / restraint / curatorial-taste)?
- SYSTEM — does this match operational reality (no-sale, made-to-order, named-artisan-partnerships, natural-fibers-only)?
- STORY — does this serve a narrative the brand actually owns? (not invented, not borrowed-from-category)
- SURFACE — does this read as unmistakably EF on first encounter? (voice rules, painter colors, "+", three voice tiers, natural-light register)
Output is gated by check result. Fail any layer → human review (founder approval for "e." tier, brand-approval for everything else). Pass all four → ship-eligible.
The check is the discipline. Volume amplifies whatever signal you produce; the check prevents drift-at-scale.
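The gate rule above can be sketched as a small pure function. This is a minimal sketch, not existing Remy code: the names Layer, VoiceTier, Approver, and gate are hypothetical, and it assumes the output's voice tier is known at check time.

```typescript
// Hypothetical types; none of these names exist in the codebase yet.
type Layer = "SOUL" | "SYSTEM" | "STORY" | "SURFACE";
type VoiceTier = "e." | "Emerson" | "thank you Emerson Fry";
type Approver = "NONE" | "BRAND" | "FOUNDER";
type LayerVerdicts = Record<Layer, "PASS" | "FAIL">;

interface GateDecision {
  shipEligible: boolean;
  failedLayers: Layer[];
  approver: Approver;
}

const LAYERS: Layer[] = ["SOUL", "SYSTEM", "STORY", "SURFACE"];

// Fail any layer -> human review: "e." tier routes to the founder,
// every other tier routes to brand approval. Pass all four -> ship-eligible.
function gate(verdicts: LayerVerdicts, tier: VoiceTier): GateDecision {
  const failedLayers = LAYERS.filter((layer) => verdicts[layer] === "FAIL");
  if (failedLayers.length === 0) {
    return { shipEligible: true, failedLayers, approver: "NONE" };
  }
  return {
    shipEligible: false,
    failedLayers,
    approver: tier === "e." ? "FOUNDER" : "BRAND",
  };
}
```

Keeping the routing rule separate from the LLM call means it can be unit-tested without a model in the loop.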
Where it slots in (architecture)
Current pipeline:
user_request → Remy.generate() → output → external surface (email, social, etc.)
New pipeline:
user_request → Remy.generate() → output → BrandCoherenceCheck.evaluate() → {pass | flag | hold} → human-review-gate (if needed) → external surface
Implementation surface: new middleware module lib/agent/brand-coherence-check.ts (or similar). Called by chat.ts after generation; called by lib/image/generate.ts (or wherever image generation lives) after image production.
Persistence: add brandCoherenceCheck: { passed: boolean, failedLayers: string[], reasoning: string, requiresApproval: boolean } to the existing generation record schema. Already-saved generations get the field on next access (default to null = not-yet-checked).
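A sketch of the persisted field and the default-on-access behavior, assuming records are plain JSON objects. GenerationRecord here is a hypothetical stand-in for the real schema, and normalizeOnAccess is an illustrative name.

```typescript
// Field shape copied from the persistence note above.
interface BrandCoherenceCheck {
  passed: boolean;
  failedLayers: string[];
  reasoning: string;
  requiresApproval: boolean;
}

// Hypothetical stand-in for the existing generation record; the real
// schema carries more fields than this.
interface GenerationRecord {
  id: string;
  text: string;
  brandCoherenceCheck?: BrandCoherenceCheck | null;
}

type CheckedRecord = GenerationRecord & {
  brandCoherenceCheck: BrandCoherenceCheck | null;
};

// Already-saved records have no field at all; on access, treat that as
// null = not-yet-checked instead of rewriting every file up front.
function normalizeOnAccess(record: GenerationRecord): CheckedRecord {
  return { ...record, brandCoherenceCheck: record.brandCoherenceCheck ?? null };
}
```

The chat.ts snippet later in this spec reads verdict and requiredApprovalTier from the check result, so either this persisted shape or that snippet would need to change before build.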
The four checks — operationally
Each check is an LLM-as-judge call to a fast model (Sonnet or Haiku — economics + latency favor Haiku for these short structured judgments). Single prompt per check OR one combined prompt per output (combined is cheaper; single-check-per-call is more debuggable). Recommend combined initially; split if reasoning quality drops.
SOUL check
Prompt skeleton:
Given EF's foundational soul (Emerson trained as a classical painter; the brand's discipline is "see longer, name precisely, restrain"), evaluate the following output. Does it respect this soul, or does it violate it (loud, trend-chasing, hyperbolic, status-coded)?
Reply with: PASS / FAIL + brief reasoning.
Fail signals to watch: hyperbole ("game-changing," "revolutionary"), urgency ("don't miss," "limited"), status-coding ("exclusive," "elite"), unrestraint, off-painterly metaphors.
SYSTEM check
Prompt skeleton:
EF's operational system has these structural rules: no sales ever (Memorial Day / July 4 / Labor Day / Black Friday all out); made-to-order discipline (sold-out is default state); named-artisan-partnerships (Italy / Portugal / Peru / USA / Rajasthan — these are the only countries, named relationships); natural-fibers-only (organic cotton, hemp, linen, tencel, wool, recycled — no synthetic blends in mainline copy).
Evaluate this output. Does it match these operational facts, or does it violate them?
Fail signals: sale/discount language, "back in stock" / "while supplies last" urgency, naming a country/partner not in EF's actual chain, claiming materials EF doesn't use, claiming features that don't exist operationally.
STORY check
Prompt skeleton:
EF owns these narrative truths (verified, sourced): 17-year founder voice continuity; the artisan-partnership depth (5 countries, named); the painter-discipline source; Spring Revival (archive-favorites brought back); the "great honor" verb register; the seasonal cadence (summer Love Tòmas now → fall return → Holiday → spring drop).
EF does NOT own: family-as-marketing-content (Billy hard-rule, kids stay private); inferred-as-true content (no school names without verification); status-luxury narratives; trend-cycle stories.
Evaluate this output. Does it serve a story EF actually owns, or does it invent / borrow?
Fail signals: family-story references in client-facing copy; specific school claims; cultural-positioning claims (e.g., "American Neo-Gothic") used as a tagline rather than press-attributed citation; any other inferred-as-true content.
SURFACE check
Prompt skeleton:
EF's surface markers: lowercase body copy; "+" instead of "and" in tag phrasing; "honor" / "treasured" / "heirloom" as load-bearing vocabulary; three voice tiers ("e." intimate / "Emerson" formal / "thank you Emerson Fry" brand-assured); painter color naming (Roussillon, Marigold, Sharon's Flowers, Putty, Cocoa, etc.); natural-light imagery descriptions; restraint over expression.
Evaluate this output. Does it read as unmistakably EF on first encounter, or does it read as category-generic?
Fail signals: "and" instead of "+"; uppercase copy in body sections; standard fashion-DTC vocabulary; missing brand-color naming when colors are mentioned; voice-tier mismatch (e.g., "e." signature on a wholesale email).
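One way to hold the four skeletons is as data, so the combined prompt is assembled from config, not hard-coded. A minimal sketch: LayerDefinition, EF_LAYERS, and buildLayerPrompt are hypothetical names, and the question / fail-signal text is abbreviated from the skeletons above.

```typescript
type LayerName = "SOUL" | "SYSTEM" | "STORY" | "SURFACE";

interface LayerDefinition {
  name: LayerName;
  question: string;
  failSignals: string[];
}

// Abbreviated from the prompt skeletons above; a real config would carry
// the full skeleton text per layer.
const EF_LAYERS: LayerDefinition[] = [
  {
    name: "SOUL",
    question: "Does it respect the soul (see longer, name precisely, restrain), or violate it?",
    failSignals: ["hyperbole", "urgency", "status-coding", "unrestraint", "off-painterly metaphors"],
  },
  {
    name: "SYSTEM",
    question: "Does it match the operational facts, or violate them?",
    failSignals: ["sale/discount language", "stock urgency", "unlisted country or partner", "materials EF doesn't use"],
  },
  {
    name: "STORY",
    question: "Does it serve a story EF actually owns, or does it invent / borrow?",
    failSignals: ["family-story references", "specific school claims", "inferred-as-true content"],
  },
  {
    name: "SURFACE",
    question: "Does it read as unmistakably EF on first encounter, or as category-generic?",
    failSignals: ['"and" instead of "+"', "uppercase body copy", "voice-tier mismatch"],
  },
];

// Joins the layer sections into one block for the combined-check prompt.
function buildLayerPrompt(layers: LayerDefinition[]): string {
  const sections = layers.map(
    (layer) => `${layer.name}\n${layer.question}\nFail signals: ${layer.failSignals.join("; ")}`,
  );
  return [...sections, "Reply with: PASS / FAIL + brief reasoning, per layer."].join("\n\n");
}
```

Holding the layers as data also makes per-layer split calls cheap later: pass a one-element array to the same builder.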
Implementation pattern — combined check
For latency + cost, single LLM call per output covering all four layers:
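async function brandCoherenceCheck(output: string, surface: string, context: BrandContext): Promise<CoherenceResult> {
  // NOTE: `context` is not interpolated yet; the bracketed placeholder below
  // stands in for the brand-specific system prompt built from it.
  const prompt = `[system prompt with all 4 layer definitions]
Output to evaluate:
${output}
Target surface: ${surface}
For each layer (SOUL / SYSTEM / STORY / SURFACE), return PASS or FAIL + 1-sentence reasoning.
Then return overall verdict (PASS only if all 4 PASS) + required-approval-tier (NONE / BRAND / FOUNDER) + suggested fix if FAIL.
Format: JSON.`;
  // sonnet() is a placeholder model call; swap in the Haiku equivalent if cost + latency win out.
  const result = await sonnet({ prompt, maxTokens: 500 });
  // JSON.parse throws on malformed model output; callers should treat that as a failed check run.
  return JSON.parse(result) as CoherenceResult;
}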
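The JSON that comes back from the judge is untrusted text, so it is worth validating before it reaches the gate. A minimal sketch under stated assumptions: verdict and requiredApprovalTier match the names used in the chat.ts snippet, while layers, suggestedFix, and parseCoherenceResult are hypothetical. It recomputes the overall verdict from the per-layer verdicts so "PASS only if all 4 PASS" does not depend on the model's own summary.

```typescript
type Layer = "SOUL" | "SYSTEM" | "STORY" | "SURFACE";
type Verdict = "PASS" | "FAIL";
type ApprovalTier = "NONE" | "BRAND" | "FOUNDER";

interface CoherenceResult {
  layers: Record<Layer, { verdict: Verdict; reasoning: string }>;
  verdict: Verdict;
  requiredApprovalTier: ApprovalTier;
  suggestedFix: string | null;
}

const LAYER_NAMES: Layer[] = ["SOUL", "SYSTEM", "STORY", "SURFACE"];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}
function isVerdict(value: unknown): value is Verdict {
  return value === "PASS" || value === "FAIL";
}
function isTier(value: unknown): value is ApprovalTier {
  return value === "NONE" || value === "BRAND" || value === "FOUNDER";
}

// Throws on malformed output so the caller can treat it like "judge unavailable".
function parseCoherenceResult(raw: string): CoherenceResult {
  const data: unknown = JSON.parse(raw);
  if (!isRecord(data)) throw new Error("judge output is not a JSON object");
  const rawLayers = data["layers"];
  if (!isRecord(rawLayers)) throw new Error("judge output has no layers object");

  const layers = {} as CoherenceResult["layers"];
  for (const name of LAYER_NAMES) {
    const entry = rawLayers[name];
    if (!isRecord(entry)) throw new Error(`missing layer ${name}`);
    const verdict = entry["verdict"];
    if (!isVerdict(verdict)) throw new Error(`bad verdict for layer ${name}`);
    const reasoning = entry["reasoning"];
    layers[name] = { verdict, reasoning: typeof reasoning === "string" ? reasoning : "" };
  }

  const tier = data["requiredApprovalTier"];
  if (!isTier(tier)) throw new Error("bad requiredApprovalTier");
  const fix = data["suggestedFix"];

  // Recompute the overall verdict instead of trusting the model's summary.
  const allPass = LAYER_NAMES.every((name) => layers[name].verdict === "PASS");
  return {
    layers,
    verdict: allPass ? "PASS" : "FAIL",
    requiredApprovalTier: tier,
    suggestedFix: typeof fix === "string" ? fix : null,
  };
}
```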
Estimated cost: ~$0.001-0.005 per check (Sonnet) or ~$0.0005-0.002 (Haiku). At 100 outputs/day that is ~$0.10-0.50/day on Sonnet, or ~$0.05-0.20/day on Haiku. Negligible.
Estimated latency: 1-3 seconds. Acceptable for non-real-time approval workflows. For real-time chat outputs, run check async + flag retroactively (don't block stream).
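For the real-time path, "run async + flag retroactively" can be sketched like this. The hook names (deliver, check, flag, logError) and deliverThenCheck are hypothetical; the point is only the ordering: the text goes out first and the verdict arrives later.

```typescript
type Verdict = "PASS" | "FAIL";

// All four hooks are hypothetical injection points, not existing functions.
interface AsyncCheckHooks {
  deliver: (text: string) => void;            // push text to the stream
  check: (text: string) => Promise<Verdict>;  // the coherence judge call
  flag: (text: string) => void;               // retroactive flag on FAIL
  logError: (err: unknown) => void;           // judge errors are logged, not thrown
}

// Ship first, judge second: the stream is never held up by the check.
function deliverThenCheck(text: string, hooks: AsyncCheckHooks): Promise<void> {
  hooks.deliver(text);
  return hooks.check(text).then(
    (verdict) => {
      if (verdict === "FAIL") hooks.flag(text);
    },
    (err: unknown) => hooks.logError(err),
  );
}
```

The caller can ignore the returned promise in production; it exists mainly so tests can wait for the flag step.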
Risk-tiered eager-vs-lazy eval
Not all outputs need the same check rigor.
High-risk surfaces (eager-eval, BLOCK on fail):
- Email sends (Klaviyo bound)
- Paid social ad creative
- Press pitches
- Generated imagery slated for external distribution
Medium-risk surfaces (eager-eval, FLAG on fail but allow override):
- Organic social posts
- Pinterest pins
- Bundle / collection page copy
- Internal newsletters
Low-risk surfaces (lazy-eval, async-flag):
- Internal staging drafts
- Generation history (already-saved)
- Exploratory chat in /admin
The risk tier informs default approval flow:
- High-risk + fail → must go to founder ("e." tier) or brand-lead (other tiers) approval before send
- Medium-risk + fail → flag, suggest fix, allow human override
- Low-risk + fail → log, no blocking action
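The tier table and default flows above reduce to a lookup plus a switch. A minimal sketch: the surface keys and the names SURFACE_RISK and resolveAction are hypothetical, and treating an unlisted surface as high-risk is an assumption, not something this spec decides.

```typescript
type RiskTier = "HIGH" | "MEDIUM" | "LOW";
type GateAction = "SHIP_ELIGIBLE" | "BLOCK_FOR_APPROVAL" | "FLAG_ALLOW_OVERRIDE" | "LOG_ONLY";

// Surface keys are hypothetical identifiers for the surfaces listed above.
const SURFACE_RISK = new Map<string, RiskTier>([
  ["email", "HIGH"],
  ["paid-social", "HIGH"],
  ["press-pitch", "HIGH"],
  ["external-imagery", "HIGH"],
  ["organic-social", "MEDIUM"],
  ["pinterest", "MEDIUM"],
  ["collection-copy", "MEDIUM"],
  ["internal-newsletter", "MEDIUM"],
  ["staging-draft", "LOW"],
  ["generation-history", "LOW"],
  ["admin-chat", "LOW"],
]);

function resolveAction(surface: string, passed: boolean): GateAction {
  if (passed) return "SHIP_ELIGIBLE";
  // Assumption: an unlisted surface is treated as high-risk (safest default).
  const tier = SURFACE_RISK.get(surface) ?? "HIGH";
  switch (tier) {
    case "HIGH":
      return "BLOCK_FOR_APPROVAL"; // founder or brand-lead approval before send
    case "MEDIUM":
      return "FLAG_ALLOW_OVERRIDE"; // flag, suggest fix, human may override
    case "LOW":
      return "LOG_ONLY"; // log, no blocking action
  }
}
```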
Integration with existing systems
chat.ts integration
Add a brandCoherenceCheck() step after the generation completes + before storage:
// in chat.ts handleMessage()
const generation = await remyGenerate(...);
const coherenceResult = await brandCoherenceCheck(generation.text, surface, brandContext);
const recordWithCheck = {
  ...generation,
  brandCoherenceCheck: coherenceResult,
};
await saveGeneration(recordWithCheck);
return responseToClient({
  text: generation.text,
  brandCoherence: coherenceResult.verdict,
  requiresApproval: coherenceResult.requiredApprovalTier !== 'NONE',
});
Image-generation integration
Same pattern but with vision-capable model for the image evaluation. Different prompt — the four layers translate to visual:
- SOUL: natural light? painterly composition? restrained?
- SYSTEM: realistic scene that EF would actually produce? no off-aesthetic settings?
- STORY: matches one of EF's owned narratives (Love Tòmas summer, atelier, farm, supply-chain country)?
- SURFACE: EF's actual photographic register (soft warm tones, intimate, lifestyle-editorial, real-context)?
UI integration (already partially done)
Faber shipped approve-for-use button in d272866. The four-layer check feeds into this — the button now becomes the human-review step AFTER auto-check has run. If auto-check fails, the button is the "approve anyway" override (logged + reasoned).
Failure modes + recovery
Failure mode 1: LLM-as-judge false positive (says PASS when the output actually fails).
- Detection: human review samples 5-10% of approved outputs; track LLM-vs-human agreement
- Recovery: tighten the prompt with adversarial examples; downgrade to "approval required" for marginal cases
Failure mode 2: LLM-as-judge false negative (says FAIL when the output actually passes).
- Detection: human approval-override rate exceeds 30%
- Recovery: loosen specific layer checks; add explicit pass-examples to prompt
Failure mode 3: LLM-as-judge unavailable (Sonnet/Haiku timeout).
- Detection: API error
- Recovery: queue for retry; for high-risk surfaces, default to requires-approval (safe failure mode); for low-risk, log + allow ship
Failure mode 4: Brand context drift over time.
- Detection: monthly review — does the four-layer prompt still match current True North findings?
- Recovery: scheduled monthly Mycelia/Lumen review of the prompt; version + diff each update
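The detection signals for failure modes 1-3 are simple enough to sketch as helpers. All names here are hypothetical. The 30% threshold comes from failure mode 2. Treating medium-risk surfaces like high-risk when the judge is down is an assumption, since this spec only specifies the high and low cases.

```typescript
type Verdict = "PASS" | "FAIL";
type RiskTier = "HIGH" | "MEDIUM" | "LOW";

interface ReviewSample {
  judge: Verdict; // what the LLM judge said
  human: Verdict; // what the human reviewer said on the sampled output
}

// Failure mode 1: share of sampled outputs where judge and human agree.
// Returns null when there is nothing sampled yet.
function agreementRate(samples: ReviewSample[]): number | null {
  if (samples.length === 0) return null;
  const agreed = samples.filter((s) => s.judge === s.human).length;
  return agreed / samples.length;
}

// Failure mode 2: alert when humans override more than 30% of FAIL verdicts.
const OVERRIDE_ALERT_THRESHOLD = 0.3;

function overrideRateTooHigh(failCount: number, overriddenCount: number): boolean {
  if (failCount === 0) return false;
  return overriddenCount / failCount > OVERRIDE_ALERT_THRESHOLD;
}

// Failure mode 3: safe default when the judge call errors or times out.
// Assumption: MEDIUM is handled like HIGH; the spec only names HIGH and LOW.
function onJudgeUnavailable(tier: RiskTier): "REQUIRES_APPROVAL" | "LOG_AND_SHIP" {
  return tier === "LOW" ? "LOG_AND_SHIP" : "REQUIRES_APPROVAL";
}
```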
Test cases (initial — extend as patterns emerge)
// SHOULD PASS
"the eyelet maxi is back in soft cerulean for early summer. + matching scrunchie set, naturally."
// SHOULD FAIL (SYSTEM — sale)
"limited time — 25% off all Love Tòmas through Memorial Day weekend"
// SHOULD FAIL (STORY — family content + invented painter school)
"our founder, Emerson — trained at NY Studio School + Grand Central Atelier — wove this with her twin daughters in mind"
// SHOULD FAIL (SOUL — trend-chasing hyperbole)
"the must-have, viral, internet-breaking eyelet maxi everyone's obsessed with"
// SHOULD FAIL (SURFACE — voice-tier mismatch)
// Body of wholesale-partner email signed "warmly, e." — should be unsigned or "thank you Emerson Fry"
// SHOULD PASS
"new in atelier — the layering jacket in beach linen. + a fresh round of the lee in chamoisee. honored to make these for you."
Add 10-20 more across SOUL / SYSTEM / STORY / SURFACE failure modes before declaring v1 ready.
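The cases above fit a table-driven harness, so the planned 10-20 extra cases become one-line additions. A minimal sketch: CoherenceCase, Judge, and runCases are hypothetical names, the "email" surface on every case is an assumption (the cases above do not name one), and the voice-tier mismatch case is left out because it is described without fixture text.

```typescript
type Layer = "SOUL" | "SYSTEM" | "STORY" | "SURFACE";

interface CoherenceCase {
  text: string;
  surface: string;     // assumption: not specified in the cases above
  expectFail: Layer[]; // empty array = should pass
}

const CASES: CoherenceCase[] = [
  { text: "the eyelet maxi is back in soft cerulean for early summer. + matching scrunchie set, naturally.", surface: "email", expectFail: [] },
  { text: "limited time — 25% off all Love Tòmas through Memorial Day weekend", surface: "email", expectFail: ["SYSTEM"] },
  { text: "our founder, Emerson — trained at NY Studio School + Grand Central Atelier — wove this with her twin daughters in mind", surface: "email", expectFail: ["STORY"] },
  { text: "the must-have, viral, internet-breaking eyelet maxi everyone's obsessed with", surface: "email", expectFail: ["SOUL"] },
  { text: "new in atelier — the layering jacket in beach linen. + a fresh round of the lee in chamoisee. honored to make these for you.", surface: "email", expectFail: [] },
];

// Resolves to the list of layers the judge failed for this output.
type Judge = (text: string, surface: string) => Promise<Layer[]>;

// Returns one line per mismatch; an empty array means every case behaved.
async function runCases(cases: CoherenceCase[], judge: Judge): Promise<string[]> {
  const mismatches: string[] = [];
  for (const c of cases) {
    const failed = await judge(c.text, c.surface);
    const missed = c.expectFail.filter((layer) => failed.indexOf(layer) === -1);
    const falseAlarm = c.expectFail.length === 0 && failed.length > 0;
    if (missed.length > 0 || falseAlarm) {
      mismatches.push(
        `${c.text.slice(0, 40)}: expected [${c.expectFail.join(",")}], got [${failed.join(",")}]`,
      );
    }
  }
  return mismatches;
}
```

A should-fail case counts as matched if the expected layer is among the failed layers, so an output that trips extra layers is not reported as a mismatch.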
Migration plan (phased)
Phase 1 (post-Thu reveal — Week 1-2 of engagement):
- Build the middleware
- Wire into chat.ts for text outputs only
- Eager-eval on high-risk surfaces (email + paid)
- Async-eval + log for everything else
- Human approval-override path live + logged
Phase 2 (Month 2):
- Extend to image generation
- Add vision-model check for editorial imagery
- Build the human-approval UI in /admin (founder dashboard showing flagged-for-approval queue)
Phase 3 (Month 3+):
- Tune prompt weekly based on agreement-rate data
- Add monthly Mycelia/Lumen review cycle
- Consider escalating to per-layer separate checks if combined-check accuracy degrades
Open questions for Faber
- Persistence schema — is the current generation-record schema in .data/generated/profiles/* extensible without migration, or does adding brandCoherenceCheck require a write-migration script?
- LLM call routing — does the platform already have a generic llm.call(prompt) interface I can use, or do I add the API integration here?
- Vision-model availability — is Gemini's multimodal API the right call for image-coherence-check, or does the project already use a different vision model?
- Founder-approval UI — do you want to extend the existing /admin/profiles approve-button to become the founder-approval queue, or build separate /admin/approvals route?
- Caching — should identical-output checks be cached (e.g., if Remy regenerates exact same text, skip re-eval)? Cheap to implement; worth it.
- Override logging — when a user overrides a FAIL verdict, capture the override reason in the record. Required for v0.2 prompt-tuning.
- Per-tenant configuration — if Tapt eventually serves multiple brands, the four-layer prompt needs per-tenant context. Make config-file-driven from the start to avoid refactor pain.
Pick the ones you want to discuss; ignore the rest. Spec is provisional pending your implementation eye.
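On the caching question, a minimal in-memory sketch of what "skip re-eval on identical text" could look like. CoherenceCheckCache and CachedCheck are hypothetical names. Including a prompt version in the key is an assumption, added so that a tuned prompt does not serve verdicts produced by the old one.

```typescript
interface CachedCheck {
  passed: boolean;
  failedLayers: string[];
}

class CoherenceCheckCache {
  private readonly store = new Map<string, CachedCheck>();

  // Same text on a different surface, or under a different prompt version,
  // is a different check, so all three go into the key.
  private static key(promptVersion: string, surface: string, output: string): string {
    return JSON.stringify([promptVersion, surface, output]);
  }

  get(promptVersion: string, surface: string, output: string): CachedCheck | undefined {
    return this.store.get(CoherenceCheckCache.key(promptVersion, surface, output));
  }

  set(promptVersion: string, surface: string, output: string, result: CachedCheck): void {
    this.store.set(CoherenceCheckCache.key(promptVersion, surface, output), result);
  }
}
```

An in-process Map resets on restart; whether the cache needs to survive restarts is part of the same open question.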
Why this matters for Tapt (and the architecture-stewardship pricing frame)
Without the coherence-check middleware, "Remy distributes content in your voice" is what we're selling. With the middleware, we're selling "Remy protects + amplifies the architecture that produces your brand premium." That's the difference between agency-retainer pricing + premium-stewardship pricing.
It's also the difference between getting an F if Ryan or Emerson catch a wrong-voice output, vs the system gating that output before it ships. The middleware is the engineering manifestation of the brand-architecture frame.
Build it post-Thu when there's calendar space + the engagement is grounded.
— Mycelia, 2026-05-26 03:45 ET