The Voice Memo → AI Loop — The Operator Stack

The Promise

Turn the 100+ small thoughts an operator has each day into a structured stream that feeds their AI stack — with five seconds of effort per capture.

The One-Sentence Setup

Most operators think their capture problem is discipline; it's actually friction — and the human voice is the lowest-friction capture surface they already carry.

The Core Insight

Operators have a hundred small thoughts a day that never get captured because writing them down has too much friction. Voice memos have near-zero friction — but raw audio doesn't feed an AI workflow. The Loop closes that gap: voice in, transcript out, classified, routed to the right file in the Operator Vault, audio archived. Five seconds of operator effort produces a structured artifact the rest of the Stack can consume. The win is not the transcript — it's that decisions, todos, people updates, and ideas all land in the surfaces where Reason + Act + Remember can already see them.

The Mechanism — The 5-Stage Loop

1. Capture (5 seconds of operator time)

What: speak the thought into the phone or watch the moment it lands.

How: iPhone Voice Memos or an Apple Watch face complication tied to a single Shortcut. The operator says "decision: X over Y because Z," or "follow up with NAME about THING by FRIDAY," or "new project idea: NAME - DESCRIPTION." No app switching. No typing. No thinking about where it goes.

Miss this and: the thought evaporates within 90 seconds. The whole Loop is downstream of capture friction: if step one takes more than five seconds, the operator stops capturing within a week.

2. Transcribe (local, 5-15 seconds)

What: convert audio to text on the operator's own machine.

How: Whisper running locally, either openai-whisper (Python, free) or whisper.cpp (faster on Apple Silicon). The audio never leaves the device. The Vocal Coach pipeline does this in ~3 seconds for a 30-second clip on an M-series Mac using the tiny model. Step up to base if names matter, and to small only if the operator is dictating technical jargon.

Miss this and: audio goes to a cloud API (including OpenAI's hosted Whisper), so recordings of business decisions, people, and clients leave the operator's control. Local-first is non-negotiable here.
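A minimal sketch of the local transcription step, assuming the openai-whisper Python package and ffmpeg are installed. The helper names `check_model` and `transcribe_memo` are hypothetical; `whisper.load_model` and `model.transcribe` are the package's own calls. The allowed-model list simply mirrors the tiny/base/small guidance above.

```python
from pathlib import Path

# Model names taken from the text; anything bigger is deliberately excluded.
ALLOWED_MODELS = ("tiny", "base", "small")


def check_model(name: str) -> str:
    """Return the model name if it is one the Loop recommends, else raise."""
    if name not in ALLOWED_MODELS:
        raise ValueError(f"unsupported model {name!r}; use one of {ALLOWED_MODELS}")
    return name


def transcribe_memo(audio_path, model_name: str = "tiny") -> str:
    """Transcribe one short memo locally and return the plain text."""
    check_model(model_name)
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(path)

    # Imported lazily so the checks above work without the package installed.
    import whisper  # openai-whisper; decodes audio through ffmpeg

    model = whisper.load_model(model_name)
    result = model.transcribe(str(path))
    return result["text"].strip()
```

Calling `transcribe_memo("inbox/memo_001.m4a")` would return the transcript as a string (the path is illustrative). A long-running pipeline would probably load the model once and reuse it rather than reloading it per memo.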

3. Classify (Claude API call, ~1 second, ~$0.001)

What: route the transcript by intent.

How: a short Claude Sonnet 4.6 call with a tight system prompt: "classify this voice memo into one of [decision, todo, people, idea, note], extract any names/dates/projects mentioned, and return the file path it should append to." The persona is 50 words max. Output is structured JSON. Cost per memo: about a tenth of a cent.

Miss this and: transcripts dumped into one giant Inbox.md become the same notes-app graveyard the operator was trying to escape. Classification is what turns audio into a workflow.
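The classification step can be sketched as a prompt plus a strict check on the JSON that comes back. This is an illustration under assumptions, not the author's implementation: the JSON keys (`category`, `names`, `dates`, `projects`, `path`) are invented for the example, and `ask_model` is a hypothetical callable standing in for whatever wraps the actual Claude API call.

```python
import json
from typing import Callable

CATEGORIES = ("decision", "todo", "people", "idea", "note")

# Hypothetical system prompt, adapted from the wording in the text.
SYSTEM_PROMPT = (
    "Classify this voice memo into one of "
    "[decision, todo, people, idea, note], extract any names, dates, and "
    "projects mentioned, and return the file path it should append to. "
    'Reply with JSON only, using the keys "category", "names", "dates", '
    '"projects", and "path".'
)


def parse_classification(raw: str) -> dict:
    """Validate the model's JSON reply; raise ValueError on anything off-spec."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("category") not in CATEGORIES:
        raise ValueError(f"unknown category: {data.get('category')!r}")
    for key in ("names", "dates", "projects"):
        if not isinstance(data.setdefault(key, []), list):
            raise ValueError(f"{key} must be a list")
    if not isinstance(data.get("path"), str) or not data["path"]:
        raise ValueError("path must be a non-empty string")
    return data


def classify_memo(transcript: str, ask_model: Callable[[str, str], str]) -> dict:
    """Send (system prompt, transcript) to the model and validate the reply."""
    return parse_classification(ask_model(SYSTEM_PROMPT, transcript))
```

Rejecting off-spec replies here means a malformed classification never reaches the vault; the pipeline can retry it or drop the memo into the Inbox instead.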

4. Route (filesystem write, instant)

What: append the classified transcript to the correct file in the Operator Vault.

How: decisions append to 30_Decisions/YYYY-MM-DD_decisions.md. Todos append to the active project note in 10_Projects/. People updates append to the matching 20_People/<name>.md. Ideas land in 00_Inbox/ for Sunday triage. Notes land in the matching project file, or in the Inbox if ambiguous. Every entry is stamped with a timestamp and the original audio filename.

Miss this and: without a deterministic routing rule, the operator drifts back to manual filing, which is the exact friction the Loop was built to remove.
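The routing rule above can be written as one deterministic function plus an append. The folder names come from the text; the file names inside 10_Projects/ and 00_Inbox/, the entry line format, and the function names are assumptions made for this sketch.

```python
from datetime import datetime
from pathlib import Path


def route_path(vault, category: str, when: datetime,
               project: str = None, person: str = None) -> Path:
    """Map a classified memo to the vault file it should append to."""
    vault = Path(vault)
    if category == "decision":
        return vault / "30_Decisions" / f"{when:%Y-%m-%d}_decisions.md"
    if category == "people" and person:
        return vault / "20_People" / f"{person}.md"
    if category in ("todo", "note") and project:
        # Assumed naming: one note per project, named after the project.
        return vault / "10_Projects" / f"{project}.md"
    # Ideas, and anything ambiguous, wait in the inbox for triage.
    # The dated file name is an assumption.
    return vault / "00_Inbox" / f"{when:%Y-%m-%d}_inbox.md"


def append_entry(path: Path, transcript: str, audio_name: str,
                 when: datetime) -> None:
    """Append one entry stamped with the timestamp and source audio filename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"- {when:%Y-%m-%d %H:%M} [{audio_name}] {transcript.strip()}\n")
```

Because `route_path` is a pure function of the classifier output, the same memo always lands in the same file, which is the deterministic property the step depends on.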

5. Archive (move + index)

What: preserve the original audio and clear the inbox.

How: raw audio moves from inbox/ to archive/YYYY-MM/. The transcript stays in the vault, linked to the audio path. The Vocal Coach pipeline does this exact move: original audio is never deleted, only relocated. Future AI runs can re-process the audio if a better model ships.

Miss this and: the inbox fills with stale audio, the vault loses its trust-by-cleanliness property, and the operator stops opening it.
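A small sketch of the archive move, assuming the inbox/ and archive/YYYY-MM/ layout described above. The function name and the refuse-to-overwrite behavior are choices made for this example, not details taken from the Vocal Coach pipeline.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_audio(audio_path, archive_root, when: datetime) -> Path:
    """Move a processed recording into archive/YYYY-MM/ and return its new path."""
    src = Path(audio_path)
    dest_dir = Path(archive_root) / f"{when:%Y-%m}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    if dest.exists():
        # Never overwrite an archived recording; surface the clash instead.
        raise FileExistsError(dest)
    shutil.move(str(src), str(dest))
    return dest
```

The returned path is what the vault entry would link to, so the transcript can always be traced back to its audio.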

The Pitfalls

The 30-minute memo. The Loop is built for 5-30 second thoughts. Long meetings need a different framework — recording plus a structured-summary pass, not voice-memo capture. Operators who try to run hour-long monologues through this Loop overload the classifier and pollute the vault.

Cloud Whisper. OpenAI's hosted Whisper API works and is cheap. It also sends every recording of every business conversation to a third party. Local-first matters more here than almost anywhere else in the Stack.

Skipping the classifier. Dumping raw transcripts into Inbox.md is the same as the old notes-app pattern with extra steps. The classifier is the difference between capture and workflow.

No archive rule. Without a cron job or pipeline step moving audio to archive/, the inbox bloats until the operator stops trusting the system.

Trusting Whisper on the first transcript. Whisper hallucinates names, dates, and proper nouns — especially with tiny and base. Spot-check the first 20 transcripts against the audio. Calibrate which fields to trust.

The Drill (this week)

Today, capture five voice memos throughout the day: phone, watch, whatever is closest. Tonight, transcribe all five (MacWhisper free tier, or `whisper file.m4a` from the terminal). Read each transcript and label it: decision, todo, people, idea, note, or noise. Count the ratio of useful artifacts to noise. If three or more of the five were genuinely useful, build the full Loop this week. If fewer, the trigger has too much friction, so switch capture surfaces before automating anything downstream.
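The tally at the end of the drill is simple enough to do by hand, but as a sketch it looks like this. The function name is hypothetical, and it treats every label except "noise" as a useful artifact, which is how the drill reads.

```python
USEFUL_LABELS = {"decision", "todo", "people", "idea", "note"}


def drill_verdict(labels, threshold: int = 3):
    """Return (useful, noise, build_the_loop) for one day's hand-labelled memos."""
    unknown = [l for l in labels if l not in USEFUL_LABELS and l != "noise"]
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    useful = sum(1 for l in labels if l in USEFUL_LABELS)
    noise = len(labels) - useful
    return useful, noise, useful >= threshold


# Three useful memos out of five clears the bar.
print(drill_verdict(["decision", "noise", "todo", "idea", "noise"]))
```

The example prints `(3, 2, True)`: three useful memos, two noise, so the full Loop is worth building.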

The Tools

| Tool | Job | Cost |
| --- | --- | --- |
| iPhone Voice Memos | Capture | Free |
| Apple Watch + Shortcuts | Wrist-level capture | Free |
| MacWhisper | GUI Whisper wrapper for Mac | $0-59 one-time |
| Superwhisper | Real-time dictation overlay | $0-15/mo |
| `openai-whisper` (Python) | Local Whisper, scriptable | Free |
| `whisper.cpp` | Faster local Whisper on Apple Silicon | Free |
| `imageio-ffmpeg` | Bundles ffmpeg without Homebrew | Free |
| Claude Sonnet 4.6 (API) | Classifier | ~$0.001/memo |

Whisper model choice: tiny (75 MB, ~3s per 30s clip, fine for English) is the default. Move to base (142 MB) if names matter. small (466 MB) only if dictating jargon — anything bigger is overkill for short memos.

Cross-references

The 4-Surface AI Stack — the Loop is the canonical Capture-to-Remember pipeline; it spans all four surfaces in one flow.
The Operator Vault — routing only works if the vault folder structure is in place.
The Memory Architecture — decisions and people-updates flow directly into the memory files Claude reads at the start of every session.
The Persona Cascade — the classifier needs a 50-word persona; same design rules apply at miniature scale.

One framework. One drill. One week at a time.

The Operator Stack is the architecture. Verala is the practice that runs it on your own communication delivery — voice, pitch, pause, presence. One foundation per week, until it's automatic.
