13 min read · By Brian Ellin

The Complete Guide to Doing: Voice & Screenshots for AI Builders


Doing is voice and screenshots for AI builders. Hold fn, talk, release — your transcription pastes instantly wherever your cursor is, entirely on-device. Drag to capture screenshots while recording to give your AI tools visual context alongside your voice.

This guide covers everything: installation, setup, daily usage, and tips for getting the most out of Doing.

Installation

Doing runs on macOS 14 (Sonoma) or later with Apple Silicon (M1+).

  1. Download the latest .dmg from doing.tools
  2. Open the .dmg and drag Doing to your Applications folder
  3. Launch Doing — the onboarding flow will walk you through permissions and hotkey setup

Doing runs in your menu bar. After setup, it stays out of your way until you press your hotkey.

Onboarding

The first time you launch Doing, you'll walk through a short setup flow:

Hotkey selection

Pick the key you'll hold to record. The app recommends options based on your keyboard:

| Hotkey | Best for | Notes |
| --- | --- | --- |
| fn | Apple keyboards | Most natural — like a walkie-talkie button. Disable the emoji picker in System Settings → Keyboard first. |
| Option+Space | Any keyboard | Works with third-party keyboards (Logitech, Keychron, etc.) where fn isn't detectable. |
| Right Command | Quick access, left-handers | Dedicated key that's easy to reach — especially great for left-handers who keep their right hand on the keyboard. |
| Custom shortcut | Power users | Any key + modifier combo (e.g., Ctrl+Shift+R). |

Tip

We recommend fn for most people — it's a dedicated key that feels like a walkie-talkie button. Just disable the emoji picker first in System Settings → Keyboard. If fn doesn't work for you, Right Command is a great alternative.

Permissions

Doing needs two macOS permissions:

  • Microphone — required to record your voice. macOS will prompt you automatically.
  • Accessibility — required to paste transcriptions into other apps. Doing simulates a Cmd+V keypress, which needs Accessibility access. The onboarding flow links you to the right System Settings panel.

Both permissions are one-time grants. If you move the app to a different folder or update to a new build, macOS may ask again.

Test recording

The onboarding includes a test step where you hold your hotkey and speak. You'll see the floating pip (a small keycap-shaped indicator near your cursor) light up with a waveform. Release, and the app transcribes your speech. If it works, you're ready.

Recording

Walkie-talkie mode (default)

The basic flow:

  1. Hold your hotkey
  2. Talk — the pip shows a live waveform near your cursor
  3. Release — Doing transcribes your speech and pastes it into the active app

That's it. The whole cycle takes a few seconds depending on your transcription provider. With Parakeet (the default local engine), transcription is nearly instant.

Hands-free mode

Sometimes you need to talk for longer — dictating a detailed prompt, explaining a bug, or walking through a complex change. Hands-free mode lets you record without holding a key.

  1. Hold your hotkey — recording starts normally
  2. Tap Shift while still holding the hotkey — a lock icon appears on the pip
  3. Release the hotkey — recording continues
  4. Press the hotkey again when you're done — recording stops and transcription begins

The lock badge on the pip tells you hands-free mode is active. You're free to use your keyboard and mouse normally while recording.
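The two recording modes above boil down to a small state machine: hold to record, tap Shift to lock, release or press again to stop. Here's an illustrative sketch in Python (hypothetical, not Doing's actual implementation):

```python
class Recorder:
    """Sketch of walkie-talkie and hands-free recording states."""

    def __init__(self):
        self.recording = False
        self.locked = False  # hands-free mode engaged

    def hotkey_down(self):
        if self.recording and self.locked:
            # Second press while locked: stop and transcribe.
            self.recording = False
            self.locked = False
            return "transcribe"
        self.recording = True
        return "recording"

    def shift_tap(self):
        # Tapping Shift while recording engages hands-free mode.
        if self.recording:
            self.locked = True

    def hotkey_up(self):
        if self.locked:
            return "recording"  # keep going hands-free
        if self.recording:
            self.recording = False
            return "transcribe"
        return "idle"
```

In walkie-talkie mode the release transitions straight to transcription; with the lock set, the release is a no-op and only a second press ends the session.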

Tip

Hands-free mode is especially useful when paired with AI coding tools. Start recording, describe what you want while looking at the code, then press your hotkey to stop. Your prompt gets transcribed and pasted right into the chat.

The floating pip

The pip is the small keycap-shaped indicator that follows your cursor during recording. It gives you visual feedback without taking your eyes off your work.

The pip automatically appears when recording starts and fades out when transcription completes.

Auto-paste

By default, Doing automatically pastes transcribed text into whatever app is focused. It copies the text to your clipboard and simulates Cmd+V via the Accessibility API.

You can disable auto-paste in Settings if you'd rather just have text copied to the clipboard without pasting.

YOLO Mode

YOLO Mode automatically presses Return after pasting your transcription. You talk, release your hotkey, and your prompt is already submitted — no review step, no editing. When active, the pip shows a symbol as a reminder.

This is how you want to interact with LLMs. Claude Code, ChatGPT, Cursor — they all understand natural speech just fine. A rambling 80-word spoken prompt with a few "ums" gives an LLM far more to work with than a polished 10-word typed command. There's no reason to review and edit before submitting.

Toggle it in Settings → General or from the main app window.

Tip

This may sound reckless, but try it for a day and you probably won't go back. You'll be surprised how much faster you move when you stop second-guessing your phrasing before hitting send.

Screenshots

Screenshots are what make Doing more than a voice tool. While recording, you can capture regions of your screen and attach them to your transcription — giving your AI tools visual context alongside your voice.

Why this matters

Think about how you naturally explain a problem to a coworker: you point at your screen and talk. "See this layout? The spacing is off here, and this button is getting cut off on mobile." You give them your words AND what you're looking at.

Doing lets you do exactly that with AI tools. Instead of describing what's on your screen in words (and hoping the AI understands), you show it:

"Fix the layout issue in this component — the cards are overlapping on mobile and the badge is getting clipped" + screenshot of the actual layout

The AI gets your spoken context AND the visual. One recording replaces a prompt, a screenshot, uploading the image, and explaining what to look at.

How to capture

While recording (in either walkie-talkie or hands-free mode):

  1. Click and drag on any part of your screen to select a region
  2. A dotted selection rectangle appears with dimensions
  3. Release to capture

You can take multiple screenshots during a single recording session. A badge on the pip shows the count.
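The drag gesture above reduces to a bit of geometry: normalize the two endpoints so a drag in any direction yields a rectangle with positive width and height. A minimal sketch (hypothetical, not the app's code):

```python
def drag_to_region(start, end):
    """Map two drag endpoints (x, y) to a capture rect (x, y, w, h).

    Works regardless of drag direction: top-left to bottom-right,
    bottom-right to top-left, and so on.
    """
    x = min(start[0], end[0])
    y = min(start[1], end[1])
    w = abs(end[0] - start[0])
    h = abs(end[1] - start[1])
    return (x, y, w, h)
```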


Where screenshots go

Screenshots are saved as PNGs organized by date: YYYY/MM/DD/HHMMSS-001.png. You can set a custom save directory in Settings → Screenshots.
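A sketch of that naming scheme, built from the recording timestamp and a per-recording index (illustrative; Doing's internals may differ):

```python
from datetime import datetime

def screenshot_path(ts: datetime, index: int) -> str:
    """Build a YYYY/MM/DD/HHMMSS-NNN.png path for a captured screenshot."""
    return ts.strftime("%Y/%m/%d/%H%M%S") + f"-{index:03d}.png"
```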

Note

Normal clicking and window dragging work as expected during recording. Only click-and-drag on content areas triggers screenshot capture — dragging a window by its title bar moves the window normally.

Enabling screenshots

Screenshot capture is enabled by default. You can toggle it in Settings → Screenshots. The app needs Screen Recording permission the first time you use it — macOS will prompt you automatically.

Audio ducking

Doing automatically fades your system audio when you start recording and brings it right back when you stop. If you're listening to music or a podcast, you don't need to fumble for pause — Doing handles it.

This works with any audio source: Spotify, Apple Music, YouTube, podcasts, whatever is playing through your Mac's audio system.

You can adjust the ducking level or disable it entirely in Settings → Audio.
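As a rough mental model (an assumption; the app's actual fade curve isn't documented), ducking can be pictured as a short linear ramp from full volume down to the ducked level:

```python
def ducked_gain(t: float, fade: float = 0.2, duck_level: float = 0.2) -> float:
    """Gain at t seconds after recording starts.

    Ramps linearly from 1.0 to duck_level over the fade window,
    then holds the ducked level until recording stops.
    """
    if t >= fade:
        return duck_level
    return 1.0 - (1.0 - duck_level) * (t / fade)
```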

Transcription providers

Doing supports multiple transcription backends. You can switch between them in Settings → Transcription.

Parakeet (default — local)

NVIDIA's Parakeet TDT model running locally on your Mac via FluidAudio. No internet required, completely private, and fast — roughly 190x real-time speed on M4 Pro.

  • Cost: Free
  • Latency: Near-instant (a few hundred milliseconds for most recordings)
  • Privacy: Audio never leaves your machine
  • Languages: English only (v2) or 25 European languages with auto-detection (v3)

Parakeet downloads a ~200MB model the first time you use it. After that, everything runs locally.
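The 190x figure translates directly into latency. A quick sanity check:

```python
def transcription_time(audio_seconds: float, rtf: float = 190.0) -> float:
    """Estimated processing time at a given real-time factor."""
    return audio_seconds / rtf

# A 3-minute recording: 180 / 190 is just under one second.
```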

Apple Transcription (macOS 26+)

Uses Apple's on-device foundation model introduced in macOS 26 (Tahoe). High quality, private, and free — but requires macOS 26 or later.

  • Cost: Free
  • Privacy: Fully on-device
  • Setup: Select apple as your provider in Settings

OpenAI Whisper

OpenAI's cloud-based transcription API. Excellent accuracy across many languages.

  • Cost: $0.006 per minute of audio
  • Setup: Add your OpenAI API key in Settings → Transcription
  • Languages: 50+ languages

Google Gemini

Google's multimodal model for transcription.

  • Cost: ~$0.0001 per second
  • Setup: Add your Gemini API key in Settings

AssemblyAI

Cloud transcription with optional LLM gateway for post-processing.

  • Cost: $0.015/min (batch) or $0.035/min (streaming)
  • Setup: Add your AssemblyAI API key in Settings
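To put the per-minute prices above in perspective, here's a back-of-envelope monthly estimate, assuming 30 minutes of dictation a day over 22 working days:

```python
# Hypothetical usage: 30 min/day for 22 working days.
minutes = 30 * 22  # 660 minutes per month

costs = {
    "Parakeet (local)": 0.0,
    "OpenAI Whisper": minutes * 0.006,
    "Gemini": minutes * 60 * 0.0001,      # priced per second
    "AssemblyAI (batch)": minutes * 0.015,
}
# Whisper and Gemini land around $3.96/month; AssemblyAI batch around $9.90.
```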

Tip

Start with Parakeet. It's free, fast, private, and good enough for most English transcription. Switch to a cloud provider only if you need better accuracy for specialized vocabulary or non-English languages.

Benchmark tool

Not sure which engine is best for you? Don't guess — test. Doing ships with a built-in benchmark tool that runs the same audio through every available engine side by side. You get real numbers: transcription time, word count, and output quality for your specific voice and vocabulary.

Find it in Settings → Transcription → Benchmark.
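Conceptually the benchmark just times each engine on the same clip and tallies the output. A simplified sketch, using stand-in engine callables rather than Doing's real API:

```python
import time

def benchmark(engines, audio):
    """Run the same audio through each engine; record time and word count.

    engines: mapping of name -> callable(audio) -> transcript string.
    """
    results = {}
    for name, transcribe in engines.items():
        start = time.perf_counter()
        text = transcribe(audio)
        elapsed = time.perf_counter() - start
        results[name] = {
            "seconds": elapsed,
            "words": len(text.split()),
            "text": text,
        }
    return results
```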

Language settings

Set your transcription language in Settings → Transcription. Options:

  • Auto-detect (default) — the transcription engine figures it out
  • Specific language — set an ISO code like en, es, fr, de, ja, ko, etc.

Parakeet v3 supports 25 European languages with auto-detection. Cloud providers generally support 50+ languages.

Words and dictionaries

Transcription engines are trained on general speech. They don't know your company's product names, your coworkers' names, or niche developer terminology. "Supabase" becomes "super base." "Vercel" becomes "versatile."

Doing fixes this with dictionary packs and personal words.

Dictionary packs

Doing ships with three built-in packs covering developer vocabulary:

  • AI Engineering (180+ words) — models, frameworks, and concepts (Claude, LangChain, RAG, embeddings, etc.)
  • Software Engineering (260+ words) — languages, tools, and infrastructure (TypeScript, PostgreSQL, Docker, Terraform, etc.)
  • Product & Business (220+ words) — product management and startup vocabulary (Jira, Figma, ARR, OKR, etc.)

All three are enabled by default. Disable any you don't need in Settings → Words.

Personal words

For words the packs don't cover — your company name, internal tools, teammates — add them in Settings → Words → My Words.

Each word maps a correct spelling to common transcription mistakes:

  • Datadog ← data dog, datadock
  • Supabase ← super base, superbased

Type the correct spelling, press Enter, then optionally add variants (the wrong versions the transcription engine produces). Your personal words always take priority over pack entries.
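The correction step can be pictured as a variant-to-spelling map applied to the transcript. An illustrative sketch (the real matcher is likely smarter about case and word boundaries):

```python
# Hypothetical personal-word map: each wrong variant points at the fix.
CORRECTIONS = {
    "data dog": "Datadog",
    "datadock": "Datadog",
    "super base": "Supabase",
    "superbased": "Supabase",
}

def apply_corrections(text: str) -> str:
    result = text
    # Replace longer variants first so multi-word variants like
    # "super base" win over any shorter overlapping match.
    for variant in sorted(CORRECTIONS, key=len, reverse=True):
        result = result.replace(variant, CORRECTIONS[variant])
    return result
```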

For a deeper dive, see the Words and Dictionaries guide.

Filler word removal

Doing automatically strips filler words — "um," "uh," "er," "ah," "hmm" — from every transcription before pasting. This happens alongside dictionary corrections, so you get clean text without manual editing.

The default filler word list covers the most common ones. You can customize it in your config file at ~/.config/doing/config.yaml under the filler_words key.
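A minimal sketch of the idea, assuming a simple word-boundary match (the shipped list and logic may differ):

```python
import re

# Default-style filler list; customizable via config in the real app.
FILLERS = ("um", "uh", "er", "ah", "hmm")

def strip_fillers(text: str) -> str:
    # Remove standalone fillers (and a trailing comma, if any),
    # then collapse the leftover double spaces.
    pattern = r"\b(?:" + "|".join(FILLERS) + r")\b,?"
    cleaned = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```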

Skills

Skills are post-transcription LLM processing that transforms your raw speech into something more polished or structured. Think of them as filters that run between transcription and paste.

Built-in skills

Doing ships with several skills out of the box. Here's what they actually do to your text:

Raw transcription (no skill):

"okay so um the pricing page has like three cards and they look fine on desktop but on mobile they're all squished together and the popular badge thing is getting cut off"

After Cleanup:

"The pricing page has three cards that look fine on desktop but on mobile they're squished together and the popular badge is getting cut off."

After Formalize:

"The pricing page displays three plan cards. While the layout renders correctly on desktop, the mobile view presents two issues: insufficient spacing between cards and the 'Popular' badge being clipped."

Same voice input, different outputs depending on where it's going. Raw for Claude Code (it doesn't care about grammar). Cleanup for Slack. Formalize for email.

Per-app skills

This is where skills get powerful. You can configure different skills for different apps, and they trigger automatically based on where your cursor is:

  • Gmail → Cleanup + Formalize (professional emails)
  • Slack → Cleanup only (casual but clean)
  • Claude Code → no skills (raw transcription for prompts)
  • Terminal → no skills

Configure per-app skills in Settings → Skills or in your config file:

app_skills:
  Gmail: "Cleanup, Formalize"
  Slack: "Cleanup"
  "*": "Cleanup"  # fallback for all other apps

Custom skills

Create your own skills by adding a SKILL.md file in ~/.config/doing/skills/{name}/:

---
name: My Custom Skill
description: What this skill does
---

Your prompt here. This is sent to the LLM along with the transcription.
Transform the text according to these instructions...

Custom skills appear in Settings alongside the built-in ones.

LLM provider for skills

Skills need an LLM to run. Doing automatically picks one based on what's available:

  1. Your configured skills provider (if set)
  2. Any API key you've already added (OpenAI, Gemini, AssemblyAI)
  3. Apple's on-device model (macOS 26+, free)
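That selection order is a plain first-available-wins chain. A sketch (parameter names are illustrative, not the app's API):

```python
def pick_llm_provider(configured=None, api_keys=(), apple_model_available=False):
    """Return the first available provider, in priority order."""
    if configured:               # 1. explicitly configured skills provider
        return configured
    if api_keys:                 # 2. any API key already on file
        return api_keys[0]
    if apple_model_available:    # 3. Apple's on-device model (macOS 26+)
        return "apple"
    return None                  # no provider: skills can't run
```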

Transcript files

Every transcription is saved to a daily markdown file at ~/.config/doing/ (or your custom transcript directory). The format is Obsidian-compatible with Dataview inline fields:

## 2026-03-25

### 09:45

This is what I said during that recording session.

- Words: 42 :: Duration: 6s :: Provider: parakeet :: Cost: $0.00 :: Target: Claude Code
- Running total: 42 words, 6s

Each entry includes metadata: word count, recording duration, provider, cost, and which app received the text. Running totals accumulate throughout the day.
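An entry in that format can be produced with a few f-strings. A sketch matching the example above (not the app's actual writer; totals are assumed to be tracked by the caller):

```python
def format_entry(time_hm, text, words, seconds, total_words, total_seconds,
                 provider="parakeet", cost=0.0, target="Claude Code"):
    """Render one transcript entry with Dataview-style inline fields."""
    return (
        f"### {time_hm}\n\n{text}\n\n"
        f"- Words: {words} :: Duration: {seconds}s :: Provider: {provider}"
        f" :: Cost: ${cost:.2f} :: Target: {target}\n"
        f"- Running total: {total_words} words, {total_seconds}s\n"
    )
```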

Tip

If you use Obsidian, point your transcript directory to a vault folder. You can then build Dataview queries like "all transcriptions pasted into Gmail this week" or "total words dictated per day."

Menu bar only mode

By default, Doing shows in both the menu bar and the Dock. If you prefer a cleaner setup:

Go to Settings → General and toggle off Show in Dock. Doing will run as a menu bar-only app — hidden from the Dock and Cmd+Tab switcher. The hotkey and all features work exactly the same.

Keyboard shortcuts reference

| Action | Shortcut |
| --- | --- |
| Start recording | Hold your configured hotkey |
| Stop recording | Release hotkey (walkie-talkie) or press hotkey again (hands-free) |
| Activate hands-free | Tap Shift while holding hotkey |
| Capture screenshot | Click and drag during recording |

Troubleshooting

Hotkey not working

  • fn key: Make sure the emoji picker is disabled in System Settings → Keyboard. Set "Press fn key to" to "Do Nothing" or "Change Input Source."
  • Third-party keyboard: The fn key likely won't work. Switch to Option+Space or a custom shortcut.
  • After macOS update: Accessibility permission may need to be re-granted. Check System Settings → Privacy & Security → Accessibility.

Text not pasting

  • Verify Accessibility permission is granted for Doing in System Settings → Privacy & Security → Accessibility.
  • If you recently moved or renamed the app, macOS revokes the permission. Re-add it.
  • If using YOLO Mode and text is submitting in the wrong app, make sure the correct app is focused before you start recording.

Transcription quality issues

  • Check your microphone input in System Settings → Sound → Input.
  • Try speaking closer to the mic.
  • For specialized vocabulary, add words to your personal dictionary in Settings → Words.
  • Use the benchmark tool (Settings → Transcription → Benchmark) to compare engines on your voice — you might get better results with a different provider.
  • Consider switching to a cloud provider (Whisper, Gemini) for better accuracy on non-English or highly technical content.

Model download problems

  • Parakeet downloads a ~200MB model on first use. If the download fails or stalls, check your internet connection and try again.
  • The model is stored locally — once downloaded, it works offline forever.
  • If transcription quality suddenly degrades, try re-downloading the model from Settings → Transcription.

Transcription is slow

  • Make sure you're using Parakeet (local) rather than a cloud provider. Cloud providers add network latency.
  • Close other GPU-intensive apps. Parakeet runs on Apple Silicon's Neural Engine and GPU.
  • Check that your Mac meets the requirements: Apple Silicon (M1+) and macOS 14+.

What's next

Have feedback on this guide or Doing itself? Email brian@doing.tools — it goes straight to the developer (there's only one).
