User Guide

Level 1: Getting Started
Level 2: Daily Use Basics
Level 3: Choosing the Right ASR Backend
Level 4: AI Enhancement
Level 5: Preview Power Features
Level 6: Direct Mode & Streaming
Level 7: Clipboard Enhancement
Level 8: Custom Enhancement Modes
Level 9: Multi-Provider Setup
Level 10: Vocabulary & Conversation History
- Vocabulary Retrieval
- Conversation History
Level 11: Launcher
Level 12: Fine-Tuning & Troubleshooting
What's Next?

A progressive guide from first launch to advanced usage. Follow the levels in order — each builds on the previous one.

Level 1: Getting Started

Goal: Install 闻字 and transcribe your first sentence.

Install

Option A — Download Release (easiest):

Download WenZi.app from the Releases page.
Drag it to /Applications.
Double-click to launch.

First launch: macOS blocks unsigned apps. Go to System Settings → Privacy & Security, find the 闻字 blocked message, and click Open Anyway.

Option B — Build from Source:

git clone https://github.com/Airead/WenZi
cd WenZi
uv sync
./scripts/build.sh        # builds WenZi.app in dist/

Option C — Run from Source (for developers):

git clone https://github.com/Airead/WenZi
cd WenZi
uv sync
uv run python -m wenzi

Grant Permissions

On first launch, macOS will ask for:

Permission	Why
Microphone	Record your voice
Accessibility	Type text into other apps
Speech Recognition	Only needed for Apple Speech backend

Grant all requested permissions in System Settings → Privacy & Security.

First Launch: Ready Immediately

The default ASR backend is Apple On-Device Speech — it uses the built-in macOS speech recognizer, so no model download is needed. 闻字 is ready to transcribe right after granting permissions.

Note: If you later switch to FunASR or MLX-Whisper in Settings, 闻字 will need to download a model (~75 MB to ~1.6 GB depending on the model). During download:

The menubar icon changes to a download icon (⬇) with a percentage like DL 45%

Please wait for the download to complete before trying to transcribe

Click menubar → View Logs... to open the built-in log viewer and monitor download progress in real time

Once loading finishes, the icon changes back to a microphone icon (🎙) and the status shows "Ready"

Tip: If a download fails or is interrupted, delete the cache directory (~/.cache/modelscope/ for FunASR, ~/.cache/huggingface/ for MLX-Whisper) and restart 闻字 to retry.
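The cleanup in the tip above can also be scripted. The following is a minimal sketch, not part of 闻字 itself: the function name `clear_model_cache` and the backend keys are hypothetical, and only the two cache paths come from this guide. Note that these directories may also hold models cached by other tools that use ModelScope or Hugging Face, so deleting them removes those downloads too.

```python
import shutil
from pathlib import Path
from typing import Optional

# Cache locations as listed in this guide, relative to the home directory.
CACHE_DIRS = {
    "funasr": Path(".cache") / "modelscope",
    "mlx-whisper": Path(".cache") / "huggingface",
}


def clear_model_cache(backend: str, home: Optional[Path] = None) -> bool:
    """Delete the download cache for one backend.

    Returns True if a directory was removed, False if there was nothing to delete.
    `backend` must be one of the hypothetical keys in CACHE_DIRS.
    """
    base = home if home is not None else Path.home()
    target = base / CACHE_DIRS[backend]
    if not target.exists():
        return False
    shutil.rmtree(target)
    return True
```

Calling `clear_model_cache("funasr")` and then restarting 闻字 should trigger a fresh download, per the tip above.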

Your First Transcription

Look for the microphone icon (🎙) in the menubar — that means 闻字 is ready.
Open any text input (Notes, browser, editor, terminal…).
Hold the fn key and speak.
Release fn — the transcribed text appears.

That's it! You've completed the basic workflow.

Understanding the Menubar Icon

The menubar icon changes to reflect the current status:

Icon	Status	Meaning
🎙 (mic.fill)	Ready	Idle, ready to record
〰 (waveform)	Recording...	Capturing audio
💬 (text.bubble)	Transcribing...	Processing speech to text
✨ (sparkles)	Enhancing...	AI enhancement in progress
👁 (eye)	Preview...	Preview panel is open
⬇ (arrow.down.circle) + DL X%	Downloading...	Model download in progress
⚙ (cpu)	Loading...	Loading model into memory
⚠ (triangle)	Error	Something went wrong

Level 2: Daily Use Basics

Goal: Understand the two output modes and basic menubar controls.

Preview Mode vs Direct Mode

闻字 has two ways to deliver results:

Mode	Behavior	When to use
Preview (default)	Shows a floating panel — review and edit before confirming	When accuracy matters, or you want to check before typing
Direct	Types text immediately into the active app	When speed matters and you trust the transcription

Toggle via: menubar → Settings... → General tab → Preview checkbox.

Preview Panel Basics

When Preview is on, after recording you'll see a floating panel:

Confirm (Enter) — types the text and closes the panel
Copy to clipboard (⌘+Enter) — copies the text to clipboard instead of typing
Cancel (Esc) — discards the text
Edit — click the text area to modify before confirming

Menubar Overview

Click the microphone icon in the menubar to see the menu:

🎙
├── Ready                    (status indicator)
├── ─────────────────────
├── Enhance Clipboard        AI-enhance selected text (Ctrl+Cmd+V)
├── Browse History...        Search and browse past transcriptions
├── Settings...              Open the settings panel (4 tabs)
├── ─────────────────────
├── View Logs...             Open log viewer
├── Usage Stats              View usage statistics
├── About 闻字          Version info
└── Quit

All model selection, AI enhancement configuration, and hotkey management are done through the Settings panel — not the menubar menu directly.

Recording Feedback

While holding fn, a floating indicator with audio level bars shows you're recording. A sound plays on start and stop (configurable in Settings → General).

When the ASR backend supports streaming (e.g., Apple Speech), a live transcription overlay appears below the recording indicator, showing partial transcription text in real time as you speak. This gives you immediate visual feedback without waiting for the recording to end.

Recording Controls

While holding the recording hotkey, you can press additional keys to control the session:

Key (while holding `fn`)	Action
`Cmd` (default)	Restart recording — discards current audio and starts a new recording
`Space` (default)	Cancel recording — discards audio and returns to idle
`Z`	Show last preview — cancels recording and opens the last preview result

The restart and cancel keys can be customized in Settings → General (or via config: feedback.restart_key and feedback.cancel_key).

Level 3: Choosing the Right ASR Backend

Goal: Pick the best speech engine for your needs.

Backend Comparison

Backend	Language	Speed	Accuracy	Download Size
Apple Speech (default)	Multiple	Fast	Good	None (built-in)
FunASR	Chinese	Fast	High (Chinese)	~500 MB
MLX-Whisper	99 languages	Medium	High	75 MB – 1.6 GB
Whisper API	Multiple	Depends on network	High	None (cloud)

How to Switch

Open Settings... → STT tab. You'll see:

Local section: Radio buttons for all available local ASR presets (FunASR, MLX-Whisper variants, Apple Speech)
Remote section: Cloud-based ASR providers you've configured

Click a radio button to switch. The model will start loading (or downloading if not yet cached).

First download reminder: When switching to a new MLX-Whisper model for the first time, the model needs to download. Watch the menubar icon for DL X% progress. Model sizes range from ~75 MB (tiny) to ~1.6 GB (large-v3-turbo).

Recommendations

Zero-setup, any language → Apple Speech (default — no download, supports real-time streaming)
Chinese only → FunASR (best accuracy for Chinese, fully offline)
English or multilingual, high accuracy → MLX-Whisper small or large-v3-turbo
Best accuracy, don't mind latency → Whisper API via Groq (free tier available)

Level 4: AI Enhancement

Goal: Use an LLM to proofread, translate, or reformat transcribed text.

AI enhancement is optional — by default it's off. When enabled, your transcribed text is sent to an LLM for post-processing before output.

Step 1: Set Up an LLM Provider

You need an LLM backend. Two easy options:

Option A — Local with Ollama (free, private):

Install Ollama and run ollama pull qwen2.5:7b
That's it: 闻字's default config points to Ollama.

Option B — Cloud API (e.g., DeepSeek, OpenAI):

Open Settings... → LLM tab → Add Provider...
Fill in provider details:

name: deepseek
base_url: https://api.deepseek.com/v1
api_key: sk-your-key
models:
  deepseek-chat

Click Verify → Save

Step 2: Select an Enhancement Mode

Open Settings... → AI tab. Select a mode:

Mode	What it does
Off	No enhancement (raw transcription)
纠错润色 (Proofread)	Fix typos, grammar, punctuation
翻译为英文 (Translate EN)	Translate to English
命令行大神 (Commandline)	Convert speech to shell commands

Step 3: Try It

Make sure an LLM provider is configured and a mode is selected.
Hold fn, say something, release.
The result now goes through the LLM before appearing.

Tip: Start with "纠错润色" (Proofread) — it's the most universally useful mode.

Level 5: Preview Power Features

Goal: Master the preview panel's editing and switching capabilities.

With Preview mode on and AI enhancement active, the preview panel becomes a powerful editor.

Quick Mode Switching

Press ⌘1 through ⌘9 to instantly switch enhancement modes and re-process the same audio:

⌘1 = first mode in list (e.g., Proofread)
⌘2 = second mode (e.g., Translate EN)
⌘3 = third mode (e.g., Commandline)
…and so on for custom modes

Result Caching

When you switch modes in the preview panel, 闻字 caches completed results. Switching back to a previously used mode shows the cached result instantly (marked [cached]) — no API call needed.

The cache is cleared when new audio is recorded.
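The caching behavior described above can be pictured as a small dictionary keyed by mode and reset for each new recording. This is an illustrative sketch only: the class `ModeResultCache` and the `enhance` callable are hypothetical names, not 闻字's internal API.

```python
from typing import Callable, Dict, Tuple


class ModeResultCache:
    """Cache enhancement results per mode for the current recording."""

    def __init__(self, enhance: Callable[[str, str], str]) -> None:
        # Hypothetical callable: (mode_id, text) -> enhanced text.
        self._enhance = enhance
        self._results: Dict[str, str] = {}

    def get(self, mode_id: str, text: str) -> Tuple[str, bool]:
        """Return (result, was_cached); only call the LLM on a cache miss."""
        if mode_id in self._results:
            return self._results[mode_id], True
        result = self._enhance(mode_id, text)
        self._results[mode_id] = result
        return result, False

    def clear(self) -> None:
        """Drop all cached results, for example when new audio is recorded."""
        self._results.clear()
```

Switching to a mode a second time returns the stored text with `was_cached` set to True, which corresponds to the [cached] marker in the panel.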

Preview History

闻字 keeps an in-memory history of your last 10 preview results (cleared on app restart). This lets you go back to a previous transcription without re-recording.

History dropdown: Click the clock icon in the preview panel's toolbar to open a dropdown showing recent previews. Select one to reload it into the panel.
Quick recall: Press fn+Z at any time (even outside the preview panel) to cancel any active recording and instantly open the most recent preview result.
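A fixed-size, in-memory history like the one described above behaves much like Python's `collections.deque` with `maxlen=10`, which silently drops the oldest entry once the limit is reached. The sketch below uses hypothetical class and method names and is not taken from 闻字's code.

```python
from collections import deque
from typing import Deque, List, Optional


class PreviewHistory:
    """Keep only the most recent preview results in memory."""

    def __init__(self, limit: int = 10) -> None:
        # A deque with maxlen discards the oldest item when full.
        self._items: Deque[str] = deque(maxlen=limit)

    def add(self, text: str) -> None:
        """Record a new preview result."""
        self._items.append(text)

    def latest(self) -> Optional[str]:
        """Return the most recent result, or None if the history is empty."""
        return self._items[-1] if self._items else None

    def items(self) -> List[str]:
        """Return results newest first, as a history dropdown might list them."""
        return list(reversed(self._items))
```

Because nothing is written to disk, the history is empty again after a restart, which matches the behavior described above.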

Web Preview Panel

The preview panel uses a modern WKWebView-based (HTML/CSS/JS) interface by default, providing a polished look with dark mode support. You can switch between the web-based and native AppKit preview in Settings → General → Web Preview toggle.

Other Preview Features

Feature	How
Edit text	Click the text area and type
Copy to clipboard	`⌘+Enter` — copies instead of typing into the active app
Toggle punctuation	Check/uncheck the Punc checkbox to re-transcribe with/without punctuation
Switch STT model	Use the STT dropdown in the panel
Switch LLM model	Use the LLM dropdown in the panel
Play audio	Click the play button to hear the recording
Save audio	Click save to export the recording as a file
Google Translate	Click the translate button to open Google Translate with current text

Level 6: Direct Mode & Streaming

Goal: Use 闻字 for fast, hands-free input with real-time AI feedback.

Enable Direct Mode

Turn off Preview: Settings... → General tab → uncheck Preview.

Now when you release the hotkey, text is typed directly into the active app — no panel, no confirmation needed.

Real-Time Streaming STT

When using an ASR backend that supports streaming (currently Apple Speech), 闻字 shows a live transcription overlay during recording. Partial text appears in real time as you speak, giving you instant feedback before you even release the hotkey.

This works in both Preview and Direct modes. In Direct mode, it is especially useful because you can see the transcription forming and decide whether to keep or cancel it.

AI Streaming Overlay

In Direct mode, after recording ends, a streaming overlay appears showing the processing pipeline:

Transcription phase — the overlay first shows the ASR result (or streams partial text if the backend supports it)
Enhancement phase (if AI enhancement is active) — the LLM processes the text in real time, with tokens appearing as they are generated

Controls during the overlay:

Press Esc to cancel transcription/enhancement and discard the result
The overlay shows token count and processing status
Once complete, the final text is typed automatically

When to Use Direct Mode

Chat apps where speed matters
Terminal / command line input
Any workflow where you trust the AI output and don't need to review

Level 7: Clipboard Enhancement

Goal: AI-enhance any text in any app, not just speech transcriptions.

How It Works

Select text in any application.
Press Ctrl+Cmd+V (default hotkey).
闻字 copies the selection, sends it to the LLM with the current enhancement mode, and outputs the result.

You can also trigger it from the menubar: click Enhance Clipboard.

Use Cases

Select a rough draft → enhance with Proofread mode
Select Chinese text → translate to English
Select a task description → convert to shell command

Output Behavior

Preview on: Result appears in the preview panel for review
Preview off: The result replaces the selected text via the clipboard

Customize the Hotkey

Edit ~/.config/WenZi/config.json:

{
  "clipboard_enhance": {
    "hotkey": "ctrl+cmd+v"
  }
}

The hotkey format is modifier+modifier+key. See Level 12 for format details and examples.

Level 8: Custom Enhancement Modes

Goal: Create your own AI modes and chain pipelines.

Create a Custom Mode

Via Settings (easy):

Open Settings... → AI tab → Add Mode...
Edit the template, click Save, enter a mode ID.

Via file (flexible):

Create a .md file in ~/.config/WenZi/enhance_modes/:

---
label: Formal Email
order: 60
---
You are a professional email writing assistant.
Rewrite the user's input as a formal, polished email body.
Use appropriate greetings and closings if context suggests an email.
Maintain the original intent and key information.
Output only the email text without any explanation.

The filename (without .md) becomes the mode ID. Restart 闻字 to load the new mode.
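If you create several modes, generating the files from a script can be convenient. The sketch below writes a mode file in the layout shown above (a front-matter block with label and order, followed by the prompt). The function name `write_mode_file` and the mode ID used in the usage note are hypothetical; only the file layout and directory come from this guide.

```python
from pathlib import Path


def write_mode_file(modes_dir: Path, mode_id: str, label: str,
                    order: int, prompt: str) -> Path:
    """Write <mode_id>.md with label/order front matter and the prompt body."""
    modes_dir.mkdir(parents=True, exist_ok=True)
    path = modes_dir / f"{mode_id}.md"
    content = (
        "---\n"
        f"label: {label}\n"
        f"order: {order}\n"
        "---\n"
        f"{prompt.strip()}\n"
    )
    path.write_text(content, encoding="utf-8")
    return path
```

For example, `write_mode_file(Path.home() / ".config/WenZi/enhance_modes", "formal_email", "Formal Email", 60, prompt_text)` would produce a file equivalent to the one above, assuming the default config directory.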

Create a Chain Mode

Chain modes run multiple steps sequentially:

---
label: 润色+翻译EN
order: 25
steps: proofread, translate_en
---
First proofreads the text, then translates to English.
(This body is documentation only — each step uses its own prompt.)
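To make the chain idea concrete, here is a rough sketch of how a file like the one above could be read and executed: the front matter is parsed, `steps` is split on commas, and each step's output feeds the next step. This is an illustration under assumptions, not 闻字's actual loader; the function names and the `run_step` callable are hypothetical, and the front matter is parsed as simple `key: value` lines rather than full YAML.

```python
from typing import Callable, Dict, List, Tuple


def parse_mode_file(text: str) -> Tuple[Dict[str, str], str]:
    """Split a mode file into (front-matter dict, body text)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter, treat the file as plain body


def chain_steps(meta: Dict[str, str]) -> List[str]:
    """Return the ordered step IDs, or an empty list for a non-chain mode."""
    return [s.strip() for s in meta.get("steps", "").split(",") if s.strip()]


def run_chain(text: str, steps: List[str],
              run_step: Callable[[str, str], str]) -> str:
    """Apply each step in order; the output of one step is the next input."""
    for step in steps:
        text = run_step(step, text)
    return text
```

With the example file above, `chain_steps` yields `["proofread", "translate_en"]`, so the text is proofread first and the proofread result is then translated.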

Tips for Good Prompts

Be specific about what to do AND what NOT to do
End with "Output only the processed text without any explanation"
Use order values with gaps (10, 20, 30…) so you can insert modes between them

See Enhancement Mode Examples for ready-to-use templates covering email, meeting notes, translation, developer tools, and more.

Level 9: Multi-Provider Setup

Goal: Configure multiple ASR and LLM providers and switch between them.

Why Multiple Providers?

Use a fast local model (Ollama) for simple tasks, cloud API for complex ones
Have a backup when one provider is down
Compare results across different models

Add Providers via Settings

LLM providers: Settings → LLM tab → Add Provider...

ASR providers: Settings → STT tab → Add Provider...

Both use the same dialog format:

name: provider-name
base_url: https://api.example.com/v1
api_key: your-key
models:
  model-1
  model-2

Switch at Runtime

In the Settings panel, all configured models appear as radio buttons. Click to switch — no restart needed.

In Preview Panel

You can also switch LLM and STT models directly from the preview panel's dropdowns, making it easy to compare results from different models on the same audio.

See Provider & Model Setup Guide for detailed examples covering Ollama, OpenAI, DeepSeek, Groq, OpenRouter, Qwen, and more.

Level 10: Vocabulary & Conversation History

Goal: Teach 闻字 your personal terms and maintain topic context across turns.

Vocabulary Retrieval

Problem: ASR often misrecognizes proper nouns, technical terms, and names (e.g., "萍萍" → "平平").

Solution: 闻字 builds a personal vocabulary from your correction history and uses it to improve future results.

How to Build Vocabulary

Use Preview mode with AI enhancement — edit the result when the AI gets a term wrong.
Each edit is logged to ~/.config/WenZi/conversation_history.jsonl with a user_corrected flag.
Auto build (default): After every 10 corrections, vocabulary is rebuilt automatically in the background.
Manual build: Settings → AI tab → Build Vocabulary...

Enable Vocabulary

Settings → AI tab → toggle Vocabulary (N). The number shows how many entries are indexed.

When enabled, relevant vocabulary entries are retrieved via embedding similarity and injected into the LLM prompt, helping it correct domain-specific terms.
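The retrieval step can be illustrated with a toy example. The sketch below ranks vocabulary entries by cosine similarity to a query vector and appends the best matches to a prompt. Everything here is an assumption for illustration: 闻字's actual embedding model, index, thresholds, and prompt wording are not described in this guide, and the two-dimensional vectors stand in for real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec: Sequence[float],
             entries: List[Tuple[str, Sequence[float]]],
             top_k: int = 3, min_score: float = 0.5) -> List[str]:
    """Return up to top_k terms whose similarity is at least min_score."""
    scored = [(cosine(query_vec, vec), term) for term, vec in entries]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in scored[:top_k]]


def build_prompt(base_prompt: str, terms: List[str]) -> str:
    """Append retrieved terms to the prompt (hypothetical wording)."""
    if not terms:
        return base_prompt
    return base_prompt + "\n\nKnown terms: " + ", ".join(terms)
```

The design point is that only entries relevant to the current text are injected, which keeps the prompt short even when the vocabulary grows large.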

Conversation History

Problem: Each transcription is independent — the LLM doesn't know what you just said.

Solution: 闻字 injects recent confirmed outputs into the AI prompt, so the LLM understands the current topic.

Enable

Settings → AI tab → toggle Conversation History.

How It Works

Only preview-confirmed records are used (ensuring quality)
Recent entries are formatted efficiently with arrow notation for corrections
The LLM uses this context to maintain consistency (e.g., always using the correct name spelling)

Browse History

Menubar → Browse History... opens a full-featured history browser with:

Text search — search across all transcription text fields
Tag filters — click tag pills to filter by enhance mode (proofread, translate, etc.), STT model, LLM model, or whether corrections were made
Time range filtering — filter by today, last 7 days, last 30 days, or all time
Record deletion — select a record and click Delete to remove it
Edit and save — modify the final text of any record and save changes
Archived records — check the "Archived" toggle to include records from monthly archives

Auto-Rotation and Archiving

When conversation history exceeds 20,000 records, 闻字 automatically archives older records into monthly files under ~/.config/WenZi/conversation_history_archives/YYYY-MM.jsonl. The main history file keeps the most recent 20,000 records for fast access, while archived records remain searchable through the history browser.
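The rotation rule above amounts to "keep the newest N records, group the rest by month". A minimal sketch follows, assuming each record carries an ISO 8601 `timestamp` string such as "2025-03-14T09:30:00" and that records are ordered oldest first. The field name and the helper names are hypothetical; only the 20,000 limit and the YYYY-MM.jsonl naming come from this guide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, str]


def rotate(records: List[Record],
           keep: int = 20000) -> Tuple[List[Record], Dict[str, List[Record]]]:
    """Split records (oldest first) into (kept, archives keyed by 'YYYY-MM')."""
    cut = len(records) - keep
    if cut <= 0:
        return list(records), {}
    overflow, kept = records[:cut], records[cut:]
    archives: Dict[str, List[Record]] = defaultdict(list)
    for rec in overflow:
        # Assumed field: the first 7 characters of an ISO timestamp are YYYY-MM.
        archives[rec["timestamp"][:7]].append(rec)
    return kept, dict(archives)


def archive_filename(month: str) -> str:
    """Map '2025-03' to the archive file name used in this guide."""
    return f"{month}.jsonl"
```

Each archive bucket would then be appended to its monthly file, one JSON object per line, while the kept records are written back to the main history file.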

See Conversation History Enhancement for technical details.

Level 11: Launcher

Goal: Use the built-in Launcher for quick access to apps, files, clipboard, bookmarks, and snippets.

The Launcher is a keyboard-driven search panel built into 闻字's scripting system. It works like Alfred or Raycast — press a hotkey, type to search, and press Enter to act.

Enable the Launcher

Enable scripting: Settings... → General tab → Scripting toggle
Edit ~/.config/WenZi/config.json and set:

{
  "scripting": {
    "chooser": {
      "enabled": true,
      "hotkey": "cmd+space"
    }
  }
}

Restart 闻字.

Basic Usage

Press Cmd+Space (or your configured hotkey) to open the Launcher.
Start typing to search apps — results appear instantly.
Press Enter to open the selected app, or ⌘+Enter to reveal it in Finder.
Press Esc to close.

Prefix Search

Use a prefix followed by a space to search a specific source:

Type this	To search
`f readme`	Files named "readme"
`cb hello`	Clipboard entries containing "hello"
`bm github`	Bookmarks matching "github"
`sn email`	Snippets matching "email"
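Prefix search boils down to checking whether the first word of the query is a known prefix. The sketch below shows one way to route a query; the source names ("files", "apps", and so on) and the function name are illustrative labels, not identifiers from 闻字's scripting API.

```python
from typing import Tuple

# Prefixes from the table above, mapped to illustrative source names.
PREFIXES = {
    "f": "files",
    "cb": "clipboard",
    "bm": "bookmarks",
    "sn": "snippets",
}


def route_query(query: str) -> Tuple[str, str]:
    """Return (source, search term); unprefixed queries search apps."""
    prefix, sep, rest = query.partition(" ")
    # A prefix only counts when it is followed by a space.
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix], rest.strip()
    return "apps", query.strip()
```

Requiring the trailing space is what lets a query like "firefox" search apps instead of being read as the file prefix "f".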

Keyboard Shortcuts

Shortcut	Action
`↑` `↓`	Navigate results
`Enter`	Open / execute
`⌘+Enter`	Reveal in Finder
`⌘1` – `⌘9`	Quick select by position
`Esc`	Close

Extending with Scripts

You can add custom data sources to the Launcher via scripts. See Scripting Documentation for the wz.chooser.source API.

Level 12: Fine-Tuning & Troubleshooting

Goal: Optimize your setup and solve common problems.

Settings Panel

Menubar → Settings... opens a panel with 4 tabs:

Tab	What you can configure
General	Recording hotkeys, sound feedback, visual indicator, preview toggle, web/native preview, restart/cancel key selection, scripting toggle, custom config directory
STT	Local ASR model selection, remote ASR provider management
LLM	LLM provider and model selection, provider management
AI	Enhancement mode (displayed in defined order, not alphabetical), thinking mode, vocabulary, conversation history, auto build

The Settings panel remembers the last active tab across sessions. At the bottom, toolbar buttons provide quick access to Show Config, Edit Config, and Reload Config.

Custom Config Directory

In the General tab, you can set a custom config directory to store 闻字 configuration files in a location of your choice (e.g., a synced folder). After changing the directory, 闻字 will prompt you to restart for the change to take effect.

Scripting Toggle

The General tab includes a Scripting toggle to enable or disable the scripting/plugin system. When enabled, 闻字 loads and executes Python scripts from the configured script directory. See the Scripting Documentation for details on writing plugins.

Hotkey Configuration

闻字 supports flexible hotkey configuration. The recording hotkey is configured in the Settings panel (General tab), while the clipboard enhance hotkey is set in the config file.

Hotkey Format

Hotkeys use the format modifier+modifier+key, where:

Modifiers: cmd (or command), ctrl, alt (or option), shift
Regular keys: a–z, 0–9
Special keys: fn, f1–f12, esc, space
Right-side modifiers: cmd_r, ctrl_r, alt_r, shift_r

Examples

Hotkey	Config value	Description
Fn key (hold to record)	`"fn"`	Default recording hotkey — single special key
F5 key	`"f5"`	Use a function key to record
Ctrl+Cmd+V	`"ctrl+cmd+v"`	Default clipboard enhance hotkey
Shift+Cmd+Space	`"shift+cmd+space"`	Alternative with Space key
Alt+D	`"alt+d"`	Option+D combination
Ctrl+Shift+R	`"ctrl+shift+r"`	Two modifiers plus a letter key
Ctrl+Cmd+1	`"ctrl+cmd+1"`	Number key combination
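Before saving a hotkey to the config file, it can help to check that the string follows the format above. The following validator is a sketch built only from the rules listed in this section; it is not 闻字's own parser. It assumes that the last part must be a regular or special key and rejects modifier-only strings, which this guide does not explicitly allow or forbid.

```python
import re
from typing import List, Tuple

# Names taken from the format list above; aliases map to one canonical form.
ALIASES = {"command": "cmd", "option": "alt"}
MODIFIERS = {"cmd", "ctrl", "alt", "shift", "cmd_r", "ctrl_r", "alt_r", "shift_r"}
SPECIAL_KEYS = {"fn", "esc", "space"} | {f"f{n}" for n in range(1, 13)}


def parse_hotkey(value: str) -> Tuple[List[str], str]:
    """Split a hotkey string into (modifiers, key) or raise ValueError."""
    parts = [ALIASES.get(p.strip(), p.strip()) for p in value.lower().split("+")]
    *modifiers, key = parts
    unknown = [m for m in modifiers if m not in MODIFIERS]
    if unknown:
        raise ValueError(f"unknown modifier(s): {unknown}")
    if len(set(modifiers)) != len(modifiers):
        raise ValueError("duplicate modifier")
    # Regular keys are a single letter or digit; special keys are named.
    if not (re.fullmatch(r"[a-z0-9]", key) or key in SPECIAL_KEYS):
        raise ValueError(f"unsupported key: {key!r}")
    return modifiers, key
```

All config values in the examples table pass this check, while a string such as "ctrl+foo" raises a ValueError.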

Config File Examples

{
  "hotkeys": {
    "fn": true,
    "f5": true
  },
  "clipboard_enhance": {
    "hotkey": "shift+cmd+space"
  }
}

Multiple recording hotkeys can be enabled simultaneously by adding entries to the hotkeys map with true values. Set to false to disable a hotkey without removing it.

Configuration File

Default location: ~/.config/WenZi/config.json

The config directory can be changed to a custom path via Settings → General → Config Directory (stored in macOS preferences, survives config file changes).

You only need to include fields you want to change — everything else uses defaults. After editing, click Reload Config in the Settings toolbar to apply without restarting.
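Because the file only needs the fields you override, a small script can change one setting without touching the rest. The helper below is a sketch, not a 闻字 tool: the function name `set_config_value` and the dotted-key convention are hypothetical. The keys used in the usage note (`clipboard_enhance.hotkey`, `hotkeys.f5`) are taken from the examples in this guide.

```python
import json
from pathlib import Path
from typing import Any, Dict


def set_config_value(config_path: Path, dotted_key: str, value: Any) -> Dict[str, Any]:
    """Set one nested key in a JSON config file, preserving all other fields."""
    if config_path.exists():
        data = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        data = {}
    node = data
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        child = node.setdefault(name, {})
        if not isinstance(child, dict):
            raise TypeError(f"{name!r} is not an object in {config_path}")
        node = child
    node[leaf] = value
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(
        json.dumps(data, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return data
```

For example, `set_config_value(Path.home() / ".config/WenZi/config.json", "hotkeys.f5", True)` adds F5 as a recording hotkey, assuming the default config location. Click Reload Config afterwards so the running app picks up the change.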

See Configuration Reference for all options.

Logging

Logs are saved to ~/Library/Logs/WenZi/wenzi.log (5 MB rotation, 3 backups).
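For script authors who want their own logs to follow the same policy, Python's standard `RotatingFileHandler` can express "5 MB rotation, 3 backups". This is a sketch under assumptions: whether 闻字 itself uses this handler is not stated in this guide, 5 MB is taken as 5 × 1024 × 1024 bytes, and the logger name and format string are hypothetical.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def make_rotating_logger(log_path: Path, name: str = "wenzi-sketch") -> logging.Logger:
    """Create a logger that rotates at about 5 MB and keeps 3 backup files."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        log_path,
        maxBytes=5 * 1024 * 1024,  # rotate once the file reaches about 5 MB
        backupCount=3,             # keep .1, .2 and .3 alongside the live file
        encoding="utf-8",
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

With these settings the log directory never holds more than four files for one logger: the live file plus three rotated backups.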

View logs (recommended): Menubar → View Logs... opens the built-in log viewer — the easiest way to check logs, monitor model download/loading progress, and diagnose issues in real time.

Log files are also available on disk at the path above if you prefer an external editor.

Usage Statistics

Menubar → Usage Stats opens an interactive statistics dashboard with:

Summary cards — total transcriptions (with today's count), total tokens consumed (with cached input token breakdown), accept rate, and total recording time
Interactive charts (powered by Chart.js) with selectable time ranges (7/14/30 days):
Daily Transcriptions — stacked bar chart showing Direct vs Preview mode usage per day
User Actions — stacked bar chart of Accept / Modified / Cancel actions per day
Token Usage — stacked bar chart of Prompt / Completion / Cached tokens per day
Enhance Modes — stacked bar chart showing usage of each enhancement mode per day

Common Issues

Text doesn't type into the app

Check Accessibility permission in System Settings
Try switching output method in config: "output": {"method": "clipboard"}

Model download takes too long

The menubar shows DL X% during download — this is normal when switching to a model for the first time
FunASR: ~500 MB, MLX-Whisper large-v3-turbo: ~1.6 GB
Check the log viewer for detailed progress
If partially downloaded, delete the cache directory (~/.cache/modelscope/ for FunASR, ~/.cache/huggingface/ for MLX-Whisper) and restart

LLM enhancement times out

Increase timeout: edit config.json → ai_enhance.timeout (default: 30s)
Check if your LLM provider is reachable
For Ollama, ensure it's running: ollama serve
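A quick way to tell a slow model from an unreachable provider is to probe the server directly. The sketch below only checks that something answers HTTP at the given address. The function name is hypothetical, and the Ollama address in the usage note assumes its default port of 11434; adjust it if your setup differs.

```python
import urllib.error
import urllib.request


def is_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server responds at base_url within the timeout."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so it is still reachable.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False
```

For a local Ollama install, `is_reachable("http://localhost:11434")` returning False suggests the server is not running. If it returns True but enhancement still times out, raising ai_enhance.timeout is the more likely fix.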

Preview panel doesn't appear

Make sure Preview is enabled in Settings → General
Try clicking the menubar icon to bring the app to focus

Notifications don't work during development

Expected when running via uv run without app bundling
Notifications work normally in the packaged .app version

Keyboard Shortcuts Summary

Shortcut	Context	Action
`fn` (hold/release)	Global	Record / stop and transcribe
`fn` + `Cmd`	During recording	Restart recording (discard current audio, start new)
`fn` + `Space`	During recording	Cancel recording (discard audio, return to idle)
`fn` + `Z`	During recording	Cancel recording and show the last preview result
`Ctrl+Cmd+V`	Global	Clipboard enhancement
`Cmd+Space`	Global	Open/close Launcher (if enabled)
`Enter`	Preview panel	Confirm and type text
`⌘+Enter`	Preview panel	Copy to clipboard
`Esc`	Preview panel / Streaming overlay	Cancel
`⌘1` – `⌘9`	Preview panel	Switch enhancement mode
`⌘A/C/V/X`	Preview panel	Standard edit shortcuts
`⌘Z` / `⌘⇧Z`	Preview panel	Undo / Redo

Note: The restart key (Cmd) and cancel key (Space) are configurable in Settings → General or via config (feedback.restart_key and feedback.cancel_key). Available choices: cmd, ctrl, alt, shift, space, esc.

What's Next?

You now know everything 闻字 offers. Here are some ideas to get the most out of it:

Create modes for your workflow — meeting notes, code review comments, Slack messages
Build chain modes — proofread → translate, or summarize → format
Accumulate vocabulary — the more you correct, the smarter it gets
Try different models — compare Groq's speed vs local Ollama's privacy vs OpenAI's accuracy
Write scripts — extend 闻字 with Python scripts for custom hotkey actions (see Scripting Documentation)
Browse Enhancement Mode Examples for inspiration

For technical details on any feature, see the documentation index.
