Private Speech Transcription: Converting Audio Without Cloud Risk

Why Keep Transcription Out of the Cloud

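Commercial transcription services (AWS Transcribe, Google Speech-to-Text, the Whisper API) send your recordings to remote servers. That exposes not only what you say but also the voice of everyone in the recording, people who may never have consented. For therapy sessions, attorney-client calls, confidential meetings, and anything involving children, this is a serious problem. Running Whisper locally in the browser eliminates the upload entirely. (For why voice is especially sensitive biometric data and how it can be exploited, see the Learn article on voice privacy.)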

How to Transcribe Audio Without Uploading It to the Cloud

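1. Choose a tool that runs locally, not in the cloud. The key test: a privacy-respecting transcription tool downloads the AI model (most are based on OpenAI's open-source Whisper model) to your device and processes the audio there, so your file is never uploaded. Browser tools built on WebAssembly and offline desktop apps both pass this test.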
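2. Load your audio or video file (common formats: MP3, MP4, WAV, M4A, OGG, FLAC, WebM). In browser tools, the Whisper model is downloaded only once, on first use (typically around 120 MB); after that, every transcription runs locally, even offline.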
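3. Set the spoken language before processing. Specifying it explicitly instead of relying on auto-detection noticeably improves accuracy, especially for accented speech and non-English audio. Many Whisper-based tools can also translate the speech into English.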
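4. Review, edit, and export locally. A good tool shows timestamped segments that you can correct, then exports plain text or SRT subtitles, all generated on your device. Because nothing is uploaded, you can stay disconnected from the network for the whole process once the model has been downloaded, and the audio never leaves your device.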
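For readers who script their own pipeline, the steps above can be sketched in Python with the open-source `openai-whisper` package, assuming it and ffmpeg are installed. This is one local option among several, not what any particular browser tool does internally. Only `whisper.load_model`, `model.transcribe` and its `language`/`task` decoding options come from the package (which downloads the model weights on first use and reuses the cached copy afterwards); `build_options`, `transcribe_locally`, `to_srt` and `to_plain_text` are hypothetical helpers written for this illustration.

```python
from typing import Any, Dict, List, Optional

# Shape of the entries in openai-whisper's result["segments"] that we use:
# {"start": seconds, "end": seconds, "text": str, ...}
Segment = Dict[str, Any]


def build_options(language: Optional[str], translate: bool = False) -> Dict[str, str]:
    """Step 3 (hypothetical helper): decoding options for model.transcribe().

    language: a Whisper language code such as "en", "de" or "zh";
              None leaves Whisper to auto-detect the language.
    translate: True asks Whisper to output English text instead.
    """
    options = {"task": "translate" if translate else "transcribe"}
    if language is not None:
        options["language"] = language
    return options


def transcribe_locally(path: str, language: Optional[str], translate: bool = False,
                       model_name: str = "base") -> List[Segment]:
    """Steps 1-2 (hypothetical helper): run Whisper on this machine.

    Requires `pip install openai-whisper` and ffmpeg for decoding audio.
    """
    import whisper  # imported lazily so the helpers below work without it

    model = whisper.load_model(model_name)  # cached after the first download
    result = model.transcribe(path, **build_options(language, translate))
    return result["segments"]


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """Step 4: numbered SRT cues separated by blank lines."""
    cues = []
    for number, seg in enumerate(segments, start=1):
        start, end = format_timestamp(seg["start"]), format_timestamp(seg["end"])
        cues.append(f"{number}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)


def to_plain_text(segments: List[Segment]) -> str:
    """Step 4: one line of text per segment, without timestamps."""
    return "\n".join(seg["text"].strip() for seg in segments)
```

Editing fits between the two halves: call `transcribe_locally("meeting.m4a", language="en")` (the file name is a placeholder), correct any `segments[i]["text"]` by hand, then write `to_srt(segments)` to a `.srt` file. Everything after the first model download runs without a network connection.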

Tips for Better Transcriptions

Audio quality is the single biggest factor in transcription accuracy. Background noise, overlapping speakers, and low volume all produce more recognition errors, so use a directional microphone where possible and record in a quiet room.

For long recordings (over about 30 minutes), consider splitting the file into shorter segments and processing them separately. This can improve accuracy, and it lets you review the results segment by segment instead of waiting for the whole file to finish.

Whisper makes good use of context, so with clean audio it handles domain-specific vocabulary such as medical or technical terms well.

Before using a transcript for anything important, read it through from start to finish. AI transcription is highly accurate but not perfect: homophones and uncommon names of people and places may still need manual correction.
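The splitting tip can be done in any audio editor. As a minimal sketch, Python's standard-library `wave` module can cut an uncompressed PCM WAV file into chunks of at most 30 minutes each. `split_wav` and its `_partNNN` naming scheme are hypothetical, and compressed formats such as MP3 or M4A would need converting to WAV first (for example with ffmpeg).

```python
import wave
from pathlib import Path
from typing import List


def split_wav(path: str, out_dir: str, max_seconds: float = 30 * 60) -> List[Path]:
    """Split a PCM WAV file into consecutive chunks of at most max_seconds.

    Hypothetical helper. Returns the chunk paths in order; each chunk keeps
    the source's channel count, sample width and sample rate.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = max(1, int(params.framerate * max_seconds))
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:  # end of the source file
                break
            chunk_path = out / f"{Path(path).stem}_part{index:03d}.wav"
            with wave.open(str(chunk_path), "wb") as dst:
                # setparams copies the source header; wave rewrites the
                # frame count on close to match what was actually written.
                dst.setparams(params)
                dst.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Chunk `i` starts `i * max_seconds` seconds into the original recording, so adding that offset to each chunk's segment timestamps lines the pieces back up when you merge the transcripts. Fixed-length cuts can split a word at a boundary, which is worth checking when you review each segment.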