Automatic speech-to-text transcription

Evidence from a smartphone survey with voice answers

Authors
Jan Karem Höhne, Timo Lenzner, Joshua Claassen
Abstract

Advances in information and communication technology, coupled with the increasing use of smartphones in web surveys, provide new avenues for collecting answers from respondents. Specifically, the microphones of smartphones facilitate the collection of voice instead of text answers to open questions. Speech-to-text transcription through Automatic Speech Recognition (ASR) systems offers an efficient way to make voice answers accessible to text-as-data methods. However, there is little evidence on the transcription performance of ASR systems when it comes to voice answers. We therefore investigate the performance of two leading ASR systems, Google's Cloud Speech-to-Text API and OpenAI's Whisper, using voice answers to two open questions administered in a smartphone survey in Germany. The results indicate that Whisper produces more accurate transcriptions than Google's API. Both systems produce similar types of errors, but these errors are more common for the Google API. However, the Google API is faster than both Whisper and human transcribers.
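For illustration only, the minimal sketch below shows how a single voice answer could be transcribed with the open-source openai-whisper Python package, one of the two ASR systems compared in the study. It is not the authors' actual pipeline; the model size, file name, and language setting are assumptions.

# Minimal sketch: transcribing one voice answer with OpenAI's Whisper
# (illustrative only; model size and file name are assumptions, not the study's setup).
# Requires: pip install openai-whisper, plus ffmpeg available on the system path.
import whisper

# Load a pretrained Whisper model; larger models are slower but generally more accurate.
model = whisper.load_model("small")

# Transcribe a single audio file; fixing the language avoids auto-detection errors
# for German-language answers such as those collected in the survey.
result = model.transcribe("voice_answer.wav", language="de")

# The automatic transcription as plain text, ready for text-as-data methods.
print(result["text"])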

Organisation(s)
Sociology Department
External Organisation(s)
German Centre for Higher Education Research and Science Studies (DZHW)
GESIS - Leibniz Institute for the Social Sciences
Type
Article
Journal
International Journal of Social Research Methodology
ISSN
1364-5579
Publication date
01.01.2025
Publication status
E-pub ahead of print
Peer reviewed
Yes
ASJC Scopus subject areas
General Social Sciences
Electronic version(s)
https://doi.org/10.1080/13645579.2024.2443633 (Access: Open)