Human transcription vs speech recognition

Having provided transcription services to small businesses right through to government departments for more than 12 years, we’ve occasionally been asked the question ‘Which is best, human transcription or speech recognition software?’

Here we’ll explore the subject in a little more depth, in light of claims made by some organisations offering speech recognition software.

Which is best?

Well, the quick answer is that they both have their place, but there are significant differences in what they produce that are worth keeping in mind if you need some transcription work carried out.

How do ‘human’ transcription services work?

Obviously the clue is in the question. Transcription is carried out by expert transcribers (actual people), usually familiar with a particular area such as medical, legal or academic work, who listen to a recording.

This can be anything from a police interview to a legal conference or a market research focus group meeting. The transcribers then type the spoken words into a document, ascribing them to individual speakers (often 6 or more).

How does speech recognition work?

A recording is uploaded to a piece of computer software and then, using Speech to Text systems, the software generates a transcript directly from the recording.

Speech or voice recognition is offered in a couple of forms. One is off-the-peg software, which the user buys and then trains to recognise their voice, intonation, technical subject matter and vocabulary. The other is a service from transcription companies, who use bespoke software to create the first draft transcription and then employ an editor to finalise the transcript.

So which is best, speech recognition or human transcription?

There’s no doubt that speech recognition software has improved over the past few years, and it is a very speedy option, but having a human in the loop is far better in terms of the accuracy of the final result.

We’ve all had experience of Google Assistant, Siri or Amazon’s Alexa by now. Most of the time they get it right, but that’s not good enough for most commercial organisations, where an accurate record is crucial.

There are numerous instances where having that person in the loop is vital to ensure a meaningful transcript is produced. These include:

multi-speaker transcription (lots of over-speaking/interruptions, with anything from 2 to 20 speakers which computers are foxed by but the human brain can work through);
speakers located on different sites, for example conference calls;
poor quality recordings (not just background noise but interference and muttering/mumbling/paper rustling);
strong accents which are new to the software;
context of terms which a transcriber is able to interpret more accurately;
homophones (similar-sounding words that are spelled differently), which can produce the wrong meaning;
where there is a requirement for extensive formatting as well as producing the spoken text.
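
To put a rough number on points like the homophone problem, one common yardstick is the word error rate: the number of word-level corrections needed to turn a draft into the correct transcript, divided by the length of the correct transcript. The snippet below is only a minimal sketch in Python. The two sentences are invented for illustration and are not real software output, and the function name word_error_rate is our own rather than part of any particular product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edits needed to turn the reference words seen so far
    # into the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the draft
                curr[j - 1] + 1,     # extra word in the draft
                prev[j - 1] + cost,  # matching or substituted word
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: what was said, and a hypothetical draft with homophone slips.
reference = "the council will waive the fee for their site"
machine_draft = "the council will wave the fee for there sight"

wer = word_error_rate(reference, machine_draft)
print(f"{wer:.0%} of words differ")
```

In this made-up case three of the nine words are wrong (wave, there, sight), so a draft that sounds identical when read aloud still needs a third of its words corrected.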

There may well be a place for speech recognition transcription, but voice or speech recognition still relies largely on the software being ‘trained’ to recognise the nuances of a particular individual’s speech.

This offers some scope when transcribing recordings of general, single-speaker dictation and some presentations (which will then need to be checked/edited), where the author has ‘trained’ the software to work with their voice.