The RSI Sound Myth-Buster: Ten Misconceptions that Result in RSI Sounding Terrible
“Better” is the enemy of good (Voltaire)
True. Except when what you call “good” is harmful and “better” is well within reach. (Yours truly)
Poor sound has proven to be one of the biggest nightmares in the videoconferencing and Remote Simultaneous Interpreting (RSI) setting. It makes listening unpleasant, causes meeting participants to tune out (bad sound causes listening fatigue) and makes simultaneous interpreting an arduous and hazardous business.
Poor sound undeniably hampers the interpreter’s performance. Evidence gathered by various studies places poor sound at the top of the list of suspects when it comes to the recent, major surge in hearing problems among conference interpreters (links), including debilitating and career-ending hearing conditions. Published scientific papers (link) show that similar issues are not uncommon in other professions exposed to poor-quality sound over headphones (e.g. call-centre workers), even when the use of peak limiters and compressors makes sudden peaks of loud noise mathematically impossible. Conversely, an uncommonly high incidence rate of similar issues is not found among categories of professionals who are exposed to reasonable levels of high-quality sound over headphones (radio anchors, voice actors etc).
Videoconferencing and RSI do not need to sound artificial, robotic, tinny and heavy on the ear. Current technology and average internet connections already allow the transmission of decent image and, above all, pristine, radio-quality sound. So when your remote speakers sound like this: (link) instead of like this: (link) it simply means that your remote event is not being organized and run properly with trained staff and the right equipment.
As will be shown below, technology is not the real hurdle: the problem is of a much more human and organizational nature. What follows is a list of widespread misconceptions that stand in the way of interpreters getting the sound they need and deserve to stay healthy and deliver satisfactory quality to their listeners:
1) “Sound is good when I can understand words well or at least well enough, and I don’t miss any chunks of information”.
False. Sound is good when it is natural, and when listening is pleasant and completely effortless. When your feed sounds artificial but still remains intelligible and can be interpreted by making some sort of “extra effort”, a warning alarm should go off in your head. Even when performed on perfect sound, simultaneous interpreting is the auditory equivalent of walking a tightrope, as – unlike other people – interpreters have to understand their source while they are generating interference with their own voices. Speakers typically sound “artificial” in the RSI setting because platforms save on bandwidth and server costs and create an environment where sound engineers are no longer necessary. Substandard microphones are allowed into the circuit and speakers are permitted, if not outright encouraged, to take the floor from noisy environments using whatever device they have available.
In order to make all of this possible, RSI sound usually conveys a heavily reduced and processed portion of the original input, that is, the frequency content naturally present in the timbre of a human voice. However, to manage multiple audio streams at the same time, interpreters need to rely on the richness and redundancy of a natural sounding voice. The way this works is explained here (scroll down to “On goes the microphone: let the Cocktail Party begin” to jump to the relevant section).
This is the main reason why simultaneous interpreting needs to happen within a strictly controlled environment. When exposed to typical videoconferencing / RSI sound, interpreters put both their sensory and their cognitive systems under a great deal of pressure: conference interpreting turns into telephone interpreting and becomes the equivalent of walking a tightrope on high heels while juggling burning torches. As such, the type of damage that far too many colleagues are experiencing should come as no surprise.
The question therefore needs to change from “can I understand it?” to “does this sound natural?”. Sound is good (and harmless) when you can close your eyes and say “Yes, this is what a real human voice would sound like”.
2) “When sound isn’t good, that’s because the speaker has a slow connection”
This is a narrative of convenience. When an Ethernet cable is used, the average home connection in developed countries is powerful enough to receive and broadcast both high quality video and sound, because sound does not use up much bandwidth. Video does. Tripling the amount of bandwidth usually allocated to sound by videoconferencing/RSI platforms would only result in a very minor increase in the amount of data being transmitted.
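To put a rough number on that last claim, here is a back-of-the-envelope sketch. The two bitrates are illustrative assumptions of mine, not figures published by any platform: swap in your own numbers and the arithmetic stays the same.

```python
def relative_increase(audio_kbps: float, video_kbps: float, audio_factor: float) -> float:
    """Return the fractional growth of the whole stream when only the
    audio share is multiplied by audio_factor."""
    before = audio_kbps + video_kbps
    after = audio_kbps * audio_factor + video_kbps
    return (after - before) / before


# ASSUMED figures, for illustration only (not taken from any platform's specs).
AUDIO_KBPS = 32.0     # hypothetical "speech optimized" audio stream
VIDEO_KBPS = 1500.0   # hypothetical webcam video stream

increase = relative_increase(AUDIO_KBPS, VIDEO_KBPS, audio_factor=3.0)
print(f"Tripling audio grows the total stream by {increase:.1%}")
```

Under these assumed figures, tripling the audio allocation makes the whole stream grow by roughly 4%. The exact percentage depends entirely on the bitrates you plug in, but as long as video dwarfs audio the result stays small.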
Full scale experiments conducted at major international organizations have clearly confirmed this. Where no other infrastructure is available, 4G can manage even higher upload and download speeds than many home connections, so it is rather difficult to believe that connectivity is the real issue here. Why then are RSI platforms almost invariably telling users that their connections are too weak? And why does this happen even in places where professional, corporate subscriptions guarantee huge download and upload speeds?
An educated guess would be that offering better sound would mean platforms having to process additional data to achieve a result that IT developers and marketing people don’t really regard as necessary (sound is already “good enough”, it’s “speech optimized” or it’s “more than enough to understand speech”). Providers (not users!) would need to allocate additional bandwidth / server / computing power (which generates additional costs) and this is probably not a particularly palatable option for SIDPs. Why make a bigger organizational and financial effort if you can sell it the way it is (clients listen through their phone or computer speakers and aren’t particularly “fussy” anyway)? And you can simply tell work-starved interpreters that people have weak connections, and that narrow-band headset sound is the gold standard they should aspire to.
3) “In order to improve sound, speakers should use a USB headset with a boom microphone”
Absolutely false. 99% of USB headsets come with low quality microphones and onboard sound cards that heavily process their input. They are designed to be used for low-quality telephony applications on videoconferencing platforms that will typically not broadcast full-band quality sound. Why manufacture a Mercedes if users are going to drive it down a bumpy, unpaved country road? Professional headsets with boom microphones cost hundreds and usually come with connectors that would be too complex for the average home setup. They need professional interfaces, pop filters and very accurate placing, as a microphone positioned close to the mouth will pick up all sorts of annoying plosive and breathing sounds, and will even make scratching noises against bearded cheeks. A boom microphone in the hands of an unassisted, inexperienced user is almost invariably a badly positioned microphone. So, if you expect people to simply put it on and use it, what you have to do is process its input heavily and remove a lot of the signal. The result sounds artificial, robotic, often very sharp and heavy on the ear, especially when fed into a platform that reprocesses its input.
Why are platforms recommending USB headsets then? RSI platforms are run by software developers and marketing people and do not necessarily have sound engineers on their payroll. If I were to adopt the perspective of a developer working for a small/mid cap company I would probably think that if the microphone input is already narrow-band, and what I consider to be useless information is filtered out, then less data is fed into the platform at the source. Data equals bandwidth, and bandwidth is a cost. Yet I have failed to consider whether the lost information is universally useless.
A much better option to obtain satisfactory quality sound is a USB tabletop microphone (excellent solutions are available here starting from 50$ / 60€). These microphones are designed to produce radio / podcasting quality and their onboard sound cards do not overprocess sound.
4) “A headset with a boom microphone is always better than no headset at all”
It really depends. What better USB headsets provide compared to bad laptop microphones is higher intelligibility. But as shown above, intelligibility does not necessarily mean safety or quality. Intelligible can still be harmful. When a headset mic is used, you might be able to soldier on through a presentation with less cognitive effort than when your speaker is using a really bad laptop microphone, but this will just help you tolerate a higher amount of toxic sound for longer, thus increasing your exposure. Moreover, many telephones / tablets and high-end computer mics (especially Apple) perform better (and process less) than a lot of USB headsets. Which means that neither integrated mics nor USB headsets are a viable solution for RSI. Tabletop microphones and selected clip-on microphones are statistically a much better option.
5) “Tabletop and lapel microphones can cause trouble if misused. USB headset mics are more reliable”
Tabletops and lapel mics can be misused like all other microphones, but when used properly they are the only proven way to deliver a rich, natural signal. Exactly like in the conference room, speakers talking too close or too far away from the microphone need to understand that they are doing it wrong. Exactly like in the conference room, no paper documents or other objects should come between their mouth and their microphone. Exactly like in the conference room, any background noise will be picked up by the speaker’s microphone. Speakers who can follow simple instructions will be able to manage tabletop and lapel microphones correctly. Speakers who cannot follow simple instructions are likely to mismanage all sorts of microphones, including conference room goose-necks.
No matter how it is used, a USB headset mic will almost invariably deliver a heavily processed, artificial signal, because as shown above, a USB headset mic is almost by definition a poorly placed microphone. That is the reason why it comes with sound-processing onboard electronics in the first place, while tabletops / lapels do not.
6) “But hi-fi platforms and quality microphones are not idiot-proof”
No they are not. But neither are telephone-quality platforms and bad microphones. Is the expectation of having a fully plug-and-play, completely “idiot-proof” solution that guarantees quality, trouble-free videoconferencing (and simultaneous interpreting) even from the middle of the road legitimate?
Quality is achieved if sound checks are run in advance for every speaker and if microphones are properly configured with the help of remote sound engineers / moderators who actually know what they are doing. Simultaneous interpreters are dependent on sound quality like trout are dependent on cold, clean and clear water. Quality also requires a sufficient and, above all, effective use of human resources. Videoconferencing is in general a much more difficult environment than in-person meetings, so expecting solutions that will work hassle-free and out-of-the-box in any situation with only minor human intervention is unrealistic.
7) “Convincing speakers to use a headset is already difficult enough, so asking them to use a proper microphone is impossible”
A logical fallacy. Nobody really wants to look and sound like a call-centre worker (no offense intended) on camera while addressing a conference that will likely remain on YouTube for the next 10 years. The notion that speakers will drag their feet when offered an unobtrusive device that makes them both look and sound professional (like they would on a TV or radio show), and would rather opt for a USB headset if they really have to, is beyond all logic and understanding, but it’s a mantra I get a lot from USB headset prophets. A headset with a boom microphone is also not necessarily something most speakers will have available at their home / office, and the less horrible-sounding models will cost as much as (or even twice as much as) a tabletop microphone with impressive performance (link: sample acquired over Zoom Hi-Fi). Practical experience shows that when offered a choice between a headset with a boom mic and a tabletop / lapel microphone, very few speakers opt for the headset solution.
But the biggest problem behind this misconception is the unquestioned assumption that speakers should be the ones in charge of sourcing their own peripherals. If we want RSI to sound good enough to allow safe and good quality simultaneous interpreting going forward, speakers must be provided with the right equipment by organizers and platforms. Decent lapel microphone solutions start from as little as 30 € and fit in an envelope. Getting a parcel containing a small, 50$ plug-and-play USB tabletop microphone to the speaker’s location anywhere around the world and devoting 10 minutes to remote configuration support cannot be considered an insurmountable problem in 2021. Much more complex and expensive logistical efforts are usually made when organizing multilingual in-person meetings. There is a huge difference between “technically or logistically impossible” and “we simply cannot be bothered with getting the right equipment, you just get used to it and do your best” or “not 100% compatible with a low-cost business model based on the delusion that quality equipment is no longer needed and nobody will ever notice the difference anyway”. These attitudes have never been considered compatible with acceptable working conditions before, let alone by AIIC.
8) “Noise canceling is crucial if you want decent quality sound on the internet”
Nothing could be further from the truth. No algorithm on earth will remove annoying background noise from an audio feed without significantly affecting the quality of the signal. “Clean” sound does not mean you no longer get to hear any background noise: it means you get to hear a pristine, natural-sounding representation of whatever is picked up by a decent microphone. Any expert will tell you that active noise-canceling is utterly incompatible with professional, hi-fi sound.
When they get an opportunity to try a good sounding videoconference on a clean platform, many colleagues wake up to the realization that if sound is rich and natural, background noise is usually not as annoying as they would expect.
But in the typical RSI model, where noise canceling is aggressively performed by both headsets (or many integrated computer mics) and the platform, the resulting signal is particularly poor (lots of missing frequency information) and muffled. At that point, any noise still making it through ends up being a much bigger nuisance than it should because:
a) softer components get artificially pumped up by automatic gain control algorithms and become particularly disruptive; and
b) when you are struggling to keep a natural and therefore full-spectrum signal (your voice) from overpowering a muffled and heavily processed signal (your RSI feed), any additional noise becomes unbearable no matter how small it might be.
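Point a) is easier to picture with a toy example. The sketch below is a deliberately naive, frame-by-frame automatic gain control that I made up for illustration: the target level, the gain cap and the test signals are all assumptions, and the algorithms actually run by platforms and headsets are not public and will certainly be more sophisticated. The pumping effect it shows is nevertheless what any level-normalizing stage tends to do.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def agc_gain(frame, target_rms=0.1, max_gain=20.0):
    """Naive AGC (hypothetical): the gain that pulls this frame's level
    toward target_rms, capped at max_gain. Silent frames are left alone."""
    level = rms(frame)
    if level == 0.0:
        return 1.0
    return min(target_rms / level, max_gain)


SAMPLE_RATE = 16000  # assumed sample rate, in Hz
FRAME_LEN = 1600     # 0.1 s per frame

# Two made-up frames: a loud voice-like tone and a faint background hum.
speech = [0.2 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(FRAME_LEN)]
hum = [0.01 * math.sin(2 * math.pi * 50 * n / SAMPLE_RATE) for n in range(FRAME_LEN)]

speech_gain = agc_gain(speech)
hum_gain = agc_gain(hum)
print(f"gain applied while the speaker talks: x{speech_gain:.2f}")
print(f"gain applied to the hum in the pauses: x{hum_gain:.2f}")
```

With these made-up signals the loud frame is turned down slightly while the faint hum in the pause is amplified more than tenfold. In other words, the quieter the noise, the harder a level-normalizing stage pushes it toward the same loudness as the voice.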
Yet we are bombarded by claims that you desperately need noise-canceling to be able to interpret people who join a meeting while a vacuum cleaner is being used in the same room, dogs are barking, ambulances are passing by and loud construction work is being carried out outside the speaker’s open window. While none of this is impossible, any similar situation would be an extreme nuisance for both the speaker and the other participants, and the idea of having to “interpret it anyway” while everybody else struggles to hear it is hardly compatible with the notion of interpreters being highly-skilled and above all self-respecting professionals. As a matter of fact, these situations are not particularly frequent, and the price for being able to soldier on through a few isolated incidents by means of aggressive noise canceling algorithms is having to struggle with muffled sound the rest of the time.
In reality, background noise is probably more of a nuisance for platforms than it is for interpreters. Harmless noise is “superfluous” information that codecs need to encode and broadcast. Noise removal at headset mic level proactively removes sizable chunks of the audible spectrum where noise (but also precious voice signal) can show up; it also reduces the frequency content of the chunks that get broadcast, resulting in… you guessed it: lower bandwidth and server costs.
Moreover, recently published research would seem to indicate that background noise is an obstacle for the human-machine interface: given that a number of RSI platforms are known to be using your output to train interpreting machines, computer algorithms might be hampered by background noise much more than human beings. Unlike interpreters, speech-to-text algorithms appear to like processed, muffled, telephone quality.
9) “Hi-fi quality is for music lovers, not for interpreters. We are processing words, not music”
Hi-fi means a high-fidelity reproduction of the original sound. The human voice produces much more information than the bare minimum needed to understand speech in an otherwise silent environment. Voice is a multilayered, redundant signal where the same information is repeated over and over again on different levels (harmonics) and our ears harness this redundancy whenever we are required to perform a difficult auditory task involving background noise or multiple signals building up a complex soundscape.
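A rough way to see how much of that layered signal a narrow-band channel throws away is simply to count harmonics. The sketch below assumes a voice with a 120 Hz fundamental (an illustrative value of mine) and compares the classic narrow-band telephony passband of roughly 300 to 3400 Hz with the nominal 20 Hz to 20 kHz range of human hearing. It is pure arithmetic: it says nothing about how much energy each harmonic actually carries in a real voice.

```python
def harmonics(f0_hz: float, max_hz: float):
    """Return the harmonic series f0, 2*f0, 3*f0, ... up to max_hz."""
    count = int(max_hz // f0_hz)
    return [f0_hz * k for k in range(1, count + 1)]


def count_in_band(freqs, low_hz: float, high_hz: float) -> int:
    """Count how many frequencies fall inside [low_hz, high_hz]."""
    return sum(1 for f in freqs if low_hz <= f <= high_hz)


F0_HZ = 120.0                   # ASSUMED fundamental, for illustration only
FULL_BAND = (20.0, 20000.0)     # nominal range of human hearing
NARROW_BAND = (300.0, 3400.0)   # classic narrow-band telephony passband

series = harmonics(F0_HZ, FULL_BAND[1])
full_count = count_in_band(series, *FULL_BAND)
narrow_count = count_in_band(series, *NARROW_BAND)
print(f"harmonics available in the full band:   {full_count}")
print(f"harmonics surviving in the narrow band: {narrow_count}")
```

For this assumed voice, 166 harmonics fit in the full band and only 26 of them survive the narrow band, which also drops the fundamental itself. Speech remains intelligible on those 26, but most of the "repeats" the ear can fall back on in a difficult listening task are gone.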
Simultaneous interpreting clearly qualifies as a difficult auditory task. People who lose their ability to hear high and very high frequencies (the part of the auditory spectrum that provides redundancy) struggle to process speech when concomitant sounds are present (read this to find out more). Spectral complexity is not just for the pleasure of demanding music lovers. It is a non-negotiable requirement for the performance of simultaneous interpreting. People who are forced to overspecialize their listening behavior in order to compensate for loss of high and very high frequency tend to develop hyperacusis. (published paper here)
10) “Ok, you win, maybe it can be done on Zoom Hi-Fi but Zoom is not an RSI platform. RSI platforms are much more complex and cannot give you hi-fi quality”
Zoom currently accounts for a huge portion of online events and is used by many international organizations. Its Hi-Fi function, which speakers can quickly activate, works well but appears to have gone unnoticed in the language services industry. Interestingly, even WebEx has recently introduced a “music mode” which sounds better than regular WebEx (when a decent microphone is used) but still does not compete with Zoom Hi-Fi. Big players are smelling the trend. Nobody wants to be listening to robotic sound for hours. The gaming platform Twitch has also been offering and promoting high-bitrate audio for a while, as a way of keeping streams entertaining and preventing viewers from tuning out. Skype has already made an option available to deactivate background noise removal, although it hasn’t begun offering a “music mode” yet.
RSI platforms with hi-fi or near hi-fi releases are already being tested and used by some international organizations. When decent microphones are used, these releases deliver good quality audio both on the floor and on over 20 different interpretation channels, with almost zero connection crashes and some packet loss due to poor wifi. But all that glitters is not gold. There are also known cases of platforms that claim they can offer Hi-Fi quality (if your connection is good enough, if the speaker has a good connection and is using a headset with a boom mic etc), or that even claim they can “improve” incoming feeds if they are not good enough. Quality can only be preserved, not improved; it cannot be produced by USB-headset mics and can only depend on connection quality in places where streaming a YouTube video is also a problem.
Can a hi-fi platform giving interpreters radio-quality sound guarantee a better-than-in-the-room interpreting experience at all times? It probably cannot. But it is a safer and less frustrating tool, and one more conducive to a decent, professional output.
(originally published on the AIIC blog)
Andrea CANIATO is a voice researcher/consultant, a certified voice trainer (Applied Physiology of the Voice) and an EU-accredited conference interpreter.