Remember that time, way back in the early 2000s, when Ozzy Osbourne got angry with his new BMW because its voice-control system couldn't understand him?
Today, we are all Ozzy – constantly rephrasing and e-nun-ci-ating to try and get Siri, Cortana, Alexa or whoever to understand us. While speech-controlled systems can be okay in a quiet environment, when they’re fighting against the background noise of tyres, engines and other people talking, it all goes to pot. Experts in the field regularly quote a – possibly apocryphal – tale of someone sending a speech-to-text message which incorrectly informed a family member that their mother had arrived, dead, at her destination.
Help has arrived in the shape of Liopa (Irish for lips), a Belfast-based start-up born out of research at Queen’s University Belfast. The company is developing software and systems that read your lips, a technique that is unaffected by background noise and can dramatically improve the accuracy of speech-controlled systems.
“We’re set up to commercialise some research that had been done in Queen’s, around eight to 10 years of research, on what we call ‘viseme’,” says Liopa chief executive Liam McQuillan, who is a telecoms veteran. “That is the way your lips move when you speak. They’ve found that someone’s lip movements are very speaker-specific, much in the way that your fingerprints are, so the idea was to develop this technique as a way of online user verification. So anywhere you could train a camera on someone’s face and get them to say something, you could track their lip movement, once they’re pre-enrolled, and you can compare that movement against a challenge phrase or a series of numbers to be read out.”
False matches
The initial work was in turning the lip-reading system into a security protocol, but that plan ran up against two hurdles: the fact that the biometric security market is already crowded, and that the consequences of getting a false positive result could be disastrous. As McQuillan points out: “You make a false match and someone can get into your bank account. So for the last nine months we kind of pivoted a little, and started to develop the speech recognition from where it is today, which is digit recognition, through to where it’s at limited vocabulary or menu control and eventually on to natural speech.
“So where we see this really playing is as a supporting technology to audio speech recognition. All the big guys are investing heavily in that area – tens of millions into personal assistants, such as Cortana, Siri, Amazon with Alexa, all of which are based on audio speech recognition.
“We see the car thing as a good initial use case, where we can train a camera on the driver’s face. Today it’s an RGB camera, but we’re developing an infrared one to take the outside illumination out of the equation. So where there’s a lot of background noise, road noise, engine revs, winding down the window, as long as you can see the driver’s face and read their lip movements you can combine the two techniques, and our software will improve audio speech recognition platforms.”
The idea is to combine the audio and visual inputs to create a more accurate system, as Fabian Campbell-West, Liopa’s head of research and development, explains: “A lot of the audio command-and-control stuff is fairly clear. Where it can be difficult at the moment is when you try and dictate a number, for instance. So what we’re focusing on right now are the commands that control the in-car environment, and we’re confident we can get very good accuracy with those.”
Colloquial speech
What the system can’t do well, yet, is deal with people using “free speech” or speaking colloquially. Right now, the Liopa system first has to be trained on your lip movements, and then requires you to stick to a specific vocabulary and grammar.
“Where everyone wants to get to is free speech, which we’re not at yet,” says Richard McConnell, Liopa’s chief operating officer and technical officer. “You create what’s called a model, so you train the system to your lip movements. Then there’s the fine vocabulary and grammar model. Free speech is definitely on our road map though. The key thing for us is that in a noisy environment we’re getting 70-80 per cent accuracy at a point where the audio is almost unusable. The research here has identified that there’s about 30 specific visemes, and the more combinations of those that you capture the better.”
There are two aspects to the system that will be exceptionally tempting to the car makers who may become Liopa’s customers. First, the technology needs only a decent camera which can see the driver’s face, and a bit of software, all of which McConnell says can be readily accommodated by existing onboard systems: “At the minute it just needs an ordinary camera, and the computing power required we think is quite small. There’s enough compute power in a car already to deal with what we need. And when you add things like gesture control, it all helps.
“The granularity of image that we’re looking at only needs to be 16x16 pixels, but we can add accuracy with infrared cameras and depth sensors, a bit like the Kinect system in a gaming console. But generally there’s no special hardware required. We are moving towards using AI and neural network-type systems; but, again, many cars have systems which can already cope with that.”
Second, there’s a chance for car makers, and their software and interface designers, to catch up with the likes of Apple and Google’s Android. With the introduction of Apple CarPlay and Android Auto, the two smartphone software giants are increasingly taking over the space within the car, and many vehicle manufacturers are known to be concerned over the transfer of user data and experience. Developing a superior voice-control system could be a useful hook in getting drivers to switch out of CarPlay or Android Auto and back to the proprietary vehicle software.
“We’ve met with a number of car companies, and Toyota was especially keen on it,” says McQuillan. “So a couple of months ago we closed a pre-seed round, which takes us up to the summer and allows us to develop an in-car system, and then we’ll go out on a proper funding round, and we’ll be talking to the car manufacturers.”
Given the enormous concerns over driver distraction and accidents caused by phone use behind the wheel, anything that can make a voice-controlled system easier and more accurate to use, and which might keep drivers away from their keypads, will surely snag the interest of car makers. And possibly a few long-haired rock gods too.