Machines have already surpassed humans at mundane tasks such as number crunching and filing, but now they can tell what you are saying just by watching you speak. A team of UK researchers has created a computer program that can lip-read more accurately than human experts.
Developed at the University of Oxford, the software, called LipNet, is able to work out what people are saying more than nine times out of 10. While the average accuracy of an experienced lipreader is around 52 per cent, LipNet managed to hit 93.4 per cent. LipNet works by using a neural network to map speakers' mouth movements to a database of set sentences.
During the training phase, the AI software was shown video footage of people uttering a series of scripted commands, including cryptic phrases such as ‘set blue by A four please’. By breaking the footage down frame by frame, LipNet was able to match the movements of the speakers' mouths to the known commands.
In tests, almost 29,000 videos of two men and two women were used to train the AI, and the results were compared against the success rate of three hearing-impaired people who can lipread. The project, which received funding from Google's DeepMind among others, surpassed the previous milestone for machine lipreading accuracy of 79.6 per cent.
But the team says its goal is to train the system on real-world examples, with Yannis Assael, one of the researchers on the project, writing that ‘performance will only improve with more data’. In their paper, posted on the online arXiv server, the researchers say the technology could have huge potential.
They explain: ‘Machine lipreaders have enormous practical potential, with applications in improved hearing aids, silent dictation in public spaces, covert conversations, speech recognition in noisy environments, biometric identification, and silent-movie processing.’
Allaying fears that the software could be used by 'Big Brother' to monitor conversations through the UK's extensive network of CCTV cameras, Mr Assael told The Mirror: 'LipNet has no application in the world of surveillance, simply because lipreading requires you to see the subject’s tongue - meaning that the video has to be straight on and well-lit to get a good result.'
Lipreading is an impressive feat, expanding the repertoire of 'vision'-based learning functions of machines. Machine learning - learning by example - can help computers get a good idea of what an image, situation or the like means, even if they are encountering it for the first time.
Researchers at Google's DeepMind recently boosted this capability by endowing machines with memory to develop 'one-shot learning'. In this way, systems were able to recognise an object after seeing it only once. Such steps seem small, but could lead to much faster training of AI systems, pushing the field forward at an even greater pace.