Remotely Recording Speech and Turning it into Searchable Text with Metasploit & Watson

Technology has made some amazing advances in the past few years, and it makes you wonder what computer security will look like in the future. For example, how cool would it be to remotely turn on a microphone and record what is said, then process the recorded speech, turning it into searchable text and scanning it for keywords like “Password” or “Social Security Number”?

What if I said you can do that right now?

Well, you can!

Thanks to some amazing work by AT&T Labs and “Sinn3r” from the Metasploit development team, you can now take any .wav file that contains spoken words and search it for keywords related to things like account information and passwords.

AT&T Labs has opened up its “Watson” speech-to-text technology to the public, releasing a development SDK and API so programmers can add speech recognition to their products. With a proof-of-concept script written by Sinn3r from Rapid7, you can now add speech-to-text capability to Metasploit!

How does it work?

Amazing!

I will cover it in deeper detail in a following post, but here is a quick walk through:

I had a “target” system connect to my “attacker” Backtrack 5r3 box, which was running a Java exploit. Once the target Windows 7 system (fully patched and updated of course, with AV protection enabled) ran the backdoored Java, I had an open session with it.

Next, I ran the “record_mic” command to remotely turn on the microphone and capture any audio within range of the target system.

Finally, I fed the resulting .wav file into the sound analyzer script, which converted the audio to text and searched it for keywords.
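
To give a rough idea of what the keyword search step involves, here is a minimal Python sketch. It is not Sinn3r's script, and it does not call the AT&T API. It assumes the speech has already been converted to plain text, and the `find_keywords` name, the keyword list, and the sample transcript are all made up for illustration.

```python
import re

# Hypothetical keyword list, chosen for illustration only.
KEYWORDS = ["password", "social security number", "account"]


def find_keywords(transcript, keywords=KEYWORDS):
    """Return (keyword, context) pairs for each keyword found in the transcript."""
    hits = []
    for keyword in keywords:
        # Whole-word, case-insensitive match; any run of whitespace may
        # separate the words of a multi-word keyword.
        words = [re.escape(word) for word in keyword.split()]
        pattern = r"\b" + r"\s+".join(words) + r"\b"
        for match in re.finditer(pattern, transcript, flags=re.IGNORECASE):
            # Keep up to 30 characters on each side so the hit can be read in context.
            start = max(match.start() - 30, 0)
            end = min(match.end() + 30, len(transcript))
            hits.append((keyword, transcript[start:end].strip()))
    return hits


if __name__ == "__main__":
    sample = "okay my Password is one two three four"  # made-up transcript
    for keyword, context in find_keywords(sample):
        print(f"[{keyword}] ...{context}...")
```

Matching on whole words keeps the results tidy, but it also means a slightly wrong transcription (say, “passwords” or “pass word”) would be missed, so a real tool would probably want looser matching.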

Did it find anything?

Of course! It correctly scanned the file and noticed that the word “password” was mentioned.

Okay, it wasn’t 100% correct. The password I used was four numbers, followed by a dash and four more numbers. The AT&T program mistook it for a phone number and tagged it as one, dropping the first number. I also said “secured” instead of “picture” at the beginning of the line.

AT&T tagged the transcription with a confidence level of .48, which means the program was about 50% confident that it had the right transcription. That seems about right.
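
As a small illustration of how a score like that could be put to use, the Python sketch below flags low-confidence transcriptions for a manual listen. The function names, the 0.6 cutoff, and the sample text are my own assumptions and are not part of the AT&T API or Sinn3r's script. The only thing taken from above is that the score sits on a 0 to 1 scale.

```python
def needs_review(confidence, threshold=0.6):
    """Return True when a transcription score falls below the chosen cutoff."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence is expected to be between 0 and 1")
    return confidence < threshold


def describe(text, confidence, threshold=0.6):
    """Format a transcription with its score and a review flag."""
    label = "REVIEW" if needs_review(confidence, threshold) else "OK"
    return f"[{label} {confidence:.0%}] {text}"


if __name__ == "__main__":
    # Made-up transcription text paired with the .48 score from this test.
    print(describe("my password is", 0.48))
```

With a cutoff like this, the recording from this test would be flagged, which fits the mistakes described above.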

Even so, this technology is AMAZING! Think about it: during the process, a voice was captured live from a remote system, turned into text, and then analyzed for keywords. Without any of the “voice training” that so many voice programs need, Watson deciphered the .wav file pretty accurately and gave us usable output.

We will take a much closer look at this in the next few posts. There were a few hurdles to overcome in getting the script to run on Backtrack 5r3, so I will create a step-by-step tutorial. We will even look at some other uses for the technology.

Awesome job AT&T, Sinn3r and the Metasploit development team!