Saturday, November 06, 2010

Language

Although I use Google Translate all the time, it's a shame that Google is still pursuing an approach to natural language processing that should have been abandoned in the late 1950s, after the publication of Syntactic Structures.

Google is stuck on the idea that a corpus, minced through a decision procedure, can somehow provide automated natural language capabilities, before we've successfully understood the biology of language! It's like assuming that lots of observations of moving objects will somehow provide you with a theory of gravitation.

As a result, Google Translate just doesn't work. The reason it's still useful is that we have this rich biological mechanism available to us, mostly subconscious, that can make corrections and fill in the gaps.

Type anything into Google Translate, then translate the result back. It's very rare that you'll get anything equivalent to your original. For me, it has something like a 99% failure rate.

So I type:

I wonder if this will ever work?

Google Translate renders this into French:

Je me demande si cela va fonctionner?

... which is already wrong (the "ever" has been dropped), but to complete the exercise, switch the translation and you get:

I wonder if this will work?
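
For anyone who wants to repeat the exercise on their own sentences, here is a minimal sketch of the round-trip check in Python. The `translate` function is a hypothetical stand-in, not Google's API: it only looks up the two results quoted above, so you would have to swap in a real translation call to test anything else.

```python
# Canned results copied from the example in this post. A real test would
# call a translation service here; this lookup table is only a stand-in.
CANNED = {
    ("en", "fr", "I wonder if this will ever work?"):
        "Je me demande si cela va fonctionner?",
    ("fr", "en", "Je me demande si cela va fonctionner?"):
        "I wonder if this will work?",
}

def translate(text, source, target):
    """Hypothetical translator: returns the canned result for this post's example."""
    return CANNED[(source, target, text)]

def round_trip(text, source="en", target="fr"):
    """Translate into the target language, then back into the source language."""
    forward = translate(text, source, target)
    return translate(forward, target, source)

def lost_words(original, returned):
    """List the words of the original that are missing after the round trip."""
    kept = set(returned.lower().rstrip("?").split())
    return [w for w in original.lower().rstrip("?").split() if w not in kept]

original = "I wonder if this will ever work?"
returned = round_trip(original)
print(returned == original)            # did the sentence survive the round trip?
print(lost_words(original, returned))  # which words went missing?
```

An exact string comparison is a crude test of "equivalent", of course; it is only meant to make the failure easy to see.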

Don't get me wrong: Google Translate is a useful tool. But honestly, Google, you cannot get from here to real automated translation if you're relying on the techniques of statistical analysis. You need a real computational theory of language, a device that, for language L, to quote Chomsky in 1956, "generates all the grammatical sentences of L and none of the ungrammatical ones". This is an incredibly tough problem: there are no statistical shortcuts. It cannot be done unless you start to keep up with the biologists of language. These linguists have made much progress over the last 54 years.
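
To make the phrase "generates all the grammatical sentences of L and none of the ungrammatical ones" concrete, here is a toy sketch in Python. The grammar and its vocabulary are invented for illustration and say nothing about how any real language works; in particular this L is finite, which is why it can simply be enumerated, whereas natural languages are not.

```python
from itertools import product

# A made-up grammar for a tiny language L (illustration only).
# Keys are categories; each rule is a sequence of categories or words.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["Vt", "NP"], ["Vi"]],
    "N":  [["dog"], ["cat"]],
    "Vt": [["sees"], ["chases"]],
    "Vi": [["sleeps"]],
}

def generate(symbol="S"):
    """Return every word sequence the grammar derives from `symbol`."""
    if symbol not in GRAMMAR:  # not a category, so it is a plain word
        return [[symbol]]
    results = []
    for rule in GRAMMAR[symbol]:
        # Combine every expansion of each symbol on the rule's right-hand side.
        for parts in product(*(generate(s) for s in rule)):
            results.append([word for part in parts for word in part])
    return results

# The whole language: every sentence the device generates, and nothing else.
LANGUAGE = {" ".join(words) for words in generate()}

def is_grammatical(sentence):
    """A sentence belongs to L exactly when the grammar generates it."""
    return sentence in LANGUAGE

print(len(LANGUAGE))
print(is_grammatical("the dog chases the cat"))
print(is_grammatical("dog the chases cat the"))
```

Nothing here is learned from a corpus: what counts as a sentence of L is decided entirely by the rules.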