Code Release: Language Detection and Translation

Google Language Python Module

By Jon Beilin

One of the tenets of Web Ecology is making the field accessible through open tools and open data. At the Web Ecology Project, we’re working to get more of our code into a clean, commented, and releasable state. The first tool queued up for release is a Python module that allows easy use of Google Language Tools. It supports language detection and translation, with transliteration in an experimental state: Google has not yet released the API spec for transliteration, so that portion was reverse-engineered.

Now for some sample uses of the tool:

>>> from googlelanguage import *

>>> print lang_detect("this is a sentence in English")
{'isReliable': True, 'confidence': 0.31734600000000002, 'language': 'en'}

>>> print lang_translate("comment dit on 'WebEcology' en francais?", dest_lang="en")
{'translatedText': "how it says 'WebEcology' in French?", 'detectedSourceLanguage': 'fr'}
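Since the results come back as plain dictionaries, they can be filtered with ordinary Python. The following is a minimal sketch of one way to decide whether to trust a detection result. The helper name `accept_language` and the 0.3 confidence threshold are illustrative assumptions and are not part of the module.

```python
def accept_language(result, min_confidence=0.3):
    """Return the detected language code, or None if the result looks untrustworthy.

    `result` is assumed to have the same keys as the lang_detect() output
    shown above: 'isReliable', 'confidence', and 'language'.
    The default threshold is an arbitrary, illustrative value.
    """
    if not result.get('isReliable'):
        return None
    if result.get('confidence', 0.0) < min_confidence:
        return None
    return result.get('language')


# The sample detection output from the post, reused as input.
sample = {'isReliable': True, 'confidence': 0.31734600000000002, 'language': 'en'}
detected = accept_language(sample)  # 'en'
```

A stricter threshold simply rejects more results; what value is appropriate would depend on the data being classified.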

We used it ourselves to detect the language of each tweet in a sample of 1 million tweets from our database, with the following results:

We’ve also found it easy to combine the tool with SQLAlchemy to create metadata tables with linguistic information.
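As a rough illustration of that combination, here is a minimal sketch using SQLAlchemy Core with an in-memory SQLite database. The table name, column names, and the `store_languages` helper are hypothetical and do not reflect our actual schema. A stand-in `fake_detect` function takes the place of `lang_detect` so the example runs without network access.

```python
from sqlalchemy import (Boolean, Column, Float, Integer, MetaData, String,
                        Table, create_engine)

metadata = MetaData()

# Hypothetical metadata table: one row of linguistic information per tweet.
tweet_languages = Table(
    'tweet_languages', metadata,
    Column('tweet_id', Integer, primary_key=True),
    Column('language', String(16)),
    Column('confidence', Float),
    Column('is_reliable', Boolean),
)


def store_languages(engine, tweets, detect):
    """Detect the language of each (tweet_id, text) pair and store the result.

    `detect` is any callable returning a dict shaped like the
    lang_detect() output shown earlier. Returns the number of rows stored.
    """
    rows = []
    for tweet_id, text in tweets:
        result = detect(text)
        rows.append({
            'tweet_id': tweet_id,
            'language': result['language'],
            'confidence': result['confidence'],
            'is_reliable': result['isReliable'],
        })
    if not rows:
        return 0
    # engine.begin() opens a transaction and commits it on success.
    with engine.begin() as conn:
        conn.execute(tweet_languages.insert(), rows)
    return len(rows)


def fake_detect(text):
    # Stand-in for lang_detect(); always reports English.
    return {'isReliable': True, 'confidence': 0.5, 'language': 'en'}


engine = create_engine('sqlite:///:memory:')
metadata.create_all(engine)
stored = store_languages(
    engine, [(1, 'hello world'), (2, 'good morning')], fake_detect)
```

Passing the detector in as an argument keeps the storage code independent of the network call, which makes it straightforward to swap in `lang_detect` itself.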

We hope that this small, MIT/X11-licensed release will prove useful to some in the Web Ecology community. Until we settle on a platform for open repository hosting, you can download the file here. If you would like to contribute patches or additions, or if you have any questions, feel free to send them to Jon.Beilin@webecologyproject.org.

I would also like to thank Sam Gilbert for his invaluable contributions, feedback, and support.