Code Release: Language Detection and Translation
One of the tenets of Web Ecology is accessibility to the field through open tools and open data. At the Web Ecology Project, we’re working to get more of our code into a clean, commented, and releasable state. The first tool queued up for release is a Python module that allows easy use of Google Language Tools. It covers language detection and translation; transliteration is included in an experimental state, because Google has not yet released the API spec for that portion and we had to reverse-engineer it.
Now for some sample uses of the tool:
>>> from googlelanguage import *
>>> print lang_detect("this is a sentence in English")
{'isReliable': True, 'confidence': 0.31734600000000002, 'language': 'en'}
>>> print lang_translate("comment dit on 'WebEcology' en francais?", dest_lang="en")
{'translatedText': "how it says 'WebEcology' in French?", 'detectedSourceLanguage': 'fr'}
We used it ourselves to detect the language of each tweet in a sample of 1 million tweets from our database, with the following results:
We’ve also found it easy to combine the tool with SQLAlchemy to create metadata tables with linguistic information.
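As a rough illustration of the metadata-table idea, here is a minimal sketch. It is not our production code: it uses the standard-library sqlite3 module in place of SQLAlchemy so that it runs without extra dependencies, and it swaps the real lang_detect for a toy stand-in that returns a dict of the same shape as the sample output above. The table name, column names, and helper functions are all assumptions made for the example.

```python
import sqlite3


def lang_detect(text):
    """Hypothetical stand-in for googlelanguage.lang_detect.

    Returns a dict shaped like the sample output above, but uses a toy
    keyword rule instead of calling Google.
    """
    words = set(text.lower().split())
    if words & {"the", "is", "this"}:
        return {"language": "en", "confidence": 0.9, "isReliable": True}
    if words & {"isto", "uma"}:
        return {"language": "pt", "confidence": 0.9, "isReliable": True}
    return {"language": "unknown", "confidence": 0.0, "isReliable": False}


def build_language_table(conn, tweets):
    """Store one row of language metadata per (tweet_id, text) pair."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweet_language ("
        "tweet_id INTEGER PRIMARY KEY, language TEXT, "
        "confidence REAL, is_reliable INTEGER)"
    )
    rows = []
    for tweet_id, text in tweets:
        result = lang_detect(text)
        rows.append((tweet_id, result["language"],
                     result["confidence"], int(result["isReliable"])))
    conn.executemany("INSERT INTO tweet_language VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def language_counts(conn):
    """Return (language, count) pairs, most common language first."""
    return conn.execute(
        "SELECT language, COUNT(*) AS n FROM tweet_language "
        "GROUP BY language ORDER BY n DESC, language"
    ).fetchall()


# Made-up sample data, only to exercise the functions.
conn = sqlite3.connect(":memory:")
sample = [(1, "this is a tweet"),
          (2, "isto e uma frase"),
          (3, "the weather is nice")]
build_language_table(conn, sample)
print(language_counts(conn))
```

Keeping the detection result in its own table, keyed by tweet ID, means the language tally is a single GROUP BY query and the original tweet table stays untouched.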
It is our hope that this small, MIT/X11-licensed release will prove useful to some in the Web Ecology community. Until we figure out which platform we’re going to use for open repository hosting, you can download the file here. And if you would like to contribute patches or additions, or if you have any questions, feel free to send them to Jon.Beilin@webecologyproject.org
I would also like to thank Sam Gilbert for his invaluable contributions, feedback, and support.