Reputation: 31
I'm getting tweets from the Twitter Streaming API and found that some of them have the language code "in". The language code is supposed to follow the ISO 639-1 standard, but I haven't been able to find which language it corresponds to. Does anybody know?
Upvotes: 2
Views: 1971
Reputation: 1
I streamed about 6 hours' worth of tweets geolocated in Asia and took a look. Annoyingly, the 'in' code catches tweets in Indonesian (Bahasa Indonesia) and Malay (Bahasa Malaysia), two similar languages, as well as Hindi typed in Roman letters (I checked with someone fluent in Hindi).
I also looked at the tweets marked as coming from Malaysia (country_code 'MY'), where the main language spoken is Malay/Bahasa Malaysia (ISO 639-1 code 'ms'), and the vast majority of tweets were marked as 'in'. Given how close the two languages are, I'm not surprised that whatever Twitter's done here with the 'in' code classifies them as the same language.
Furthermore, Indonesian has quite a few loan words from Hindi.
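If you want to repeat this kind of check on your own sample, here is a rough sketch in Python. It assumes you saved the stream as one JSON tweet per line and that each tweet carries the `lang` and `place.country_code` fields; the function name and the sample data are made up for illustration.

```python
import json
from collections import Counter, defaultdict

def lang_counts_by_country(lines):
    """Count the `lang` tag per `place.country_code` over JSON lines."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the saved stream
        tweet = json.loads(line)
        place = tweet.get("place") or {}  # `place` may be null
        country = place.get("country_code")
        lang = tweet.get("lang")
        if country and lang:
            counts[country][lang] += 1
    return counts

# Hypothetical sample data, only to show the shape of the input.
sample = [
    '{"lang": "in", "place": {"country_code": "MY"}}',
    '{"lang": "in", "place": {"country_code": "MY"}}',
    '{"lang": "en", "place": {"country_code": "MY"}}',
    '{"lang": "in", "place": {"country_code": "ID"}}',
    '{"lang": "hi", "place": null}',
]

counts = lang_counts_by_country(sample)
for country, langs in sorted(counts.items()):
    print(country, langs.most_common())
```

Tweets without a place are simply skipped, so the counts only cover geotagged tweets.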
Upvotes: 0
Reputation: 63
As described in the Twitter developer documentation, 'in' is used for Indonesian (web archive link for future reference).
In their documentation they say they're using the BCP 47 standard, which in turn refers to ISO 639, of which, as mentioned in one of the other answers, only an old, superseded version refers to Indonesian as 'in'. It looks a bit like they developed something and then tried to find a standard that kind of describes what they developed...
Anyway, I don't know about the precision of 'in' language detection at Twitter, so before you make this a big factor in your application, check for yourself how well it works. From my own experience I know that tweets in Swahili, which is not supported by Twitter's language detection, are often tagged as Tagalog ('tl'), making the 'tl' classification pretty unreliable...
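One way to "check for yourself" is to hand-label a small sample and see how often a given tag is right. A minimal sketch in Python, assuming you already have pairs of (Twitter's tag, your own label); the function name and the example pairs are invented for illustration and are not real measurements.

```python
def tag_precision(pairs, tag, expected):
    """Share of tweets tagged `tag` whose hand label is `expected`.

    `pairs` is an iterable of (twitter_lang, hand_label) tuples.
    Returns None when no tweet in the sample carries `tag`.
    """
    tagged = [label for twitter_lang, label in pairs if twitter_lang == tag]
    if not tagged:
        return None
    return sum(1 for label in tagged if label == expected) / len(tagged)

# Made-up labels: 'sw' here stands for a tweet you judged to be Swahili.
labelled = [
    ("tl", "sw"),
    ("tl", "tl"),
    ("tl", "sw"),
    ("en", "en"),
]

precision_tl = tag_precision(labelled, "tl", "tl")
print(precision_tl)
```

A few hundred hand-labelled tweets per tag is usually enough to see whether a tag is usable for your purpose.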
Upvotes: 0
Reputation: 1701
According to Wikipedia, "in" is the former ISO 639-1 language code for Indonesian ("id" has been used since November 3, 1989), but that seems weird.
What I did was this search: it gives you a bunch of tweets in this strange "in" language, and you just have to click the grey "show translation" thingie to have Bing do the work for you. Since all the tweets I clicked were either in Malay or in Indonesian (which seems to be a standardized register of Malay, whatever that means), I would say that "in" encompasses both of them, which seem to be the two major languages spoken in Indonesia.
In most cases where you do not know what a language is, just throw some lines into Google Translate and ask it to detect the language automatically; that should at least give you a big hint.
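If the rest of your code expects current ISO 639-1 codes, you can map the withdrawn codes to their replacements before doing anything else. A small sketch in Python; the three pairs below are the codes that were replaced in 1989, while the names `LEGACY_ISO639_CODES` and `normalize_lang` are just made up for this example.

```python
# Withdrawn ISO 639 two-letter codes and their current replacements.
LEGACY_ISO639_CODES = {
    "in": "id",  # Indonesian
    "iw": "he",  # Hebrew
    "ji": "yi",  # Yiddish
}

def normalize_lang(code):
    """Return the current code for a possibly outdated language code."""
    if code is None:
        return None
    code = code.lower()
    return LEGACY_ISO639_CODES.get(code, code)

print(normalize_lang("in"))
print(normalize_lang("en"))
```

Note that this only renames the tag. It does not separate Malay from Indonesian if both were tagged "in" in the first place.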
Upvotes: 2