Reputation: 295
I have a data frame containing tweets from the Twitter API, with both English and non-English tweets. Before posting this question I searched Stack Overflow and did not find anything that addresses what I am trying to do.
Since tweets often contain emojis, I want to filter out the tweets that are not in English without the emojis being taken into account. I have tried stringi::stri_enc_isascii(),
but it does not recognize English tweets with emojis as English (see the check after the sample texts below).
For replication purposes, here are some texts:
"私は、トランプ大統領を信じています🇺🇸🇯🇵 #America"
"Thank you Nashville"
"🇺🇸 Bless America"
In the final corpus, I should only have the last two texts.
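Here is roughly what I see with stri_enc_isascii() on those texts (a minimal check; tweet_texts is just a placeholder vector holding the three texts above):
library(stringi)
# placeholder vector with the three sample texts above
tweet_texts <- c(
  "私は、トランプ大統領を信じています🇺🇸🇯🇵 #America",
  "Thank you Nashville",
  "🇺🇸 Bless America"
)
stri_enc_isascii(tweet_texts)
#> [1] FALSE  TRUE FALSE
So the English tweet with the flag emoji is flagged the same way as the Japanese one.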
Thank you!
Upvotes: 0
Views: 316
Reputation: 1751
You can remove all non-ASCII characters from your dataset by doing:
# assuming tweets is the field name where you store the tweets text messages
dataset$tweets <- sapply(dataset$tweets, function(x) gsub("[^\x01-\x7F]", "", x))
All emojis and non-ASCII characters are then removed from the text. The next step is to keep only the rows where the tweets field is not empty:
dataset <- dataset[dataset$tweets != "", ]
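For example, on the third sample text from the question the substitution leaves only the ASCII part (assuming the strings are UTF-8 encoded):
gsub("[^\x01-\x7F]", "", "🇺🇸 Bless America")
#> [1] " Bless America"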
Now, if you want to keep the emojis, a better approach is to run this substitution only for indexing purposes and then use the index to filter the untouched data. For example:
modified_tweets <- sapply(dataset$tweets, function(x) gsub("[^\x01-\x7F]", "", x))
# now filter by condition
dataset <- dataset[modified_tweets != "", ]
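A minimal sketch of this index-then-filter pattern, using a toy data frame with the column name used above (note that the non-English example here contains no ASCII characters):
# toy data for illustration only
dataset <- data.frame(
  id = 1:3,
  tweets = c("Thank you Nashville", "🇺🇸 Bless America", "こんにちは、世界"),
  stringsAsFactors = FALSE
)
# strip non-ASCII characters only in a helper vector used for indexing
modified_tweets <- sapply(dataset$tweets, function(x) gsub("[^\x01-\x7F]", "", x))
# keep rows whose stripped text is non-empty; the original tweets (with emojis) stay untouched
dataset <- dataset[modified_tweets != "", ]
dataset$tweets
#> only the two English tweets remain, emojis intact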
Upvotes: 2