Reputation: 295
I have a data frame containing tweets from the Twitter API, with both English and non-English tweets. Before posting this question I searched Stack Overflow and did not find anything that addresses what I am trying to do.
Since tweets often contain emojis, I want to filter out the tweets that are not in English without the emojis being taken into account. I have tried stringi::stri_enc_isascii(),
but it does not recognize English tweets with emojis as English (see the check after the sample texts below).
For replication purposes, here are some texts:
"私は、トランプ大統領を信じています🇺🇸🇯🇵 #America"
"Thank you Nashville"
"🇺🇸 Bless America"
In the final corpus, I should only have the last two texts.
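Here is roughly what I see with stri_enc_isascii() on those texts (a minimal check; tweet_texts is just a placeholder vector holding the three texts above):
library(stringi)
# placeholder vector with the three sample texts above
tweet_texts <- c(
  "私は、トランプ大統領を信じています🇺🇸🇯🇵 #America",
  "Thank you Nashville",
  "🇺🇸 Bless America"
)
stri_enc_isascii(tweet_texts)
#> [1] FALSE  TRUE FALSE
So the English tweet with the flag emoji is flagged the same way as the Japanese one.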
Thank you!
Upvotes: 0
Views: 316
Reputation: 1751
You can remove all non-ASCII characters from your dataset by doing:
# assuming tweets is the field name where you store the tweets text messages
dataset$tweets <- sapply(dataset$tweets, function(x) gsub("[^\x01-\x7F]", "", x))
All emojis and non-ASCII characters are then removed from the text. The next step is to keep only the rows where the tweets field is not empty:
dataset <- dataset[dataset$tweets != "", ]
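For example, on the third sample text from the question the substitution leaves only the ASCII part (assuming the strings are UTF-8 encoded):
gsub("[^\x01-\x7F]", "", "🇺🇸 Bless America")
#> [1] " Bless America"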
Now, if you want to keep the emojis, a better approach is to run this substitution only for indexing purposes and then use the index to filter the untouched data. For example:
modified_tweets <- sapply(dataset$tweets, function(x) gsub("[^\x01-\x7F]", "", x))
# now filter by condition
dataset <- dataset[modified_tweets != "", ]
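A minimal sketch of this index-then-filter pattern, using a toy data frame with the column name used above (note that the non-English example here contains no ASCII characters):
# toy data for illustration only
dataset <- data.frame(
  id = 1:3,
  tweets = c("Thank you Nashville", "🇺🇸 Bless America", "こんにちは、世界"),
  stringsAsFactors = FALSE
)
# strip non-ASCII characters only in a helper vector used for indexing
modified_tweets <- sapply(dataset$tweets, function(x) gsub("[^\x01-\x7F]", "", x))
# keep rows whose stripped text is non-empty; the original tweets (with emojis) stay untouched
dataset <- dataset[modified_tweets != "", ]
dataset$tweets
#> only the two English tweets remain, emojis intact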
Upvotes: 2