I have a tweet text like this in R.
"RT @SportClipsUT125: #SavingLivesLooksGood with #RedCross. Donate this month & Get free haircut cpn. https://somewebsite https://somewebsite…"
How can I remove all the links (so that I can then remove duplicate tweets), so that the tweet above becomes the string below?
"RT @SportClipsUT125: #SavingLivesLooksGood with #RedCross. Donate this month & Get free haircut"
I have tried this:
gsub('https*','',test_str)
but it returns
"RT @SportClipsUT125: #SavingLivesLooksGood with #RedCross. Donate this month & Get free haircut cpn. ://somewebsite ://somewebsite…"
Upvotes: 0
A simple solution is to change your gsub command:
gsub("http[s]*://[[:alnum:]]*", "", test_str)
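This removes the URLs in your example, for both http and https. Note that [[:alnum:]]* stops at the first character that is not a letter or digit, so a URL containing dots or slashes is only partly removed.
@alistaire's suggestion in the comments works in more cases and is easier to read:
gsub('http\\S*', "", test_str)
It removes anything starting with http and stops at the first whitespace character (which URLs do not contain).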
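To make the difference between the three patterns concrete, here is a small sketch that runs each of them on the same string. The sample text and the example.com URL are made up for illustration and are not from the original tweet.

```r
# Hypothetical sample tweet; the URL is invented for illustration.
tweet <- "Donate this month. https://example.com/abc?x=1 more text"

# 'https*' matches only the literal "http" followed by zero or more "s",
# so the rest of the URL is left behind.
r1 <- gsub("https*", "", tweet)
# [1] "Donate this month. ://example.com/abc?x=1 more text"

# '[[:alnum:]]*' stops at the first character that is not a letter or digit
# (here the "." before "com").
r2 <- gsub("http[s]*://[[:alnum:]]*", "", tweet)
# [1] "Donate this month. .com/abc?x=1 more text"

# '\\S*' consumes everything up to the next whitespace character.
r3 <- gsub("http\\S*", "", tweet)
# [1] "Donate this month.  more text"
```

Note that the last result still contains a double space where the URL used to be, so a whitespace clean-up step may be worth adding afterwards.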
Two related patterns are useful for tweets. To remove retweet markers (RT or via, plus the handles that follow):
gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", test_str)
To remove @mentions:
gsub("@\\w+", "", test_str)
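Since the goal in the question is to drop duplicate tweets, the cleaning steps can be wrapped in one function and combined with base R's duplicated(). This is only a sketch: clean_tweet is a hypothetical helper name, the whitespace clean-up is an extra step that the answer does not mention, and the tweets are made up.

```r
# Hypothetical helper: strip URLs and @mentions, then tidy the whitespace.
clean_tweet <- function(x) {
  x <- gsub("http\\S*", "", x)  # URLs
  x <- gsub("@\\w+", "", x)     # @mentions
  x <- gsub("\\s+", " ", x)     # collapse runs of whitespace
  trimws(x)                     # drop leading/trailing whitespace
}

# Made-up tweets that differ only in their links.
tweets <- c(
  "Donate this month @RedCross https://example.com/a",
  "Donate this month @RedCross https://example.com/b",
  "Something else entirely"
)

cleaned <- clean_tweet(tweets)

# Keep the first original tweet for each distinct cleaned text.
unique_tweets <- tweets[!duplicated(cleaned)]
```

Comparing on the cleaned text while keeping the original strings means the links are still available in the retained tweets if you need them later.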
I would highly recommend putting your data in a corpus (a special data format provided by the tm package). It makes things like removing frequently repeated words and URLs very easy. If you have a corpus, you could do this:
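library(tm)
# Build a corpus from a character vector of tweets
corpus <- Corpus(VectorSource(my_data))
# Convert to UTF-8; bytes that cannot be converted are replaced by their hex codes
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, to = "UTF-8", sub = "byte")))
# Remove anything starting with "http" up to the next whitespace
removeURL <- function(x) gsub("http\\S*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))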
Awesome link for examples on how to do all this: Text Mining Guide on Rpubs
Upvotes: 2