Shijie Wang

Reputation: 133

Remove a sentence starting with a word in R?

I have a tweet text like this in R.

"RT @SportClipsUT125: #SavingLivesLooksGood with #RedCross. Donate this month & Get free haircut cpn. https://somewebsite https://somewebsite…"

How can I remove all the links (so that I can then drop duplicate tweets), so that the tweet above becomes the string below?

"RT @SportClipsUT125: #SavingLivesLooksGood with #RedCross. Donate this month & Get free haircut" 

I have tried this:

gsub('https*','',test_str)

but it returns

"RT @SportClipsUT125: #SavingLivesLooksGood with #RedCross. Donate this           
month & Get free haircut cpn. ://somewebsite ://somewebsite…"

Upvotes: 0

Views: 1599

Answers (1)

Rilcon42

Reputation: 9763

A simple solution is to change your gsub command:

gsub("http[s]*://[[:alnum:]]*", "", test_str)

This matches both the http and https versions, but it only consumes letters and digits after ://, so on a real URL it stops at the first dot or slash.

@alistaire's suggestion in the comments works in more cases and is easier to read:

gsub('http\\S*', "", test_str)

This removes everything from http up to the next whitespace character (URLs do not contain whitespace).
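To see how these patterns differ, here is a minimal sketch that runs the question's attempt and both suggestions on the same string. The sample tweet and its example.com / example.org URLs are made up for illustration.

```r
# Hypothetical sample tweet; the URLs are invented for this comparison
tweet <- "Get free haircut cpn. https://example.com/a?b=1 http://example.org"

# Question's attempt: matches "http" plus zero or more "s", nothing after it
attempt1 <- gsub("https*", "", tweet)

# Alphanumeric version: stops at the first "." or "/" after "://"
attempt2 <- gsub("http[s]*://[[:alnum:]]*", "", tweet)

# Non-whitespace version: consumes each URL up to the next space
attempt3 <- gsub("http\\S*", "", tweet)

print(attempt1)  # "Get free haircut cpn. ://example.com/a?b=1 ://example.org"
print(attempt2)  # "Get free haircut cpn. .com/a?b=1 .org"
print(attempt3)  # URLs are gone; two trailing spaces remain
```

Only the last pattern removes the URLs completely. It leaves the separating spaces behind, so a trimws() call afterwards is probably worthwhile.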

To strip retweet markers (RT or via followed by one or more @usernames):

gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", test_str)

To strip @mentions:

gsub("@\\w+", "", test_str)
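Since the goal in the question is to find duplicate tweets, the substitutions can be chained and the result passed to unique(). This is only a sketch: clean_tweet is a hypothetical helper name, the sample tweets are invented, and it applies just the URL and @mention patterns (the retweet pattern could be added the same way).

```r
# clean_tweet is a hypothetical helper, not part of any package
clean_tweet <- function(x) {
  x <- gsub("http\\S*", "", x)        # drop URLs
  x <- gsub("@\\w+", "", x)           # drop @mentions
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  trimws(x)                           # strip leading/trailing whitespace
}

# Invented sample data: the first two tweets differ only in their link
tweets <- c(
  "Donate this month & Get free haircut https://example.com/1",
  "Donate this month & Get free haircut https://example.com/2",
  "Thanks @someone for donating"
)

cleaned <- clean_tweet(tweets)
unique_tweets <- unique(cleaned)
print(unique_tweets)
```

Collapsing whitespace before comparing matters here, because removing a URL or mention otherwise leaves stray spaces that would make identical tweets look different.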

I would highly recommend putting your data in a corpus (the document-collection format used by the tm package); it makes things like removing frequently repeated words and URLs very easy. If you have a corpus of data you could do this:

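library(tm)
# Build a corpus from a character vector of tweets
corpus <- Corpus(VectorSource(my_data))
# Re-encode to UTF-8, replacing invalid bytes with their hex codes
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, to = 'UTF-8', sub = 'byte')))
# Strip everything from "http" up to the next whitespace
removeURL <- function(x) gsub('http\\S*', "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))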

Awesome link for examples on how to do all this: Text Mining Guide on Rpubs

Upvotes: 2
