Aniks

Reputation: 1131

Remove urls from strings

I have the following string, stored in the object sentence:

sentence <- "aazdlubtirol: RT @tradeDayTrades: sister articles \"$AAPL Dancing in a Burning Room\" January 2013  http://t.co/tkuCRfLy  \" $AAPL vs $AAPL \"  August 2011 http://t.co/863HkVjn"

I am trying to use gsub to remove urls beginning with http:

sentence <- gsub('http.*','',sentence)

However, it removes everything from the first http onward:

aazdlubtirol: RT @tradeDayTrades: sister articles \"$AAPL Dancing in a Burning Room\" January 2013

What I want is:

aazdlubtirol: RT @tradeDayTrades: sister articles \"$AAPL Dancing in a Burning Room\" January 2013 \" $AAPL vs $AAPL \" August 2011

I am trying to clean up the strings: whenever a string contains a URL starting with http, I want to remove only that URL. I found some solutions, but they have not helped.

Upvotes: 7

Views: 6448

Answers (1)

Justin

Reputation: 43255

Add a space to your pattern:

gsub('http.* *', '', sentence)

Or use \\s, which is the regex class for whitespace:

gsub('http.*\\s*', '', sentence)

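As the comment points out, though, neither of these is enough: .* matches any character, spaces included, and regex quantifiers are greedy, so the match still runs to the end of the string. Instead, match http followed by one or more non-whitespace characters and then zero or more whitespace characters: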

gsub('http\\S+\\s*', '', sentence)
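To see the difference, here is a minimal sketch that runs both the greedy pattern from the question and the non-whitespace pattern on the same string. The variable names greedy, cleaned and tidy are just illustrative, and the final whitespace tidy-up is an optional extra step I am assuming you want, since removing the urls leaves doubled and trailing spaces behind:

```r
sentence <- "aazdlubtirol: RT @tradeDayTrades: sister articles \"$AAPL Dancing in a Burning Room\" January 2013  http://t.co/tkuCRfLy  \" $AAPL vs $AAPL \"  August 2011 http://t.co/863HkVjn"

# Greedy: .* runs to the end of the string, so everything after the
# first "http" is lost, including the text between the two urls.
greedy <- gsub("http.*", "", sentence)

# \\S+ stops at the first whitespace character, so only each url
# (plus any whitespace right after it) is removed.
cleaned <- gsub("http\\S+\\s*", "", sentence)

# Optional: collapse runs of whitespace and trim the ends.
tidy <- trimws(gsub("\\s+", " ", cleaned))

print(greedy)
print(cleaned)
print(tidy)
```

Note that this pattern removes anything starting with http up to the next whitespace, so it would also eat a word such as "https-only" if one appeared in the text; anchoring on "https?://" may be safer if that matters for your data.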

Upvotes: 9
