user670186

Reputation: 2860

R replace all substrings that are websites

I have tried

gsub("/^(http?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w \\.-]*)*\\/?$/","","This is a website http://www.example.com/test and needs to be removed",ignore.case=T, perl=T)

The pattern is from: this website

The code runs but doesn't work: the URL is not removed. Any ideas?

Upvotes: 0

Views: 63

Answers (2)

Tyler Rinker

Reputation: 110024

The rm_url function from the qdapRegex package, which I maintain, is made for this. It has the added benefit of cleaning up the extra white space left behind:

library(qdapRegex)

rm_url("This is a website http://www.example.com/test and needs to be removed")
## [1] "This is a website and needs to be removed"

If you're interested in the regex behind rm_url, you can use the grab function on any qdapRegex function that uses a single regex to see the expression used:

grab("rm_url")
## [1] "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"
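
For anyone who would rather not add a package dependency, here is a minimal base R sketch that applies the expression shown above with gsub and then tidies the whitespace. The variable names are only illustrative, and this approximates the result shown for rm_url rather than reproducing its internals:

```r
x <- "This is a website http://www.example.com/test and needs to be removed"

# Expression as printed by grab("rm_url") above
url_pattern <- "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

# Remove every run of non-space characters that starts with http, ftp or www.
no_url <- gsub(url_pattern, "", x)

# Collapse the doubled space left behind and trim the ends
cleaned <- trimws(gsub("\\s+", " ", no_url))
cleaned
## [1] "This is a website and needs to be removed"
```

Note that this pattern is deliberately loose: it removes anything beginning with http, ftp or www. up to the next space, so it can also catch ordinary words that happen to contain those letters.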

Upvotes: 0

zessx

Reputation: 68820

Remove:

  • ^ and $, which anchor the match to the start and end of the string
  • the leading and trailing /, which are regex delimiters in other languages and are not needed by gsub
  • the space in the last character class, which prevents you from matching only the URL (currently it swallows the rest of the line)

gsub("(http?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w\\.-]*)*\\/?","","This is a website http://www.example.com/test and needs to be removed",ignore.case=T, perl=T)

Try it
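
The three changes can be checked with a short sketch. It first confirms that the pattern from the question never matches, then applies the trimmed pattern. Two things here are my own assumptions and not part of the answer above: http? is widened to https? so that https:// links are covered too, and a whitespace clean-up step is added at the end. The variable names are illustrative.

```r
x <- "This is a website http://www.example.com/test and needs to be removed"

# Pattern as posted in the question: a literal "/" has to be consumed before "^",
# so the start-of-string anchor can never succeed and nothing is replaced.
original <- "/^(http?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w \\.-]*)*\\/?$/"
grepl(original, x, ignore.case = TRUE, perl = TRUE)
## [1] FALSE

# Trimmed pattern: no delimiters, no anchors, no space in the last class.
# Assumption: "https?" instead of "http?" so the scheme of https links is removed as well.
fixed <- "(https?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w\\.-]*)*\\/?"
no_url <- gsub(fixed, "", x, ignore.case = TRUE, perl = TRUE)

# Assumption: also collapse the doubled space that the removal leaves behind
cleaned <- trimws(gsub("\\s+", " ", no_url))
cleaned
## [1] "This is a website and needs to be removed"
```

Because the scheme is optional, this pattern matches any token of the form name.suffix, so it may also strip things like file names or abbreviations with dots. If that is a concern, making the scheme mandatory is one way to narrow it.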

Upvotes: 1
