Reputation: 33
I am trying to remove URLs that may or may not start with http/https from a large text file, which I have loaded into urldoc in R. A URL may look like tinyurl.com/ydyzzlkk, aclj.us/2y6dQKw or pic.twitter.com/ZH08wej40K. Basically, I want to remove everything from the space before a '/' up to the next space after it. I have tried many patterns and searched in many places, but could not complete the task. It would help me a lot if you could give some input.
This is the last statement I tried before getting stuck:
urldoc = gsub("?[a-z]+\..\/.[\s]$","", urldoc)
Input would be: A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk
Output I am expecting is: A disgrace to his profession. In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. nothing like the admin. proposal:
Thanks.
Upvotes: 3
Views: 896
Reputation: 2727
I see this has already been answered, but here is an alternative in case you have not come across stringi before.
# most complete package for string manipulation
library(stringi)
# text and regex
text <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"
pattern <- "(?:\\s)[^\\s\\.]*\\.[^\\s]+"
# see what is captured
stringi::stri_extract_all_regex(text, pattern)
# remove (replace with "")
stringi::stri_replace_all_regex(text, pattern, "")
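If installing stringi is not an option, the same pattern should also work in base R, because PCRE (enabled with perl = TRUE) understands the constructs used here. A minimal sketch on the question's input; the name cleaned is only for illustration:

```r
# Sample input from the question
text <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"

# Same pattern as above: one whitespace char, then a token containing a dot
# that is followed by at least one more non-whitespace char
pattern <- "(?:\\s)[^\\s\\.]*\\.[^\\s]+"

# perl = TRUE switches gsub() from the default engine to PCRE
cleaned <- gsub(pattern, "", text, perl = TRUE)
cleaned
```

Note that the pattern requires a leading whitespace char, so a URL at the very start of the string is left alone, and any other token with an inner dot (3.14, for example) is removed as well.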
Upvotes: 1
Reputation: 5958
This might work:
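text <- " http:/thisisanurl.wde , thisaint , nope , uihfs/yay"
# split on single spaces into tokens
words <- strsplit(text, " ")[[1]]
# grepl() is vectorized, so sapply() is not needed; flag tokens containing "/"
isurl <- grepl("/", words, fixed = TRUE)
# keep the remaining tokens and glue them back together
result <- paste(words[!isurl], collapse = " ")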
result
[1] " , thisaint , nope ,"
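Since urldoc in the question comes from a large file and may hold many lines, the same idea can be wrapped in a function that works on a whole character vector. This is only a sketch: remove_slash_tokens is a hypothetical helper name, and the second element of urldoc is made-up sample data.

```r
# Hypothetical helper: drop every space-separated token that contains "/"
remove_slash_tokens <- function(x) {
  vapply(
    strsplit(x, " ", fixed = TRUE),   # one token vector per input string
    function(words) {
      paste(words[!grepl("/", words, fixed = TRUE)], collapse = " ")
    },
    character(1)
  )
}

# First element is the question's input, second is an invented line without links
urldoc <- c(
  "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk",
  "no links here"
)
remove_slash_tokens(urldoc)
```

This treats any token with a slash as a URL, so ordinary words such as and/or are dropped too.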
Upvotes: 0
Reputation: 627341
According to your specs, you may use the following regex:
\s*[^ /]+/[^ /]+
See the regex demo.
Details
\s* - 0 or more whitespace chars
[^ /]+ (or [^[:space:]/]+) - any 1 or more chars other than space (or whitespace) and /
/ - a slash
[^ /]+ (or [^[:space:]/]+) - any 1 or more chars other than space (or whitespace) and /
In R:
urldoc = gsub("\\s*[^ /]+/[^ /]+","", urldoc)
If you want to account for any whitespace, replace the literal space with [:space:]:
urldoc = gsub("\\s*[^[:space:]/]+/[^[:space:]/]+","", urldoc)
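The question also mentions URLs that start with http/https. As far as I can tell, the pattern above cannot match across the double slash in http://, so it would leave fragments of such URLs behind. One possible variation, shown here as a sketch rather than a definitive fix, lets everything after the first slash run up to the next whitespace char. The names pattern_any, x and input, and the example.com / x.org addresses, are made up for illustration:

```r
# Variation: a token with at least one char before the first "/",
# then anything (including more slashes) up to the next whitespace char
pattern_any <- "\\s*[^[:space:]/]+/[^[:space:]]*"

# Invented sample with scheme-prefixed URLs
x <- "see http://example.com/a/b and https://x.org now"
with_scheme <- gsub(pattern_any, "", x)
with_scheme

# The question's input still gives the expected output
input <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"
no_scheme <- gsub(pattern_any, "", input)
no_scheme
```

The trade-off is the same as with any slash-based rule: non-URL tokens such as and/or are removed too, along with any punctuation attached to the URL.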
Upvotes: 2