Reputation: 530
Is it possible to use a regex that ignores SOME punctuation (but not "/"s) at the end of URL strings (i.e. punctuation at the end of a URL followed by a space) when extracting them? When extracting URLs, I'm getting periods, parentheses, question marks and exclamation points at the end of the strings I extract. For example:
findURL <- function(x){
  m <- gregexpr("http[^[:space:]]+", x, perl=TRUE)
  w <- unlist(regmatches(x,m))
  op <- paste(w,collapse=" ")
  return(op)
}
x <- "find out more at http://bit.ly/SS/VUEr). check it out here http://bit.ly/14pwinr)? http://bit.ly/108vJOM! Now!"
findURL(x)
[1] http://bit.ly/SS/VUEr). http://bit.ly/14pwinr)? http://bit.ly/108vJOM!
And
findURL2 <- function(x){
  m <- gregexpr("www[^[:space:]]+", x, perl=TRUE)
  w <- unlist(regmatches(x,m))
  op <- paste(w,collapse=" ")
  return(op)
}
y <- "This is an www.example.com/store/locator. of the type of www.example.com/Google/Voice. data I'd like to extract www.example.com/network. get it?"
findURL2(y)
[1] www.example.com/store/locator. www.example.com/Google/Voice. www.example.com/network.
Is there a way to modify these functions so that if a . ) ? ! or , OR (IF POSSIBLE) a ). )? )! or ), is found at the end of a URL followed by a space, it is NOT extracted? In other words, trailing periods, parentheses, question marks, exclamation points, or commas at the end of a URL should be dropped.
Upvotes: 0
Views: 71
Reputation: 174706
Use a positive lookahead. You can also combine both functions into one:
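findURL <- function(x){
  # Lazily match from "www" or "http", stopping at the first point where only
  # non-word, non-space characters remain before whitespace or the end of the string
  m <- gregexpr("\\b(?:www|http)[^[:space:]]+?(?=[^\\s\\w]*(?:\\s|$))", x, perl=TRUE)
  w <- unlist(regmatches(x,m))
  op <- paste(w,collapse=" ")
  return(op)
}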
x <- "find out more at http://bit.ly/SS/VUEr). check it out here http://bit.ly/14pwinr)? http://bit.ly/108vJOM! Now!"
y <- "This is an www.example.com/store/locator. of the type of www.example.com/Google/Voice. data I'd like to extract www.example.com/network. get it?"
findURL(x)
findURL(y)
# [1] "http://bit.ly/SS/VUEr http://bit.ly/14pwinr http://bit.ly/108vJOM"
# [1] "www.example.com/store/locator www.example.com/Google/Voice www.example.com/network"
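One thing to watch for: the lookahead above uses [^\\s\\w]*, which treats any non-word, non-space character as trimmable, so a URL that ends in "/" would lose that slash too. If only the punctuation listed in the question should be dropped, a possible variant is to spell those characters out in the lookahead. The sketch below is an illustration under that assumption, not a tested general-purpose URL matcher; the name findURL3 and the sample string z are made up for the example.

```r
# Hypothetical variant of findURL(): trim only . , ! ? ) at the end of a URL,
# so that a trailing "/" is kept.
findURL3 <- function(x) {
  # Lazy match, then require that only the listed punctuation (possibly none)
  # sits between the end of the match and whitespace or the end of the string.
  pattern <- "\\b(?:www|http)[^[:space:]]+?(?=[.,!?)]*(?:\\s|$))"
  m <- gregexpr(pattern, x, perl = TRUE)
  w <- unlist(regmatches(x, m))
  paste(w, collapse = " ")
}

# Made-up sample: one URL ends in ")." one in "/" and one in "!"
z <- "see http://bit.ly/SS/VUEr). or www.example.com/store/ and http://bit.ly/108vJOM!"
findURL3(z)
# [1] "http://bit.ly/SS/VUEr www.example.com/store/ http://bit.ly/108vJOM"
```

Because the character class is explicit, adding or removing a character there (a semicolon or a closing bracket, say) changes what gets trimmed without touching the rest of the pattern.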
Upvotes: 2