Reputation: 8041
I want to use the regex by John Gruber (http://daringfireball.net/2010/07/improved_regex_for_matching_urls) to match complex URLs in blocks of text. The regex is quite complex (as is the task; see "regex to find url in a text").
My problem is that I can't get it to work in R:
x <- c("http://foo.com/blah_blah",
"http://foo.com/blah_blah/",
"(Something like http://foo.com/blah_blah)",
"http://foo.com/blah_blah_(wikipedia)",
"http://foo.com/more_(than)_one_(parens)",
"(Something like http://foo.com/blah_blah_(wikipedia))",
"http://foo.com/blah_(wikipedia)#cite-1",
"http://foo.com/blah_(wikipedia)_blah#cite-1",
"http://foo.com/unicode_(✪)_in_parens",
"http://foo.com/(something)?after=parens",
"http://foo.com/blah_blah.",
"http://foo.com/blah_blah/.",
"<http://foo.com/blah_blah>",
"<http://foo.com/blah_blah/>",
"http://foo.com/blah_blah,",
"http://www.extinguishedscholar.com/wpglob/?p=364.",
"http://✪df.ws/1234",
"rdar://1234",
"rdar:/1234",
"x-yojimbo-item://6303E4C1-6A6E-45A6-AB9D-3A908F59AE0E",
"message://%[email protected]%3e",
"http://➡.ws/䨹",
"www.c.ws/䨹",
"<tag>http://example.com</tag>",
"Just a www.example.com link.",
"http://example.com/something?with,commas,in,url, but not at end",
"What about <mailto:[email protected]?subject=TEST> (including brokets).",
"mailto:[email protected]",
"bit.ly/foo",
"“is.gd/foo/”",
"WWW.EXAMPLE.COM",
"http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))/Web_ENG/View_DetailPhoto.aspx?PicId=752",
"http://www.asianewsphoto.com/(S(neugxif4twuizg551ywh3f55))",
"http://lcweb2.loc.gov/cgi-bin/query/h?pp/horyd:@field(NUMBER+@band(thc+5a46634))")
t <- regexec("\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'".,<>?«»“”‘’]))", x)
regmatches(x,t)
I appreciate your help.
Upvotes: 0
Views: 146
Reputation: 8041
I ended up using gregexpr, as it supports perl=TRUE. After adapting the regex for R, I came up with the following solution (using the data above):
findURL <- function(x){
  t <- gregexpr("(?xi)\\b(
    (?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)
    (?:[^\\s\\(\\)<>]+|\\(([^\\s\\(\\)<>]+|(\\([^\\s\\(\\)<>]+\\)))*\\))+
    (?:\\(([^\\s\\(\\)<>]+|(\\([^\\s\\(\\)<>]+\\)))*\\)|[^\\s`!\\(\\)\\[\\]{};:'\\\"\\.,<>\\?«»“”‘’])
  )", x, perl=TRUE, fixed=FALSE)
  regmatches(x, t)
}
# Find URLs
urls <- findURL(x)
# Count URLs
count.urls.temp <- lapply(urls, length)
count.urls <- sum(unlist(count.urls.temp))
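If you only need the matches themselves in one vector, the list returned by regmatches can be flattened. A small sketch, assuming the urls object from above and a recent R version where lengths() is available (the variable names here are just illustrative):
# Flatten to a single character vector and drop duplicate matches
all.urls <- unique(unlist(urls))
# Number of URLs found in each input string
count.per.string <- lengths(urls)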
I hope this is helpful for others.
Upvotes: 1