Reputation: 83215
One of the variables in my dataset contains URLs of Google search results pages. I want to extract the search keywords from those URLs.
An example dataset:
keyw <- structure(list(user = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1", "p2"), class = "factor"),
url = structure(c(3L, 5L, 4L, 1L, 2L, 6L), .Label = c("https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw", "https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw#safe=off&q=five+short+fingers+", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+a+chair", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+handshake", "https://www.youtube.com/watch?v=6HOallAdtDI"), class = "factor")),
.Names = c("user", "url"), class = "data.frame", row.names = c(NA, -6L))
So far I was able to extract the search keyword parts from the URLs with:
library(stringr)  # provides str_extract_all
keyw$words <- sapply(str_extract_all(keyw$url, 'q=([^&#]*)'),paste, collapse=",")
However, this still doesn't give me the desired result. The above code gives the following result:
> keyw$words
[1] "q=high+five"
[2] "q=high+five,q=high+five+with+handshake"
[3] "q=high+five,q=high+five+with+a+chair"
[4] "q=five+fingers"
[5] "q=five+fingers,q=five+short+fingers+"
[6] ""
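There are three problems with this output:
1. The words should be separate: instead of q=high+five, I need high,five.
2. When a URL contains two q= parts (rows 2, 3 and 5), only the last one should be kept.
3. When a URL contains no search query (row 6), the result should be NA.
The desired result should be: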
> keyw$words
[1] "high,five"
[2] "high,five,with,handshake"
[3] "high,five,with,a,chair"
[4] "five,fingers"
[5] "five,short,fingers"
[6] NA
How do I solve this?
Upvotes: 16
Views: 2415
Reputation: 15784
Another update after comment (looks too complex but it's the best I can achieve at this point :)):
keyw$words <- sapply(str_extract_all(str_extract(keyw$url,"https?:[/]{2}[^/]*google.*[/].*"),'(?<=q=|[+])([^$+#&]+)(?!.*q=)'),function(x) if(!length(x)) NA else paste(x,collapse=","))
> keyw$words
[1] "high,five" "high,five,with,handshake" "high,five,with,a,chair" "five,fingers"
[5] "five,short,fingers" NA
The change is in the input to str_extract_all: instead of the full vector, it now receives a vector "filtered" by str_extract against a regex. Any regex can go there to match more or less precisely what you wish.
Here the regex is:
http matches http literally
s? matches 0 or 1 s
: matches the colon literally
[/]{2} matches exactly two slashes (using a character class avoids the ugly \\/ construction and keeps things more readable)
[^/]* matches any number of non-slash characters
google.*[/] matches google literally, followed by anything up to the last /
.* finally matches anything (or nothing) after the last slash
Replace * by + wherever you want to ensure there's a parameter (+ requires the preceding character to be present at least once).
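As a rough illustration of what that host filter does on its own, here is a minimal base R sketch. The two URLs are shortened, hypothetical stand-ins for the question's data, and the names google_pat, is_google and filtered are made up for this example.

```r
# Hypothetical sample URLs (shortened versions of the question's data).
urls <- c(
  "https://www.google.nl/search?q=high+five&ie=utf-8",
  "https://www.youtube.com/watch?v=6HOallAdtDI"
)

# Same filter regex as above: scheme, two slashes, a host containing
# "google", then at least one more slash followed by anything.
google_pat <- "https?:[/]{2}[^/]*google.*[/].*"

# TRUE where the URL passes the filter, FALSE otherwise.
is_google <- grepl(google_pat, urls)

# Replace non-Google URLs by NA so a later extraction step skips them.
filtered <- ifelse(is_google, urls, NA_character_)
filtered
```

Note that the pattern is not anchored with ^, so it could also match a Google URL embedded inside another URL's parameters; adding ^ at the start would be one way to tighten it.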
Update, heavily inspired by @BrodieG: this returns NA if there's no match, but it will still match any site that has q= in its parameters.
Still with the same method:
> keyw$words <- sapply(str_extract_all(keyw$url,'(?:(?<=q=|\\+)([^$+#&]+)(?!.*q=))'),function(x) if(!length(x)) NA else paste(x,collapse=","))
> keyw$words
[1] "high,five" "high,five,with,handshake" "high,five,with,a,chair"
[4] "five,fingers" "five,short,fingers" NA
The lookbehind (?<=) ensures there's q= or + immediately before the word, and the negative lookahead (?!) ensures no further q= can be found before the end of the line.
The character class excludes the + sign, so the match stops at each word.
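To see what each lookaround contributes, the pattern can be tried in two steps on a single URL. This is only a sketch: it uses base R's gregexpr with perl = TRUE instead of stringr, and the URL and the variable names all_words and last_words are assumptions made for the illustration.

```r
# Hypothetical URL with two q= parameters, modeled on the question's data.
url <- "https://www.google.nl/search?q=high+five&ie=utf-8#safe=off&q=high+five+with+a+chair"

# Lookbehind only: every word directly preceded by "q=" or "+",
# so the words of both queries are returned.
all_words <- regmatches(
  url, gregexpr("(?<=q=|[+])[^$+#&]+", url, perl = TRUE)
)[[1]]

# Adding the negative lookahead rejects any word that still has a "q="
# somewhere after it, which leaves only the words of the last query.
last_words <- regmatches(
  url, gregexpr("(?<=q=|[+])[^$+#&]+(?!.*q=)", url, perl = TRUE)
)[[1]]

paste(all_words, collapse = ",")   # "high,five,high,five,with,a,chair"
paste(last_words, collapse = ",")  # "high,five,with,a,chair"
```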
Upvotes: 11
Reputation: 193527
There's got to be a cleaner way, but maybe something like:
sapply(strsplit(keyw$words, "q="), function(x) {
x <- if (length(x) == 2) x[2] else x[3]
gsub("+", ",", gsub("\\+$", "", x), fixed = TRUE)
})
# [1] "high,five" "high,five,with,handshake" "high,five,with,a,chair"
# [4] "five,fingers" "five,short,fingers"
Everything in one go:
keyw$words <- sapply(str_extract_all(keyw$url, 'q=([^&#]*)'),function(x) {
x <- if (length(x) == 2) x[2] else x[1]
x <- gsub("+", ",", gsub("\\+$", "", x), fixed = TRUE)
gsub("q=","",x, fixed = TRUE)
})
Upvotes: 3
Reputation: 52637
Update (borrowing part of the regex from David):
dat <- as.character(keyw$url)
pat <- "^https://www\\.google\\.nl/.*\\bq=([^&]*[^&+]).*"
sapply(
regmatches(dat, regexec(pat, dat)),
function(x) if(!length(x)) NA else gsub("\\+", ",", x[[2]])
)
Produces:
[1] "high,five" "high,five,with,handshake" "high,five,with,a,chair"
[4] "five,fingers" "five,short,fingers" NA
Using:
pat <- "^https://www\\.google.(?:com?.)?[a-z]{2,3}/.*\\b?q=([^&]*[^&+]).*"
takes into account all country-specific Google domains (source).
Or:
gsub("\\+", ",", sub("^.*\\bq=([^&]*).*", "\\1", keyw$url))
Produces:
[1] "high,five" "high,five,with,handshake" "high,five,with,a,chair"
[4] "five,fingers" "five,short,fingers,"
Here we use greediness to make sure we skip everything up to the last q=... part, and then use the standard sub / \\1 trick to capture what we want. Finally, replace + with ,.
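The effect of greediness is easiest to see next to a lazy variant. The sketch below is an illustration only: the URL and the names last_query and first_query are hypothetical, perl = TRUE is used so that the lazy quantifier .*? is available, and the capture group excludes # as well as &, which differs slightly from the pattern above.

```r
# Hypothetical URL with two q= parameters, modeled on the question's data.
url <- "https://www.google.nl/search?q=high+five&ie=utf-8#safe=off&q=high+five+with+a+chair"

# Greedy ".*" consumes as much as it can, so the capture group
# starts after the LAST "q=" in the string.
last_query <- sub("^.*\\bq=([^&#]*).*", "\\1", url, perl = TRUE)

# Lazy ".*?" consumes as little as it can, so the capture group
# starts after the FIRST "q=" instead.
first_query <- sub("^.*?\\bq=([^&#]*).*", "\\1", url, perl = TRUE)

# Turn the "+" separators into commas, as in the answer.
gsub("+", ",", last_query, fixed = TRUE)   # "high,five,with,a,chair"
gsub("+", ",", first_query, fixed = TRUE)  # "high,five"
```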
Upvotes: 5
Reputation: 92292
Or maybe this
gsub("\\+", ",", gsub(".*q=([^&#]*[^+&]).*", "\\1", keyw$url))
# [1] "high,five" "high,five,with,handshake" "high,five,with,a,chair"
# [4] "five,fingers" "five,short,fingers"
Upvotes: 8
Reputation: 24480
I'd try with:
x<-as.character(keyw$url)
vapply(regmatches(x,gregexpr("(?<=q=)[^&]+",x,perl=TRUE)),
function(y) paste(unique(unlist(strsplit(y,"\\+"))),collapse=","),"")
#[1] "high,five" "high,five,with,handshake"
#[3] "high,five,with,a,chair" "five,fingers"
#[5] "five,fingers,short"
Upvotes: 3