Reputation: 23
I have a data file which has a URL column in it. It looks something like this
"https://www.google.com/ | query_string=utm_source=abc&utm_medium=yts&utm_campaign=123campaign&utm_term=camp%123&utm_content=brand&gclid=abcdefg|user_agent=xyz"
I want these data in seperate columns with their respective values as shown below
utm_source utm_medium utm_campaign utm_term utm_content user_agent
abc yts 123campaign camp%123 brand xyz
Using dput for URL results in
c("https://www.google.com/ | query_string=null | ip_address=123.113.64.211 | user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36","https://www.google.com/ | query_string=gclid=Lxi6sNo-A17RohDAcQgvD_fw4 | ip_address=167.11.116.237 | user_agent=Mozilla/5.0 (Linux; Android 8.0.0; SM-C701F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36","http://m.facebook.com/ | query_string=utm_source=fb&utm_medium=ctw&utm_campaign=abcPant_rem&utm_content=PantShirt | ip_address=106.193.181.252 | user_agent=Mozilla/5.0 (Linux; Android 10; SM-G975F Build/QP1A.190711.020; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/81.0.4044.138 Mobile Safari/537.36 [FB_IAB/FB4A;FBAV/218.0.0.32.158;]")
Upvotes: 0
Views: 327
Reputation: 174348
Only one of the entries in URL
contains multiple fields in the query string, and the first one doesn't contain any. You can't really make a data frame from the example in the question, but you can make a list of named vectors containing the fields in the query string like this:
queries <- sapply(strsplit(sapply(strsplit(URL, "query_string="),
`[`, 2), " \\|"), `[`, 1)
lapply(strsplit(queries, "\\&|="), function(x)
setNames(x[seq(length(x)/2) * 2], x[seq(length(x)/2) * 2 - 1]))
#> [[1]]
#> null
#> NA
#>
#> [[2]]
#> gclid
#> "Lxi6sNo-A17RohDAcQgvD_fw4"
#>
#> [[3]]
#> utm_source utm_medium utm_campaign utm_content
#> "fb" "ctw" "abcPant_rem" "PantShirt"
Upvotes: 0
Reputation: 9107
Here's a regex solution using the URL provided.
url <- "https://www.google.com/ | query_string=utm_source=abc&utm_medium=yts&utm_campaign=123campaign&utm_term=camp%123&utm_content=brand&gclid=abcdefg|user_agent=xyz"
str_match_all
extracts patterns.
\\w+
: Match one or more word characters(...)
: Capture group(?:...)?
: Match group zero or one times, but don't capture the group. This is used to handle the query_string=
part of the URL.stringr::str_match_all(url, "(?:\\w+=)?(\\w+)=(\\w+)")
#> [[1]]
#> [,1] [,2] [,3]
#> [1,] "query_string=utm_source=abc" "utm_source" "abc"
#> [2,] "utm_medium=yts" "utm_medium" "yts"
#> [3,] "utm_campaign=123campaign" "utm_campaign" "123campaign"
#> [4,] "utm_term=camp" "utm_term" "camp"
#> [5,] "utm_content=brand" "utm_content" "brand"
#> [6,] "gclid=abcdefg" "gclid" "abcdefg"
#> [7,] "user_agent=xyz" "user_agent" "xyz"
str_match_all
returns a list of matrices where the first column is the complete match followed by each of the captured groups. Keep only the captured groups.
stringr::str_match_all(url, "(?:\\w+=)?(\\w+)=(\\w+)")[[1]][,2:3]
#> [,1] [,2]
#> [1,] "utm_source" "abc"
#> [2,] "utm_medium" "yts"
#> [3,] "utm_campaign" "123campaign"
#> [4,] "utm_term" "camp"
#> [5,] "utm_content" "brand"
#> [6,] "gclid" "abcdefg"
#> [7,] "user_agent" "xyz"
Upvotes: 0