Ignacio

Reputation: 7938

Process destination string from log file to extract relevant data?

I'm trying to analyse a log file from nginx. In particular, I want to add to my data frame the first path segment, i.e. the text right after the first /. For example:

df1 <- data.frame(V5 = c("GET /SOMETHING1/__assets__/shiny-server.js HTTP/1.1",
                         "GET /SOMETHING2/shared/jquery.min.js HTTP/1.1",
                         "GET /SOMETHING3/AdminLTE-2.0.6/AdminLTE.min.css HTTP/1.1",
                         "POST /SOMETHING1/__sockjs__/n=B8x2Q3IWu2PhwngjN6/831/q6rt9t8u/xhr HTTP/1.1",
                         "GET /SOMETHING3/shared/bootstrap/css/bootstrap.min.css HTTP/1.1"))

I would like to add a column, Something, to that data frame; it would take the values SOMETHING1, SOMETHING2, SOMETHING3, SOMETHING1, SOMETHING3. Right now I'm playing with stringr, and I can get a list in which the second element of each entry holds the information I want:

stringr::str_split(df1$V5,pattern = "/") 

Alas, I'm not sure how to use that to create the variable that I want.

Upvotes: 1

Views: 38

Answers (1)

Sotos

Reputation: 51592

You can easily do it with regex and gsub, but I would recommend cleaning the request lines first (stripping GET, POST, HTTP/1.1, etc.) and then using urltools to extract the domain, path, port, ...

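clean_gateway <- function(x){
 # drop the trailing protocol (e.g. " HTTP/1.1"), then any trailing ":port"
 z <- gsub("\\:[0-9]*$", "", gsub(" HTTP/[0-9.]+$", "", x))
 # drop trailing dots
 y <- gsub("\\.*$", "", z)
 # drop the leading request method (GET, POST, ...) up to the first space
 w <- gsub("^.*? ", "", y)
 w
 }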

library(urltools)
df1$v5 <- clean_gateway(df1$V5)
url_parse(df1$v5)

Building on the approach above, take the parsed path and strip everything from its first / onwards:

gsub('/.*', '', url_parse(df1$V5)$path)
#[1] "SOMETHING1" "SOMETHING2" "SOMETHING3" "SOMETHING1" "SOMETHING3"
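
The answer mentions that this can also be done with regex alone, and the question already gets close with a split on "/". A minimal base-R sketch of both follows, assuming every request line has the form METHOD /segment/... as in the sample data. The column name Something comes from the question; the helper variables pieces and second are only illustrative names.

```r
# Same sample data as in the question
df1 <- data.frame(
  V5 = c("GET /SOMETHING1/__assets__/shiny-server.js HTTP/1.1",
         "GET /SOMETHING2/shared/jquery.min.js HTTP/1.1",
         "GET /SOMETHING3/AdminLTE-2.0.6/AdminLTE.min.css HTTP/1.1",
         "POST /SOMETHING1/__sockjs__/n=B8x2Q3IWu2PhwngjN6/831/q6rt9t8u/xhr HTTP/1.1",
         "GET /SOMETHING3/shared/bootstrap/css/bootstrap.min.css HTTP/1.1"),
  stringsAsFactors = FALSE
)

# Option 1: capture the text after the first "/" up to the next "/" or space
df1$Something <- sub("^[^/]*/([^/ ]*).*$", "\\1", df1$V5)

# Option 2: split on "/" and keep the second piece of each element
# (base-R counterpart of the stringr::str_split call in the question)
pieces <- strsplit(df1$V5, "/", fixed = TRUE)
second <- vapply(pieces, function(p) p[2], character(1))

df1$Something
#[1] "SOMETHING1" "SOMETHING2" "SOMETHING3" "SOMETHING1" "SOMETHING3"
```

Both options agree on this sample. They can differ on lines with no second "/" before the protocol (for example a request for /SOMETHING1 alone), where only the regex in option 1 stops at the space.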

Upvotes: 1
