DanG

Reputation: 741

Parsing Server Log with Rex package in R

I have server log data in a format that I want to parse.

Here are the first two rows:

test <- c("5638052581 \"Norway|Oslo County|Oslo|3163036322|503858711|160449504|y|\" n - - [31/Oct/2019:13:00:01 +0000] \"GET /P04_AL?args=app_01&distributor=p4&player=app&playeros=ios&referrer=1&station=1&codec=aac&quality=low&deviceid=1D6A84DA-92A6-4AD1-A2A3-1AB20D2263B2&listenerid=61D1F2EB-7B35-4434-9D8B-A6D074BE28F0&userid=fczUdjf5yEU8j4JlZHG4JXABgiZ2&aw_0_1st.audience=%5B%22P7ActiveListeners%22,%20%22p5hitsactive%22,%20%22P6ActiveListeners%22,%20%22P4ActiveListeners%22,%20%22AppInstalledP4%22%5D HTTP/1.1\" 200 4305805 \"-\" \"AppleCoreMedia//1.0.0.17B84 (iPhone; U; CPU OS 13_2 like Mac OS X; nb_no)\" 702", "616118387 \"Netherlands|North Holland|Haarlem|631068861|616118387|862817723||\" n - - [31/Oct/2019:13:00:01 +0000] \"GET /P04_MH HTTP/1.1\" 200 519546 \"-\" \"MultiRoomAudioPlayer//5.1\" 6")

I am trying to use the rex package as shown below, but I keep getting an error about unexpected input. What am I doing wrong? Can somebody help me with this? Here is my attempt for only one record (the first element of the vector):

library(rex)
re_logic <- rex(
  
  capture(name = "process_id", digits),
  "`\´",
  capture(name = "country", non_spaces),
  "|",
  capture(name = "county", non_spaces),
  "|",
  capture(name = "city", non_spaces), 
  "|",
  capture(name = "x1", digits), 
  "|",
  capture(name = "x2", digits),
  "|",
  capture(name = "x3", digits),
  "|",
  capture(name = "process_name", alpha),
  "`n - -´",
  spaces,
  "[",
  capture(name = "accept_date", except_some_of("]")),
  "]",
  spaces,
  "`\´",
  capture(name = "http_request", non_quotes),
  "`\´",
  spaces,
  capture(name = "status_code", digits),
  spaces,
  capture(name = "bytes_read", some_of("+", digit)),
  "`" \"´",
  capture(name = "actconn", digits),
  "`//´",
  spaces,
  "(",
  capture(name = "Tr", non_quotes),
  ";" )
  

# sample view
re_matches(test, re_logic) %>% as_tibble()

Upvotes: 2

Views: 57

Answers (1)

Wiktor Stribiżew

Reputation: 627292

You can use

library(rex)

re_logic <- rex(
  capture(name = "process_id", digits),
  spaces, quote,
  capture(name = "country", except_some_of("|")),
  "|",
  capture(name = "county", except_some_of("|")),
  "|",
  capture(name = "city", except_some_of("|")), 
  "|",
  capture(name = "x1", digits), 
  "|",
  capture(name = "x2", digits),
  "|",
  capture(name = "x3", digits),
  "|",
  capture(name = "process_name", zero_or_more(alpha)),
  "|", quote, spaces, "n", spaces, "-", spaces, "-", spaces,
  "[",
  capture(name = "accept_date", except_some_of("]","[")),
  "]",
  spaces, quote,
  capture(name = "http_request", non_quotes),
  quote, spaces,
  capture(name = "status_code", digits),
  spaces,
  capture(name = "bytes_read", some_of("+", digit)),
  spaces, quote, non_quotes, quote, spaces, quote,
  capture(name = "actconn", except_some_of(quote, "/")),
  "/", non_spaces,
  maybe(
    spaces, "(",
    capture(name = "Tr", except_some_of(";"))
  )
)
re_matches(test, re_logic)

See the regex demo.

NOTES:

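  • I used quote to match any ' or " character.
  • Instead of non_spaces, which cannot match geonames that contain spaces (e.g. Oslo County), I used except_some_of("|"), which matches one or more characters other than |.
  • The Tr part is optional (the second record has no parenthesized part), so the pattern chain for that group must be wrapped in a maybe() clause.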
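
To see the three notes in isolation, here is a minimal sketch on a shortened, made-up pair of records (the vector x, the field layout, and the capture names id, agent and device are assumptions for illustration, not part of the real log format). It uses only rex functions that already appear in the pattern above.

```r
library(rex)

# Hypothetical, shortened records: the geo field contains a space,
# and only the first record has the optional "(...)" part.
x <- c(
  '1 "Norway|Oslo County|Oslo" Agent/1.0 (iPhone; U)',
  '2 "Netherlands|North Holland|Haarlem" Agent/5.1'
)

re <- rex(
  capture(name = "id", digits),
  spaces, quote,                                   # quote matches ' or "
  capture(name = "country", except_some_of("|")),  # any chars except |
  "|",
  capture(name = "county", except_some_of("|")),   # may contain spaces
  "|",
  capture(name = "city", non_quotes),
  quote, spaces,
  capture(name = "agent", except_some_of("/")),
  "/", non_spaces,
  maybe(                                           # optional trailing part
    spaces, "(",
    capture(name = "device", except_some_of(";"))
  )
)

res <- re_matches(x, re)
res
```

With non_spaces in place of except_some_of("|"), the county capture would not be able to cover "Oslo County" or "North Holland", and without maybe() the second record would not match at all because it has no "(" part.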

Upvotes: 1
