Reputation: 741
I have server log data in a format that I want to parse.
Here are the first two rows:
test <- c("5638052581 \"Norway|Oslo County|Oslo|3163036322|503858711|160449504|y|\" n - - [31/Oct/2019:13:00:01 +0000] \"GET /P04_AL?args=app_01&distributor=p4&player=app&playeros=ios&referrer=1&station=1&codec=aac&quality=low&deviceid=1D6A84DA-92A6-4AD1-A2A3-1AB20D2263B2&listenerid=61D1F2EB-7B35-4434-9D8B-A6D074BE28F0&userid=fczUdjf5yEU8j4JlZHG4JXABgiZ2&aw_0_1st.audience=%5B%22P7ActiveListeners%22,%20%22p5hitsactive%22,%20%22P6ActiveListeners%22,%20%22P4ActiveListeners%22,%20%22AppInstalledP4%22%5D HTTP/1.1\" 200 4305805 \"-\" \"AppleCoreMedia//1.0.0.17B84 (iPhone; U; CPU OS 13_2 like Mac OS X; nb_no)\" 702", "616118387 \"Netherlands|North Holland|Haarlem|631068861|616118387|862817723||\" n - - [31/Oct/2019:13:00:01 +0000] \"GET /P04_MH HTTP/1.1\" 200 519546 \"-\" \"MultiRoomAudioPlayer//5.1\" 6")
I am trying to use the rex package as shown below, but I keep getting an error about unexpected input. What am I doing wrong? Can somebody help me with this? Here is my attempt for only one record (the first element of the vector):
library(rex)
re_logic <- rex(
capture(name = "process_id", digits),
"`\´",
capture(name = "country", non_spaces),
"|",
capture(name = "county", non_spaces),
"|",
capture(name = "city", non_spaces),
"|",
capture(name = "x1", digits),
"|",
capture(name = "x2", digits),
"|",
capture(name = "x3", digits),
"|",
capture(name = "process_name", alpha),
"`n - -´",
spaces,
"[",
capture(name = "accept_date", except_some_of("]")),
"]",
spaces,
"`\´",
capture(name = "http_request", non_quotes),
"`\´",
spaces,
capture(name = "status_code", digits),
spaces,
capture(name = "bytes_read", some_of("+", digit)),
"`" \"´",
capture(name = "actconn", digits),
"`//´",
spaces,
"(",
capture(name = "Tr", non_quotes),
";" )
# sample view
re_matches(test, re_logic) %>% as_tibble()
Upvotes: 2
Views: 57
Reputation: 627292
You can use
re_logic <- rex(
capture(name = "process_id", digits),
spaces, quote,
capture(name = "country", except_some_of("|")),
"|",
capture(name = "county", except_some_of("|")),
"|",
capture(name = "city", except_some_of("|")),
"|",
capture(name = "x1", digits),
"|",
capture(name = "x2", digits),
"|",
capture(name = "x3", digits),
"|",
capture(name = "process_name", zero_or_more(alpha)),
"|", quote, spaces, "n", spaces, "-", spaces, "-", spaces,
"[",
capture(name = "accept_date", except_some_of("]","[")),
"]",
spaces, quote,
capture(name = "http_request", non_quotes),
quote, spaces,
capture(name = "status_code", digits),
spaces,
capture(name = "bytes_read", some_of("+", digit)),
spaces, quote, non_quotes, quote, spaces, quote,
capture(name = "actconn", except_some_of(quote, "/")),
"/", non_spaces,
maybe(
spaces, "(",
capture(name = "Tr", except_some_of(";"))
)
)
re_matches(test, re_logic)
See the regex demo.
NOTES:
- Use quote to match any ' or " char.
- Instead of non_spaces, I matched geonames with any chars but |, i.e. except_some_of("|"), because names such as Oslo County contain spaces.
- The Tr part is optional, so you need to wrap the pattern chain related to that group in a maybe clause.
Upvotes: 1
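To try the three points from the notes in isolation, here is a minimal sketch on two shortened lines. The lines, the group names agent and device, and the variable names are made up for illustration; only the rex helpers already used in the answer appear in it.

```r
library(rex)

# Hypothetical shortened log lines: the first has a "(device; ..." part,
# the second does not, and both contain a geoname with a space.
short_lines <- c(
  "Norway|Oslo County|Oslo| \"AppleCoreMedia/1.0 (iPhone; U)\"",
  "Netherlands|North Holland|Haarlem| \"MultiRoomAudioPlayer/5.1\""
)

short_logic <- rex(
  # except_some_of("|") accepts spaces, so "Oslo County" is kept whole
  capture(name = "country", except_some_of("|")),
  "|",
  capture(name = "county", except_some_of("|")),
  "|",
  capture(name = "city", except_some_of("|")),
  "|",
  # quote stands for either ' or "
  spaces, quote,
  capture(name = "agent", except_some_of(quote, "/")),
  "/", non_spaces,
  # maybe() lets the whole line match even when the "(...;" part is absent
  maybe(
    spaces, "(",
    capture(name = "device", except_some_of(";"))
  )
)

result <- re_matches(short_lines, short_logic)
print(result)
```

If the maybe wrapper is dropped, the second line should no longer match at all, which is the same reason the full pattern needs it for the Tr group.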