Agecanonix

Reputation: 11

Parsing log files with « R »

I'd like to get all source IPs from some firewall logs.

First, which import method do you recommend for importing logs whose rows have different lengths?

Sample rawdata:

Sep  7 13:10:01 XXX.XXX.XXX.XXX id=firewall time="2018-09-07 13:10:01" fw="XXXXX-ISSP" tz=+0200 startime="2018-09-07 13:10:00" pri=5 confid=01 slotlevel=2 ruleid=102 srcif="vlan3" srcifname="XXXXX" ipproto=tcp dstif="vlan6" dstifname="XXXXX" proto=tcp5666 src=XXX.XXX.XXX.XXX srcport=55617 srcportname=ephemeral_fw_tcp srcname=XXXXX.service.noissp.XXXXX.corp srcmac=YY:YY:YY:YY:YY:YY dst=10.95.160.7 dstport=5666 dstportname=tcp5666 dstname=XXXXX.biz.noissp.XXXXX.corp modsrc=XXX.XXX.XXX.XXX modsrcport=55617 origdst=XXX.XXX.XXX.XXX origdstport=5666 ipv=4 sent=1412 rcvd=1596 duration=0.18 action=pass logtype="connection"
Sep  7 13:10:01 XXX.XXX.XXX.XXX id=firewall time="2018-09-07 13:10:01" fw="XXXXX-ISSP" tz=+0200 startime="2018-09-07 13:10:00" pri=5 confid=01 slotlevel=2 ruleid=810 srcif="vlan3" srcifname="XXXXX" ipproto=udp dstif="Ethernet18" dstifname="FTLAN-XXX" proto=syslog src=XXX.XXX.XXX.XXX srcport=36147 srcportname=ephemeral_fw_udp srcname=XXXXX.service.noissp.XXXXX.corp srcmac=YY:YY:YY:YY:YY:YY dst=XXX.CXX.CXX.XXX dstport=514 dstportname=syslog dstname=XXXXX ipv=4 action=block logtype="filter"
Sep  7 13:10:01 XXX.XXX.XXX.XXX id=firewall time="2018-09-07 13:10:01" fw="XXXXX-ISSP" tz=+0200 startime="2018-09-07 12:10:00" pri=5 confid=01 slotlevel=2 ruleid=273 srcif="vlan6" srcifname="XXXXX" ipproto=udp dstif="vlan6" dstifname="XXXXX" proto=dns_udp src=XXX.XXX.XXX.XXX srcport=60737 srcportname=XXX-dyn_tcp srcmac=YY:YY:YY:YY:YY:YY dst=XXX.XXX.XXX.XXX dstport=53 dstportname=dns_udp dstname=XXXXX-biznoIssp.biz.noissp modsrc=XXX.XXX.XXX.XXX modsrcport=60737 origdst=XXX.XXX.XXX.XXX origdstport=53 ipv=4 sent=54 rcvd=114 duration=0.00 action=pass logtype="connection"

I tried read_lines to avoid getting an error from lines of different lengths:

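Import the log file (read_lines comes from readr; str_split and str_subset come from stringr)

library(readr)
library(stringr)
rawdata <- read_lines(file="./input.txt")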

Remove double quotes on each line

a <- gsub("\"" , "", rawdata)

Split the line into multiple strings

b <- str_split(a, " ")

But at this point, b is only a vector:

> dim(b)
NULL
> length(b)
[1] 10

str_subset(b, "src=") returns the full line instead of just the src field. I'm doing something wrong...

How may I extract this information?

Upvotes: 0

Views: 97

Answers (1)

Gregor Thomas

Reputation: 146030

All your code is fine. str_split returns a list:

class(b)
# [1] "list"

length(b)
# [1] 3
lengths(b)
# [1] 41 34 40

There is one list item for each of your input lines, and each list item is a character vector (the raw line split on spaces). We can sapply (or lapply) str_subset over the list items:

sapply(b, str_subset, pattern = "src=")
# [[1]]
# [1] "src=XXX.XXX.XXX.XXX"    "modsrc=XXX.XXX.XXX.XXX"
# 
# [[2]]
# [1] "src=XXX.XXX.XXX.XXX"
# 
# [[3]]
# [1] "src=XXX.XXX.XXX.XXX"    "modsrc=XXX.XXX.XXX.XXX"

You might want to anchor the regex so that it excludes the modsrc entries:

sapply(b, str_subset, pattern = "^src=")
# [1] "src=XXX.XXX.XXX.XXX" "src=XXX.XXX.XXX.XXX" "src=XXX.XXX.XXX.XXX"
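
If only the bare addresses are wanted, the "src=" prefix can be stripped afterwards. The following is a minimal sketch, not part of the original code: the two log lines are made up for illustration (shortened, with addresses from the documentation range 192.0.2.0/24), and it assumes every line contains exactly one src= field so that sapply simplifies to a character vector.

```r
library(stringr)

# Hypothetical, shortened sample lines (NOT the real firewall logs)
rawdata <- c(
  "Sep  7 13:10:01 192.0.2.1 id=firewall src=192.0.2.10 srcport=55617 modsrc=192.0.2.10 action=pass",
  "Sep  7 13:10:01 192.0.2.1 id=firewall src=192.0.2.20 srcport=36147 action=block"
)

# Split each line on spaces: one character vector per line
b <- str_split(rawdata, " ")

# Keep only the tokens that start with "src=" (this skips srcport= and modsrc=)
src_fields <- sapply(b, str_subset, pattern = "^src=")

# Drop the "src=" prefix to leave just the address
src_ips <- str_remove(src_fields, "^src=")
src_ips
```

If some lines have no src= field, sapply returns a list instead of a vector, so this shortcut only suits logs where the field is always present.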

We could also go directly from rawdata without splitting or anything:

str_extract_all(rawdata, pattern = "\\bsrc=[^ ]*")
# [[1]]
# [1] "src=XXX.XXX.XXX.XXX"
# 
# [[2]]
# [1] "src=XXX.XXX.XXX.XXX"
# 
# [[3]]
# [1] "src=XXX.XXX.XXX.XXX"
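
When a plain vector of values is more convenient than a list, a capture group with str_match is one possible variation. This is a sketch under assumptions: the sample lines are invented for illustration (shortened, with documentation-range addresses), and the third line deliberately lacks a src= field to show what happens in that case.

```r
library(stringr)

# Hypothetical, shortened sample lines (NOT the real firewall logs)
rawdata <- c(
  "Sep  7 13:10:01 192.0.2.1 id=firewall src=192.0.2.10 srcport=55617 modsrc=192.0.2.10 action=pass",
  "Sep  7 13:10:01 192.0.2.1 id=firewall src=192.0.2.20 srcport=36147 action=block",
  "Sep  7 13:10:02 192.0.2.1 id=firewall action=pass"
)

# str_match returns a matrix: column 1 is the full match, column 2 the capture group.
# \\b keeps "modsrc=" from matching; lines without a src= field yield NA.
m <- str_match(rawdata, "\\bsrc=([^ ]*)")
src_ips <- m[, 2]
src_ips
```

The result has one element per input line, which makes it straightforward to attach as a column next to other extracted fields.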

Upvotes: 1
