Reputation: 71
I would like to sort my log file (~5 GB) for unique connection events. Only the combination (SRC_IP + DST_IP) has to be unique, but the timestamps and the other information should be kept.
Example:
1 Feb 5 14:59:00 initf="eth0" outift="eth1" srcip="192.168.0.2" dstip="10.10.10.2"...
2 Feb 5 14:59:00 initf="eth0" outift="eth1" srcip="192.168.0.1" dstip="10.10.10.2"...
3 Feb 5 14:59:00 initf="eth0" outift="eth1" srcip="192.168.0.2" dstip="10.10.10.1"...
4 Feb 5 14:59:00 initf="eth0" outift="eth1" srcip="192.168.0.2" dstip="10.10.10.2"...
5 Feb 5 14:59:00 initf="eth0" outift="eth1" srcip="192.168.0.2" dstip="10.10.10.2"...
The output events should be:
1 Feb 5 14:59:00 initf="eth0" outift="eth1" srcip="192.168.0.2" dstip="10.10.10.2"...
2 Feb 5 14:59:00 initf="eth0" outift="eth1" srcip="192.168.0.1" dstip="10.10.10.2"...
3 Feb 5 14:59:00 initf="eth0" outift="eth1" srcip="192.168.0.2" dstip="10.10.10.1"...
because the combination of src + dst IP is unique. I tried this with sort -uk column, but it doesn't work as intended. Also, the columns holding the src and dst IP are not consistent: they sometimes shift because, depending on the out-interface, the dstmac is logged or not.
Maybe an AWK script could do the trick?
EDIT
Since Karakfa made a good suggestion for solving this with awk, I am currently trying to replace [$7,$8] with a regex:
awk '!a[regexpression for src ip, regexpression for dst ip]++' file
Upvotes: 0
Views: 26
Reputation: 67467
Assuming there are no spaces in the first 8 field values, this will give you the first appearance of each key combination.
$ awk '!a[$7,$8]++' file
This doesn't require sorted input (and won't change the order itself); you can pipe the output into sort to get your desired order. If the field order is not fixed, you can do something like this:
$ awk '{for(i=1;i<=NF;i++) if($i~/^srcip=/) s=$i; else if($i~/^dstip=/) d=$i}
!a[s,d]++;
{s=d=""}' file
Note that records with a missing srcip or dstip field will be grouped as well, because the missing part of the key is simply empty. You may want to print all of those individually.
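If you would rather pull the two values out with a regex, as in the question's edit, and print incomplete records individually, something along these lines might work. This is only a sketch: the function name uniq_conn is made up for illustration, and it assumes the values are always written as srcip="..." and dstip="..." with double quotes and that no other key ends in srcip= or dstip=.

```shell
# Hypothetical helper: keep the first record for each (srcip, dstip) pair.
# Assumes the log writes the values as srcip="..." and dstip="...".
uniq_conn() {
  awk '
    {
      s = d = ""
      # match() sets RSTART and RLENGTH when the pattern is found
      if (match($0, /srcip="[^"]*"/)) s = substr($0, RSTART, RLENGTH)
      if (match($0, /dstip="[^"]*"/)) d = substr($0, RSTART, RLENGTH)
    }
    # incomplete key: print the record as-is and skip the grouping
    s == "" || d == "" { print; next }
    # complete key: print only the first occurrence of the pair
    !seen[s, d]++
  ' "$@"
}
```

You would call it as uniq_conn file > unique.log, or feed it from a pipe. The array only holds one entry per distinct pair, so memory use should depend on the number of unique connections rather than on the 5 GB file size.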
Upvotes: 1