Reputation: 35
I have a 7GB csv file from which I need to find the first n values which occurred mostly in the 5th column. The format of my file is as:
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","389","65311"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","88","49194"
"instance id","src IP","destination ip","12489","49194"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","63812","389"
Using the following command
awk -F , ' !count[$5]++ {save[$5] = $0; next} count[$5] > 0 { print save[$5] " is " count[$5] " times"} ' example.csv
able to find the following output
"instance id","src IP","destination ip","63812","389" is 1 times
"instance id","src IP","destination ip","389","65311" is 1 times
"instance id","src IP","destination ip","63812","389" is 2 times
"instance id","src IP","destination ip","88","49194" is 1 times
"instance id","src IP","destination ip","12489","49194" is 2 times
"instance id","src IP","destination ip","63812","389" is 3 times
"instance id","src IP","destination ip","63812","389" is 4 times
but not able to understand how to get the top 50 rows in which the 5th column entry duplicated most of the time.
Let's suppose if n=2, then my output should be like following:
"instance id","src IP","destination ip","63812","389" is 4 times
"instance id","src IP","destination ip","12489","49194" is 2 times
Upvotes: 1
Views: 247
Reputation: 11188
If you don't need to use awk
, the command-line tool GoCSV has a number of sub-commands for getting you the top n unique values for a column, sorted and culled by the count of those values.
GoCSV expects a header, so for your input the first step is to add some default column names (which can be stripped away later):
gocsv cap -default-name 'Col' input.csv
With a header in place you can pipe that output to a series of commands that will:
uniq
)... \
| gocsv uniq -c 5 -count \
| gocsv sort -c 6 -reverse
| gocsv head -n 50
Running all that I get:
Col 1,Col 2,Col 3,Col 4,Col 5,Count
instance id,src IP,destination ip,63812,389,4
instance id,src IP,destination ip,88,49194,2
instance id,src IP,destination ip,389,65311,1
To get rid of the header, just pipe that into gocsv behead
.
It's pre-built for a number of platform/OS-es.
Upvotes: 1
Reputation: 26481
This method is using GNU awk extensions:
Return n most frequent values in column p:
awk 'BEGIN{n=50; p=5; PROCINFO["sorted_in"]="@val_num_desc"} {a[$p]++}
END { for(i in a) { if (!n--) { break }; print i } }' file
Return last record of n most frequent values in column p:
awk 'BEGIN{n=50; p=5; PROCINFO["sorted_in"]="@val_num_desc"} {a[$p]++;b[$p]=$0}
END { for(i in a) { if (!n--) { break }; print b[i] } }' file
Adopting the above to the expected output of the OP:
awk 'BEGIN{n=50; p=5; FS=","; PROCINFO["sorted_in"]="@val_num_desc"}
{a[$p]++;b[$p]=$0}
END { for(i in a) { if (!n--) { break }; print b[i],"is",a[i],"times" } }' file
Upvotes: 1
Reputation: 203655
Using any sort+awk+head:
$ sort -t, -k5,5 file |
awk '
BEGIN { FS=OFS="," }
$5 != p5 { if (NR>1) print p0, cnt; p5=$5; p0=$0; cnt=0 }
{ cnt++ }
END { print p0, cnt }' |
sort -t, -k6,6rn |
head -2
"instance id","src IP","destination ip","63812","389",4
"instance id","src IP","destination ip","12489","49194",2
Upvotes: 1
Reputation: 17491
If I were you, I would alter the output into something like:
1 times : "instance id","src IP","destination ip","63812","389" is 1 times
1 times : "instance id","src IP","destination ip","389","65311" is 1 times
2 times : "instance id","src IP","destination ip","63812","389" is 2 times
1 times : "instance id","src IP","destination ip","88","49169" is 1 times
2 times : "instance id","src IP","destination ip","12489","49194" is 2 times
3 times : "instance id","src IP","destination ip","63812","389" is 3 times
4 times : "instance id","src IP","destination ip","63812","389" is 4 times
And then add | sort -n
after your command.
(I've been trying using sort -kx -n
but the combination of double quotes, commas and spaces messes up the -k
switch of the sort
command.)
Upvotes: 1