naeem

Reputation: 35

Find the n largest occurrences of duplicate values in a column

I have a 7 GB CSV file from which I need to find the n values that occur most often in the 5th column. The format of my file is:

"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","389","65311"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","88","49194"
"instance id","src IP","destination ip","12489","49194"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","63812","389"

Using the following command

awk -F , ' !count[$5]++ {save[$5] = $0; next} count[$5] > 0 { print save[$5] " is " count[$5] " times"} ' example.csv

I am able to produce the following output:

"instance id","src IP","destination ip","63812","389" is 1 times
"instance id","src IP","destination ip","389","65311" is 1 times
"instance id","src IP","destination ip","63812","389" is 2 times
"instance id","src IP","destination ip","88","49194" is 1 times
"instance id","src IP","destination ip","12489","49194" is 2 times
"instance id","src IP","destination ip","63812","389" is 3 times
"instance id","src IP","destination ip","63812","389" is 4 times

but I am not able to work out how to get the top 50 rows whose 5th-column entry is duplicated most often.

For example, if n=2, my output should be:

"instance id","src IP","destination ip","63812","389" is 4 times
"instance id","src IP","destination ip","12489","49194" is 2 times

Upvotes: 1

Views: 247

Answers (4)

Zach Young

Reputation: 11188

If you don't need to use awk, the command-line tool GoCSV has a number of sub-commands for getting you the top n unique values for a column, sorted and culled by the count of those values.

GoCSV expects a header, so for your input the first step is to add some default column names (which can be stripped away later):

gocsv cap -default-name 'Col' input.csv

With a header in place you can pipe that output to a series of commands that will:

  • keep only unique values from the 5th column, adding a count of how many times each value was seen
  • sort descending by that count (the new 6th column added by uniq)
  • keep only the first 50 rows (the "top 50 counts")
... \
| gocsv uniq -c 5 -count \
| gocsv sort -c 6 -reverse \
| gocsv head -n 50

Running all that I get:

Col 1,Col 2,Col 3,Col 4,Col 5,Count
instance id,src IP,destination ip,63812,389,4
instance id,src IP,destination ip,88,49194,2
instance id,src IP,destination ip,389,65311,1

To get rid of the header, just pipe that into gocsv behead.

Pre-built binaries are available for a number of platforms and OSes.

Upvotes: 1

kvantour

Reputation: 26481

This method uses GNU awk extensions:

Return n most frequent values in column p:

awk 'BEGIN{n=50; p=5; PROCINFO["sorted_in"]="@val_num_desc"} {a[$p]++}
     END { for(i in a) { if (!n--) { break }; print i } }' file

Return last record of n most frequent values in column p:

awk 'BEGIN{n=50; p=5; PROCINFO["sorted_in"]="@val_num_desc"} {a[$p]++;b[$p]=$0}
     END { for(i in a) { if (!n--) { break }; print b[i] } }' file

Adapting the above to the expected output of the OP:

awk 'BEGIN{n=50; p=5; FS=","; PROCINFO["sorted_in"]="@val_num_desc"}
     {a[$p]++;b[$p]=$0}
     END { for(i in a) { if (!n--) { break }; print b[i],"is",a[i],"times" } }' file
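If GNU awk (and its PROCINFO["sorted_in"] traversal order) isn't available, the same idea works in any POSIX awk by delegating the ordering to sort and head. A sketch with n=2 on the question's sample, recreated here under the assumed name example.csv:

```shell
# Recreate the sample from the question (assumption: it is named example.csv)
cat > example.csv <<'EOF'
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","389","65311"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","88","49194"
"instance id","src IP","destination ip","12489","49194"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","63812","389"
EOF

# Count each 5th-column value and remember its last record; prefix the
# count so sort can order numerically, keep the top n=2, then strip the
# sort key again with cut.
awk -F, '{ cnt[$5]++; row[$5] = $0 }
         END { for (k in cnt) print cnt[k] "\t" row[k] " is " cnt[k] " times" }' example.csv |
sort -rn | head -n 2 | cut -f2-
```

On this sample it prints the two lines the question expects, most frequent first.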

Upvotes: 1

Ed Morton

Reputation: 203655

Using any sort+awk+head:

$ sort -t, -k5,5 file |
    awk '
        BEGIN { FS=OFS="," }
        $5 != p5 { if (NR>1) print p0, cnt; p5=$5; p0=$0; cnt=0 }
        { cnt++ }
        END { print p0, cnt }' |
    sort -t, -k6,6rn |
    head -2
"instance id","src IP","destination ip","63812","389",4
"instance id","src IP","destination ip","12489","49194",2

Upvotes: 1

Dominique

Reputation: 17491

If I were you, I would alter the output into something like:

1 times : "instance id","src IP","destination ip","63812","389" is 1 times
1 times : "instance id","src IP","destination ip","389","65311" is 1 times
2 times : "instance id","src IP","destination ip","63812","389" is 2 times
1 times : "instance id","src IP","destination ip","88","49194" is 1 times
2 times : "instance id","src IP","destination ip","12489","49194" is 2 times
3 times : "instance id","src IP","destination ip","63812","389" is 3 times
4 times : "instance id","src IP","destination ip","63812","389" is 4 times

And then add | sort -n after your command.

(I tried using sort -kx -n, but the combination of double quotes, commas and spaces messes up the -k switch of the sort command.)
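A runnable sketch of that idea on the question's sample (recreated here under the assumed name example.csv): have awk prefix each line with a running count for the 5th column, then sort -n orders by that leading number, so the most-duplicated value ends up on the last lines.

```shell
# Recreate the question's sample (assumption: it is named example.csv)
cat > example.csv <<'EOF'
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","389","65311"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","88","49194"
"instance id","src IP","destination ip","12489","49194"
"instance id","src IP","destination ip","63812","389"
"instance id","src IP","destination ip","63812","389"
EOF

# Prefix every line with its running count for column 5; sort -n then
# sorts by that leading number, so the 4-times line comes out last.
awk -F, '{ print ++count[$5] " times : " $0 }' example.csv | sort -n
```

This avoids the -k quoting problem entirely, since the sort key is now a plain number at the start of each line.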

Upvotes: 1
