17th Lvl Botanist

Reputation: 155

Unix: Find duplicate occurrences in column in csv file, omit one possible value

I am hoping for a line or two of code for a bash script that finds and prints repeated items in a column of a 2.5 GB CSV file, except for one item that I know is commonly repeated.

The data file has a header row, but its values are not duplicated in the data, so I'm not worried about code that accounts for the header being present.

Here is an illustration of what the data look like:

header,cat,Everquest,mermaid
1f,2r,7g,8c
xc,7f,66,rp
Kf,87,gH,||
hy,7f,&&,--
rr,2r,89,))
v6,2r,^&,!c
92,@r,hd,m
2r,2r,2r,2r
7f,7f,7f,7f
9,10,11,12
7f,2r,7f,7f
76,@r,88,u|

I am seeking the output:

7f
@r

since both of these values are duplicated in column two. As you can see, 2r is also duplicated, but I know it is commonly duplicated, so I just want to ignore it.

To be clear, I can't know the values of the duplicates other than the common one, which, in my real data files, is actually the word 'none'. It's '2r' above.

I read here that I can do something like

awk -F, ' ++A[$2] > 1 { print $2; exit 1 } ' input.file

However, I cannot figure out how to skip '2r', nor do I understand what ++A means.

I have read the awk manual, but I am afraid I find it a little confusing with respect to the question I am asking.

Additionally,

uniq -d 

looks promising based on a few other questions and answers, but I am still unsure how to skip over the value that I want to ignore.

Thank you in advance for your help.

Upvotes: 2

Views: 4783

Answers (2)

James Brown

Reputation: 37394

how to skip '2r':

$ awk -F, ' ++a[$2] == 2 && $2 != "2r" { print $2 } ' file
7f
@r

++a[$2] increments the count stored in an associative (hash) array under the key $2, i.e. it counts how many occurrences of each second-column value have been seen so far. Testing the pre-incremented count with == 2 fires exactly once per value, at its second occurrence, so each duplicate is printed only once.
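
If you also want to know how many times each duplicate occurs, one variant (a sketch, keeping the same 2r placeholder for the real-world 'none') collects the counts and reports them in an END block. Note that for (v in a) iterates in no particular order:

$ awk -F, ' $2 != "2r" { a[$2]++ } END { for (v in a) if (a[v] > 1) print v, a[v] } ' file
7f 3
@r 2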

Upvotes: 4

l0b0

Reputation: 58778

  1. Get only the second column using cut -d, -f2
  2. sort
  3. uniq -d to get repeated lines
  4. grep -Fxv 2r to exclude a value (-x matches the whole line, so values that merely contain 2r are kept), or grep -Fxv -e foo -e bar … to exclude multiple values

In other words something like this:

cut -d, -f2 input.csv | sort | uniq -d | grep -Fxv 2r

Depending on the data, it might be faster to move the grep earlier in the pipeline, but you should verify that with some benchmarking.
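
For instance, filtering before the sort shrinks the input to the most expensive step; a sketch of the reordered pipeline, under the same assumptions as above:

cut -d, -f2 input.csv | grep -Fxv 2r | sort | uniq -d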

Upvotes: 1
