17th Lvl Botanist

Reputation: 155

Unix: Find duplicate occurrences in column in csv file, omit one possible value

I am hoping for a line or two of code for a bash script that finds and prints repeated items in a column of a 2.5 GB CSV file, except for one item that I know is commonly repeated.

The data file has a header row, but its values are not duplicated in the data, so I'm not worried about code that accounts for the header being present.

Here is an illustration of what the data look like:

header,cat,Everquest,mermaid
1f,2r,7g,8c
xc,7f,66,rp
Kf,87,gH,||
hy,7f,&&,--
rr,2r,89,))
v6,2r,^&,!c
92,@r,hd,m
2r,2r,2r,2r
7f,7f,7f,7f
9,10,11,12
7f,2r,7f,7f
76,@r,88,u|

I am seeking the output:

7f
@r

since both of these values are duplicated in column two. As you can see, 2r is also duplicated, but I know it is commonly duplicated, so I just want to ignore it.

To be clear, I can't know the values of the duplicates other than the common one, which, in my real data files, is actually the word 'none'. It's '2r' above.

I read here that I can do something like

awk -F, ' ++A[$2] > 1 { print $2; exit 1 } ' input.file

However, I cannot figure out how to skip '2r', nor do I understand what ++A means.

I have read the awk manual, but I am afraid I find it a little confusing with respect to the question I am asking.

Additionally,

uniq -d 

looks promising based on a few other questions and answers, but I am still unsure how to skip over the value that I want to ignore.

Thank you in advance for your help.

Upvotes: 2

Views: 4783

Answers (2)

James Brown

Reputation: 37394

how to skip '2r':

$ awk -F, ' ++a[$2] == 2 && $2 != "2r" { print $2 } ' file
7f
@r

++a[$2] increments the count stored in an associative (hash) array under the key $2, i.e. it counts how many occurrences of each second-column value have been seen so far. Testing the pre-incremented count with == 2 fires exactly once per value, at its second occurrence, so each duplicate is printed only once.
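
If you also want to know how many times each duplicate occurs, one variant (a sketch, keeping the same 2r placeholder for the real-world 'none') collects the counts and reports them in an END block. Note that for (v in a) iterates in no particular order:

$ awk -F, ' $2 != "2r" { a[$2]++ } END { for (v in a) if (a[v] > 1) print v, a[v] } ' file
7f 3
@r 2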

Upvotes: 4

l0b0

Reputation: 58778

  1. Get only the second column using cut -d, -f2
  2. sort
  3. uniq -d to get repeated lines
  4. grep -Fxv 2r to exclude a value (-x matches the whole line, so values that merely contain 2r are kept), or grep -Fxv -e foo -e bar … to exclude multiple values

In other words something like this:

cut -d, -f2 input.csv | sort | uniq -d | grep -Fxv 2r

Depending on the data, it might be faster to move the grep earlier in the pipeline, but you should verify that with some benchmarking.
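
For instance, filtering before the sort shrinks the input to the most expensive step; a sketch of the reordered pipeline, under the same assumptions as above:

cut -d, -f2 input.csv | grep -Fxv 2r | sort | uniq -d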

Upvotes: 1
