Reputation: 155
I am hoping for a line or two of code for a bash script to find and print repeated items in a column of a 2.5 GB CSV file, except for one item that I know is commonly repeated.
The data file has a header, but it is not duplicated, so I'm not worried about code that accounts for the header being present.
Here is an illustration of what the data look like:
header,cat,Everquest,mermaid
1f,2r,7g,8c
xc,7f,66,rp
Kf,87,gH,||
hy,7f,&&,--
rr,2r,89,))
v6,2r,^&,!c
92,@r,hd,m
2r,2r,2r,2r
7f,7f,7f,7f
9,10,11,12
7f,2r,7f,7f
76,@r,88,u|
I am seeking the output:
7f
@r
as both of these are duplicated in column two. As you can see, 2r is also duplicated, but it is commonly duplicated and I know it, so I just want to ignore it.
To be clear, I can't know the values of the duplicates other than the common one, which, in my real data files, is actually the word 'none'. It's '2r' above.
I read here that I can do something like
awk -F, ' ++A[$2] > 1 { print $2; exit 1 } ' input.file
However, I cannot figure out how to skip '2r', nor do I understand what ++A means.
I have read the awk manual, but I am afraid I find it a little confusing with respect to the question I am asking.
Additionally,
uniq -d
looks promising based on a few other questions and answers, but I am still unsure how to skip over the value that I want to ignore.
Thank you in advance for your help.
Upvotes: 2
Views: 4783
Reputation: 37394
How to skip '2r':
$ awk -F, ' ++a[$2] == 2 && $2 != "2r" { print $2 } ' file
7f
@r
++a[$2]
creates an entry for the second-column value in the associative array a (if one doesn't exist yet) and increments its count by 1, i.e. it counts how many occurrences of each value in the second column have been seen so far. Comparing the count against 2 means each duplicated value is printed exactly once, the second time it appears.
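If it helps to see the counter in action, here is a minimal sketch (the input values are made up for illustration):
$ printf '%s\n' a b a a | awk '++count[$1] == 2 { print $1 }'
a
Because ++ is a pre-increment, the count is bumped before the comparison, so a value is printed on its second occurrence and never again.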
Upvotes: 4
Reputation: 58778
Use
cut -d, -f2
to select the second column, then
sort
and
uniq -d
to get repeated lines, and
grep -Fv 2r
to exclude a value, or
grep -Fv -e foo -e bar …
to exclude multiple values. In other words, something like this:
cut -d, -f2 input.csv | sort | uniq -d | grep -Fv 2r
Depending on the data, it might be faster to move grep earlier in the pipeline, but you should verify that with some benchmarking.
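For instance, a variant that filters before sorting might look like this (a sketch; it assumes the ignored value doesn't occur as a substring of values you want to keep, otherwise add -x to grep for whole-line matching):
$ cut -d, -f2 input.csv | grep -Fv 2r | sort | uniq -d
Filtering first means sort has less data to handle, which can matter on a 2.5 GB file.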
Upvotes: 1