Print rows whose first field appears exactly twice in the file

Question

I have a file like this:

91052011868;Export Equi_Fort Postal;EXPORT;23/02/2015;1;0;0
91052011868;Sof_equi_Fort_Email_am_%yyyy%%mm%%dd%;EMAIL;19/02/2015;1;0;0
91052011868;Sof_trav_Fort_Email_am_%yyyy%%mm%%dd%;EMAIL;19/02/2015;1;0;0
91052151371;Export Trav_faible temoin;EXPORT;12/02/2015;1;0;0
91052182019;Export Deme_fort temoin;EXPORT;24/02/2015;1;0;0
91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0
91052262558;Sof_deme_faible_Email_am;EMAIL;26/01/2015;1;0;1
91052265940;Sof_trav_Faible_Email_am_%yyyy%%mm%%dd%;EMAIL;13/02/2015;1;0;0
91052265940;Sof_trav_Faible_Email_Relance_am_%yyyy%%mm%%dd%;EMAIL;17/02/2015;1;0;0
91052265940;Sof_voya_Faible_Email_am_%yyyy%%mm%%dd%;EMAIL;13/02/2015;1;0;0
91052265940;Sof_voya_Faible_Email_Relance_am_%yyyy%%mm%%dd%;EMAIL;16/02/2015;1;0;0
91052531428;Export Trav_faible temoin;EXPORT;11/02/2015;1;0;0
91052547697;Export Deme_Faible Postal;EXPORT;27/02/2015;1;0;0
91052562398;Export Deme_faible temoin;EXPORT;18/02/2015;1;0;0

I want to print all the lines whose first-column value occurs more than once but strictly fewer than 3 times in the file (that is, exactly twice). The expected output is:

91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0

I tried the command below, but it doesn't work:

 sort file | awk 'NR==FNR{a[$1]++;next;}{ if (a[$1] > 0 && a[$1] <1 )print $0;}' file file

Why?

fedorqui · Accepted Answer

If what you want is to print all those lines whose first field appears twice, you can use this:

$ awk -F";" 'FNR==NR{a[$1]++; next} a[$1]==2' file file
91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0

This sets the field separator to the semicolon and then reads the file twice:

- the first time, to count how many times the first field appears (a[$1]++);
- the second time, to print the lines matching the condition a[$1]==2, that is, those whose first field appears exactly twice in the file.
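
To make the first pass concrete, here is a small sketch that only does the counting and dumps the array at the end. It uses the first seven rows of the question's file, written to a temporary file (the name `sample` is just for this illustration), and pipes through `sort` because the order of `for (k in a)` is unspecified in awk.

```shell
# Illustration only: a temporary copy of the first rows from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
91052011868;Export Equi_Fort Postal;EXPORT;23/02/2015;1;0;0
91052011868;Sof_equi_Fort_Email_am_%yyyy%%mm%%dd%;EMAIL;19/02/2015;1;0;0
91052011868;Sof_trav_Fort_Email_am_%yyyy%%mm%%dd%;EMAIL;19/02/2015;1;0;0
91052151371;Export Trav_faible temoin;EXPORT;12/02/2015;1;0;0
91052182019;Export Deme_fort temoin;EXPORT;24/02/2015;1;0;0
91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0
EOF

# First pass only: count each first field, then print "key count" pairs.
counts=$(awk -F';' '{ a[$1]++ } END { for (k in a) print k, a[k] }' "$sample" | sort)
printf '%s\n' "$counts"

rm -f "$sample"
```

Only 91052199517 ends up with a count of 2, which is why the second pass prints just its two lines.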

If you wanted the lines whose first field appears between 2 and 4 times, you could use this condition as the second block instead:

a[$1]>=2 && a[$1]<=4
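
As a rough sketch, the range condition dropped into the full command could look like this. The data is made up for the example (keys a to d appearing 1, 2, 3 and 5 times), and the temporary file name is an assumption of this illustration.

```shell
# Hypothetical data: key a appears once, b twice, c three times, d five times.
sample=$(mktemp)
printf '%s\n' 'a;1' 'b;1' 'b;2' 'c;1' 'c;2' 'c;3' \
              'd;1' 'd;2' 'd;3' 'd;4' 'd;5' > "$sample"

# Same two-pass command, with the range test as the second block.
in_range=$(awk -F';' 'FNR==NR { a[$1]++; next } a[$1]>=2 && a[$1]<=4' "$sample" "$sample")
printf '%s\n' "$in_range"

rm -f "$sample"
```

With this input, only the b and c lines are printed: a occurs too rarely and d too often.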

Why wasn't your approach working?

Because your condition says:

if (a[$1] > 0 && a[$1] <1 )

which can never be true: a[$1] is an integer count, and no integer is greater than 0 and less than 1.
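
A quick way to see this is to run the original condition next to one with the bounds the question describes (greater than 1, strictly less than 3). This is only a sketch on made-up data. The second command also adds -F";", since without it awk splits on whitespace and $1 is not the first semicolon-separated column. The `sort file |` part is left out because awk does not read standard input when it is given file names.

```shell
# Hypothetical data: key 1 appears once, 2 twice, 3 three times.
sample=$(mktemp)
printf '%s\n' '1;a' '2;a' '2;b' '3;a' '3;b' '3;c' > "$sample"

# The question's condition: no count is both > 0 and < 1, so nothing prints.
original=$(awk 'NR==FNR { a[$1]++; next } { if (a[$1] > 0 && a[$1] < 1) print $0 }' "$sample" "$sample")

# Bounds as described in the question, plus the semicolon separator.
fixed=$(awk -F';' 'NR==FNR { a[$1]++; next } { if (a[$1] > 1 && a[$1] < 3) print $0 }' "$sample" "$sample")

printf 'original: [%s]\n' "$original"
printf 'fixed:\n%s\n' "$fixed"

rm -f "$sample"
```

The first command prints nothing, while the second prints the two lines whose key is 2.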

Note that my proposed solution uses the same idea, just in a more idiomatic way: there is no need for an explicit if, nor for print $0, because printing the current line is exactly what awk does by default when a condition evaluates to true.
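
To illustrate that equivalence, the sketch below runs both spellings on the same made-up input and compares the results. The variable names and the temporary file are assumptions of this example.

```shell
# Hypothetical data: x appears once, y twice, z three times.
sample=$(mktemp)
printf '%s\n' 'x;1' 'y;1' 'y;2' 'z;1' 'z;2' 'z;3' > "$sample"

# Explicit form: an action block with if and print $0.
explicit=$(awk -F';' 'FNR==NR { a[$1]++; next } { if (a[$1] == 2) print $0 }' "$sample" "$sample")

# Idiomatic form: a bare pattern; awk's default action prints the line.
idiomatic=$(awk -F';' 'FNR==NR { a[$1]++; next } a[$1] == 2' "$sample" "$sample")

if [ "$explicit" = "$idiomatic" ]; then
    echo "same output"
fi

rm -f "$sample"
```

Both commands print the two y lines, so the shorter pattern-only form loses nothing.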
