Reputation: 19
I am trying to deduplicate the rows of a text file with Awk. Among duplicated lines, I want to keep one whose field "f4" is non-empty; a line with a blank "f4" should be kept only when all of its duplicates have a blank "f4".
Input_File
f1|f2|f3|f4|f5
aa|bb|cc||ee
aa|bb|cc|dd|ee
aa|bb|cc|dd|ee
aa|bb|cc||ee
aaa|qq|ccc||eee
aaa|qq|ccc|zz|eee
aaa|qq|ccc|zz|eee
aaa|qq|ccc||eee
aaa|qq|ccc||eee
new|test|ccc||eee
new|test|ccc||eee
Output Needed
f2|f4
bb|dd
qq|zz
test|
Code tried (not working - getting wrong output):
awk ' BEGIN { FS=OFS="|" }
{ if ( !seen[$2, $3]++ ) print $2, $4 } ' Input_File
Wrong Output
f2|f4
bb|
qq|
test|
Upvotes: 1
Views: 69
Reputation: 133610
EDIT: Since the OP changed the question, I am adding a new answer. For each value of the 2nd field, this checks whether any of its lines has a non-empty 4th field. If so, it prints that value once with the 4th field; if none of its occurrences has a 4th field, it prints the value with an empty 4th field.
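awk '
BEGIN{
  FS=OFS="|"
}
# First pass: store a non-empty 4th field for each 2nd field, if one exists.
FNR==NR{
  if(!a[$2]){
    a[$2]=$4
  }
  next
}
# Second pass: print the first line whose 4th field equals the stored value.
($2 in a) && $4==a[$2]{
  print $2,$4
  delete a[$2]   # ensures each 2nd-field value is printed only once
}' Input_file Input_file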
Output will be as follows.
f2|f4
bb|dd
qq|zz
test|
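The same rule can also be applied while reading the file only once. The following is a minimal sketch, not part of the original answer: the array names `val` and `order` and the shell variable `result` are made up for illustration, and the sample data is copied from the question into a scratch file named `Input_file`.

```shell
# Sample data from the question, written to a scratch file.
cat > Input_file <<'EOF'
f1|f2|f3|f4|f5
aa|bb|cc||ee
aa|bb|cc|dd|ee
aa|bb|cc|dd|ee
aa|bb|cc||ee
aaa|qq|ccc||eee
aaa|qq|ccc|zz|eee
aaa|qq|ccc|zz|eee
aaa|qq|ccc||eee
aaa|qq|ccc||eee
new|test|ccc||eee
new|test|ccc||eee
EOF

result=$(awk '
BEGIN { FS = OFS = "|" }
!($2 in val) {            # first time this 2nd-field value is seen
    order[++n] = $2       # remember first-appearance order
    val[$2] = $4          # may still be empty at this point
    next
}
val[$2] == "" {           # no non-empty 4th field stored so far
    val[$2] = $4
}
END {                     # print one line per key, in input order
    for (i = 1; i <= n; i++)
        print order[i], val[order[i]]
}' Input_file)

printf '%s\n' "$result"
```

On the sample input this should print the same four lines as above. The trade-off is that nothing is printed until the whole file has been read, and one entry per distinct 2nd-field value is held in memory.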
Could you please try the following.
awk 'BEGIN{FS=OFS="|"} $2 && $4{print $2,$4}' Input_file
The above checks the 2nd and 4th fields and prints the line only if both are non-empty. If you want to check only the 4th field, change it to:
awk 'BEGIN{FS=OFS="|"} $4{print $2,$4}' Input_file
If you want to remove duplicates and also require a non-empty 4th column, use the following:
awk 'BEGIN{FS=OFS="|"} $4 && !a[$4]++{print $2,$4}' Input_file
Upvotes: 0
Reputation: 104024
You can do:
awk 'BEGIN{FS=OFS="|"}
$4 {print $2,$4}' file
To add the dedup logic:
awk 'BEGIN{FS=OFS="|"}
$4 && seen[$2]++<1 {print $2,$4}' file
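As a quick check, the dedup command can be run against the sample data from the question. This is only an illustrative sketch; the scratch file name `file` and the shell variable `result` are assumptions made for the example.

```shell
# Sample data from the question, written to a scratch file.
cat > file <<'EOF'
f1|f2|f3|f4|f5
aa|bb|cc||ee
aa|bb|cc|dd|ee
aa|bb|cc|dd|ee
aa|bb|cc||ee
aaa|qq|ccc||eee
aaa|qq|ccc|zz|eee
aaa|qq|ccc|zz|eee
aaa|qq|ccc||eee
aaa|qq|ccc||eee
new|test|ccc||eee
new|test|ccc||eee
EOF

# Same command as above; the result is captured so it can be inspected.
result=$(awk 'BEGIN{FS=OFS="|"}
$4 && seen[$2]++<1 {print $2,$4}' file)

printf '%s\n' "$result"
```

On this input the command prints `f2|f4`, `bb|dd` and `qq|zz`. The `test|` row from the needed output does not appear, because the `$4` condition skips every line whose 4th field is empty. A 4th field holding the value `0` would be skipped as well, since Awk treats a numeric-looking zero as false.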
Upvotes: 1