Steve_A

Reputation: 19

Deduplicate Text File with Awk but keep the lines with the non-empty field

I am trying to deduplicate the rows of a text file with Awk. For each group of duplicated lines I want to keep a line whose field "f4" is non-empty (unless every line in that group has a blank "f4").

Input_File

f1|f2|f3|f4|f5
aa|bb|cc||ee
aa|bb|cc|dd|ee
aa|bb|cc|dd|ee
aa|bb|cc||ee
aaa|qq|ccc||eee
aaa|qq|ccc|zz|eee
aaa|qq|ccc|zz|eee
aaa|qq|ccc||eee
aaa|qq|ccc||eee
new|test|ccc||eee
new|test|ccc||eee

Output Needed

f2|f4
bb|dd
qq|zz
test|

Code tried (not working - getting wrong output):

awk ' BEGIN { FS=OFS="|" }
{ if ( !seen[$2, $3]++ ) print $2, $4 } '   Input_File

Wrong Output

f2|f4
bb|
qq|
test|

Upvotes: 1

Views: 69

Answers (2)

RavinderSingh13

Reputation: 133610

EDIT: Since the OP changed the question, I am adding a new answer. For each value of the 2nd field, this checks whether any of its lines has a non-empty 4th field. If so, it prints the first non-empty value once; if every occurrence has an empty 4th field, it prints the empty field instead.

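awk '
BEGIN{
  FS=OFS="|"
}
# 1st pass: remember the first non-empty 4th field seen for each 2nd field.
FNR==NR{
  if(!a[$2]){
    a[$2]=$4
  }
  next
}
# 2nd pass: print the first line whose 4th field matches the stored value,
# then delete the key so that later duplicates are skipped.
($2 in a) && $4==a[$2]{
  print $2,$4
  delete a[$2]
}'  Input_file  Input_file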

Output will be as follows.

f2|f4
bb|dd
qq|zz
test|
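
If reading the file twice is not an option (for example when the data arrives on a pipe), a single-pass variant might look like the sketch below. The function name dedupe_keep_f4 and the array names best and order are made up for illustration. It assumes the key is the 2nd field alone and that the whole set of keys fits in memory, since nothing is printed until END.

```shell
# Hypothetical helper: one pass over the input, keyed on field 2.
dedupe_keep_f4() {
  awk '
  BEGIN { FS = OFS = "|" }
  # First time a key appears: record its position and its 4th field.
  !($2 in best) { order[++n] = $2; best[$2] = $4 }
  # A later line with a non-empty 4th field replaces a stored blank one.
  best[$2] == "" && $4 != "" { best[$2] = $4 }
  # Print every key once, in first-seen order.
  END { for (i = 1; i <= n; i++) print order[i], best[order[i]] }
  ' "$@"
}
```

Called as dedupe_keep_f4 Input_file, this should give the same four lines as above for the sample data. Unlike the two-pass version, it compares against the empty string explicitly, so a 4th field of 0 counts as non-empty.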


Could you please try the following.

awk 'BEGIN{FS=OFS="|"} $2 && $4{print $2,$4}' Input_file

The above checks the 2nd and 4th fields and prints a line only if both are non-empty. If you want to check only the 4th field, change it to:

awk 'BEGIN{FS=OFS="|"} $4{print $2,$4}' Input_file

If you also want to remove duplicates (keyed on the 4th field's value) while requiring a non-empty 4th column, use the following.

awk 'BEGIN{FS=OFS="|"} $4 && !a[$4]++{print $2,$4}' Input_file

Upvotes: 0

dawg

Reputation: 104024

You can do:

awk 'BEGIN{FS=OFS="|"}
     $4 {print $2,$4}' file

To add the dedup logic:

awk 'BEGIN{FS=OFS="|"}
     $4 && seen[$2]++<1 {print $2,$4}' file
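
With the sample input, the dedup command above never prints test, because none of its lines has a non-empty f4. One way to cover that case while still printing matches as soon as they are found is sketched below. The names dedupe_stream_f4, seen and keys are assumptions for illustration, and keys whose f4 is always blank are only emitted at END, so they come last rather than in input order.

```shell
# Hypothetical helper: stream non-empty matches, defer blank-only keys to END.
dedupe_stream_f4() {
  awk 'BEGIN{FS=OFS="|"}
       # Remember each key once, initially marked as not yet printed.
       !($2 in seen) {seen[$2]=0; keys[++n]=$2}
       # The first non-empty 4th field for a key is printed immediately.
       $4 != "" && !seen[$2] {print $2,$4; seen[$2]=1}
       # Keys that never had a non-empty 4th field get an empty one.
       END {for (i=1; i<=n; i++) if (!seen[keys[i]]) print keys[i], ""}' "$@"
}
```

For the sample file, dedupe_stream_f4 file happens to match the requested output because test is also the last key in the input.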

Upvotes: 1
