Shreya

Reputation: 649

Find the repeated words in a column and delete them

I have a file with this data:

AND (CP),(D),(SE),(SI),(CP),(D),(SE),(SI)            (Q),(Q)    1
OR  (CP),(D),(E),(SE),(SI),(CP),(D),(E),(SE),(SI)    (Q),(Q)    1
DFF (CP),(D),(E),(CP),(D),(E)                        (QN),(QN)  1

I want the output to be:

AND (CP),(D),(SE),(SI)          (Q)  1
OR  (CP),(D),(E),(SE),(SI)      (Q)  1
DFF (CP),(D),(E)                (QN) 1

I want to delete the repeated terms in columns 2 and 3. For example, in column 2 of the first line, CP, D, SE and SI appear a second time, so the second occurrence should be deleted. Likewise, in column 3, Q is repeated, so the repeat should be deleted.

I tried this with awk:

awk '!seen[$2]++' file 

But I get the error: can't find [

Upvotes: 0

Views: 81

Answers (3)

choroba

Reputation: 241918

If the repeated part is always exactly the same and it's repeated twice, you can use sed:

sed -E 's/ (.+),\1 / \1 /g'
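
As a rough check of that condition, the sketch below runs the same sed command on two variants of the first sample line. The wrapper name dedup_doubled is hypothetical and only there to keep the example short. One thing it seems to show: each match consumes the space that follows the doubled list, so the command appears to rely on columns being separated by at least two spaces, as they are in the aligned sample data.

```shell
# Hypothetical helper wrapping the sed command from this answer.
dedup_doubled() {
  sed -E 's/ (.+),\1 / \1 /g'
}

# Two spaces between columns: both doubled lists are collapsed.
wide=$(printf '%s\n' 'AND (CP),(D),(SE),(SI),(CP),(D),(SE),(SI)  (Q),(Q)  1' | dedup_doubled)

# One space between columns: the first match consumes the space in front of
# (Q),(Q), so that column is left as it was.
narrow=$(printf '%s\n' 'AND (CP),(D),(SE),(SI),(CP),(D),(SE),(SI) (Q),(Q) 1' | dedup_doubled)

printf '%s\n%s\n' "$wide" "$narrow"
```

If the columns might be separated by a single space, one of the awk answers is probably the safer choice.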

Upvotes: 2

RavinderSingh13

Reputation: 133538

Based on your shown samples, please try the following, written and tested in GNU awk. It defines a function named removeDup: pass it a comma-separated, double-quoted list of the field numbers whose duplicates you want removed, e.g. "2,3" for the 2nd and 3rd fields.

awk '
BEGIN{ s1="," }
function removeDup(fields){
  num=split(fields,fieldNum,",")
  for(k=1;k<=num;k++){
    delete arr1
    delete arrVal1
    val1=num1=""
    num1=split($fieldNum[k],arr1,",")
    for(i=1;i<=num1;i++){
      if(!arrVal1[arr1[i]]++){
        val1=(val1?val1 s1:"")arr1[i]
      }
    }
    $fieldNum[k]=val1
  }
}
{
  removeDup("2,3")
}
1
' Input_file

Explanation: a detailed, commented version of the above.

awk '                                   ##Starting awk program from here.
BEGIN{ s1="," }                         ##Setting s1 value to comma in BEGIN section.
function removeDup(fields){             ##Creating function removeDup passing fields to it.
  num=split(fields,fieldNum,",")        ##Splitting fields into fieldNum array here.
  for(k=1;k<=num;k++){                  ##Running for loop till value of num here.
    delete arr1                         ##Deleting arr1 here.
    delete arrVal1                      ##Deleting arrVal1 here.
    val1=num1=""                        ##Nullify val1 and num1 here.
    num1=split($fieldNum[k],arr1,",")   ##Splitting field(fieldNum value) into arr1 here.
    for(i=1;i<=num1;i++){               ##Running for loop till value of num1 here.
      if(!arrVal1[arr1[i]]++){          ##Checking condition: if the current arr1 value is NOT yet present in arrVal1 then do the following.
        val1=(val1?val1 s1:"")arr1[i]   ##Creating val1 here and keep on adding value to it.
      }
    }
    $fieldNum[k]=val1                   ##Assigning val1 as the current field value here.
  }
}
{
  removeDup("2,3")                      ##Calling removeDup function in main program with 2nd and 3rd field numbers passed to it.
}
1
' Input_file                            ##mentioning Input_file name here.
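
To illustrate how the field list changes the result, here is a minimal sketch of the same idea wrapped in a shell function. The wrapper name dedup_fields and the awk variable list are hypothetical, and the awk function is a condensed rewrite that keeps its working variables local, not the exact code above.

```shell
# Hypothetical wrapper: dedup_fields "2,3" < Input_file
dedup_fields() {
  awk -v list="$1" '
  # Names after the extra spaces are local variables.
  function removeDup(fields,   f, n, k, parts, m, i, seen, out) {
    n = split(fields, f, ",")            # field numbers to process
    for (k = 1; k <= n; k++) {
      split("", seen)                    # portable way to empty an array
      out = ""
      m = split($(f[k]), parts, ",")     # terms of the current field
      for (i = 1; i <= m; i++)
        if (!seen[parts[i]]++)           # keep only the first occurrence
          out = (out == "" ? "" : out ",") parts[i]
      $(f[k]) = out
    }
  }
  { removeDup(list) }
  1'
}

sample='DFF (CP),(D),(E),(CP),(D),(E) (QN),(QN) 1'
only3=$(printf '%s\n' "$sample" | dedup_fields "3")
both=$(printf '%s\n' "$sample" | dedup_fields "2,3")
printf '%s\n%s\n' "$only3" "$both"
```

With "3" only the third field is cleaned and the second keeps its repeats; with "2,3" both are cleaned.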

Upvotes: 1

anubhava

Reputation: 785276

You may use this awk:

awk 'function dedup(col,   a, seen, i, s) {split($col, a, /,/); s=""; for (i=1; i in a; ++i) if (!seen[a[i]]++) s = s (s == "" ? "" : ",") a[i]; $col=s;} {dedup(2); dedup(3)} 1' file | column -t

AND  (CP),(D),(SE),(SI)      (Q)   1
OR   (CP),(D),(E),(SE),(SI)  (Q)   1
DFF  (CP),(D),(E)            (QN)  1

Expanded form:

awk 'function dedup(col,   a, seen, i, s) {
   split($col, a, /,/)
   s = ""
   for (i=1; i in a; ++i)
      if (!seen[a[i]]++)
         s = s (s == "" ? "" : ",") a[i]
    $col = s
}
{
   dedup(2)
   dedup(3)
} 1' file | column -t

column -t is used only to align the output as a table.
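
The alignment needs restoring because assigning to a field makes awk rebuild the record with the output field separator, a single space by default, so the original padding between columns is lost. A small sketch of that behaviour, using an illustrative sample line and variable names that are not part of the answer:

```shell
# Columns separated by runs of spaces (illustrative sample line).
line='DFF (CP),(D),(E)      (QN)   1'

# No field is assigned, so awk prints the record unchanged.
untouched=$(printf '%s\n' "$line" | awk '1')

# Assigning any field (even to itself) rebuilds $0 with OFS.
rebuilt=$(printf '%s\n' "$line" | awk '{ $2 = $2 } 1')

printf '%s\n%s\n' "$untouched" "$rebuilt"
```

The second line comes out with single spaces between fields, which is the form that column -t then lines up again.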

Upvotes: 3
