austin7923

Reputation: 95

Remove repeated string from every column

I am trying to remove duplicate comma-separated entries within each column of every row.

Input:
rs10993127  9:94266397-94266397,9:94266397-94266397 intron_variant,intron_variant,non_coding_transcript_variant ZNF169,ZNF169
rs11533012  9:94267817-94267817,9:94267817-94267817 intron_variant,intron_variant,non_coding_transcript_variant ZNF169,ZNF169

Desired output:
rs10993127 9:94266397-94266397 intron_variant,non_coding_transcript_variant ZNF169
rs11533012 9:94267817-94267817 intron_variant,non_coding_transcript_variant ZNF169

My code:
awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}'

Thank you!

Upvotes: 0

Views: 92

Answers (3)

Daemon Painter

Reputation: 3470

One-liner alternative, based on the assumption that only the first comma-separated entry of each field needs to be kept:

awk '{output="";for(f=1;f<=NF;f++){split($f,a,",");output=output" "a[1]}print output}'

output:

 rs10993127 9:94266397-94266397 intron_variant ZNF169
 rs11533012 9:94267817-94267817 intron_variant ZNF169

Known issue: the output has a leading space before the first field.
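The leading space can be avoided by assigning the first entry back to each field and letting awk rejoin the record with its output field separator. This is only a sketch of that idea, not part of the original one-liner; the function name keep_first_entry is made up for illustration, and it still keeps only the first entry of each field.

```shell
# keep_first_entry is a hypothetical helper name used only in this sketch.
keep_first_entry() {
  awk '{
    for (f = 1; f <= NF; f++) {
      split($f, a, ",")   # a[1] is the first comma-separated entry
      $f = a[1]           # assigning to a field rebuilds $0 with OFS
    }
    print
  }'
}

printf 'rs10993127  9:94266397-94266397,9:94266397-94266397 ZNF169,ZNF169\n' | keep_first_entry
```

Because the record is rebuilt from its fields, runs of spaces between columns are also collapsed to a single space.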

Upvotes: -1

kvantour

Reputation: 26481

The method below does not assume that duplicates are consecutive:

awk '{ for(i=1;i<=NF;++i) { 
         n=split($i,a,",");
         for(j=1;j<=n;++j) {
            s = s (a[j] in b ? "" : (s ? "," : "")  a[j])
            b[a[j]]
         }
         $i=s; s=""; delete b
     }}1' file

Which returns the output:

rs10993127 9:94266397-94266397 intron_variant,non_coding_transcript_variant ZNF169
rs11533012 9:94267817-94267817 intron_variant,non_coding_transcript_variant ZNF169

The idea is to rebuild each field. Each field is split on commas with split, and the entries are stored in the array a. While rebuilding the field, an entry a[j] is appended to the new value s only if it has not been added already. That check uses the associative array b: the expression a[j] in b is true when a[j] has already been recorded as a key of b. Both s and b are reset after each field, so duplicates are removed within a field, not across fields.
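To see that the order of duplicates does not matter, the same logic can be tried on a line where the repeated entries are not adjacent. The sketch below restates the approach with longer variable names; the function name dedupe_fields, the array name seen, and the test line are illustrative assumptions, and split("", seen) is used in place of delete b as a way to empty the array.

```shell
# dedupe_fields is a hypothetical wrapper name used only in this sketch.
dedupe_fields() {
  awk '{
    for (i = 1; i <= NF; ++i) {
      n = split($i, a, ",")
      s = ""
      for (j = 1; j <= n; ++j) {
        if (!(a[j] in seen)) {                # first time this entry appears in the field
          s = s (s == "" ? "" : ",") a[j]     # append, with a comma after the first entry
          seen[a[j]]                          # referencing the key records it
        }
      }
      $i = s
      split("", seen)                         # empty the array before the next field
    }
  } 1'
}

# Made-up line with non-consecutive duplicates in columns 2 and 3.
printf 'id1 b,a,b,a x,y,x\n' | dedupe_fields
```

For this input the function prints id1 b,a x,y, keeping the first occurrence of each entry in its original position.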

Upvotes: 2

Sundeep

Reputation: 23667

With GNU sed and other implementations that support \b

$ sed -E 's/\b([^,]+),\1\b/\1/g' ip.txt
rs10993127  9:94266397-94266397 intron_variant,non_coding_transcript_variant ZNF169
rs11533012  9:94267817-94267817 intron_variant,non_coding_transcript_variant ZNF169
  • ([^,]+) matches one or more non-comma characters and captures them
  • ,\1 matches a comma followed by the same text that ([^,]+) captured
  • \1 in the replacement keeps a single copy of that text

Word boundaries are needed to avoid partial matches, for example:

$ echo 'a bc,bcd 123,23' | sed -E 's/([^,]+),\1/\1/g'
a bcd 123
$ echo 'a bc,bcd 123,23' | sed -E 's/\b([^,]+),\1\b/\1/g'
a bc,bcd 123,23

If the column content can start/end with non-word characters like : then the above solution will not work if there are partial matches.
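A small illustration of that limitation, assuming GNU sed and a made-up input in which the second column holds the two different entries ab: and ab:c. Since : is a non-word character, \b still matches between : and c, so the regex treats ab:c as a repeat of ab:.

```shell
# Made-up input: "ab:" and "ab:c" are different entries and should both be kept.
out=$(echo 'x ab:,ab:c' | sed -E 's/\b([^,]+),\1\b/\1/g')
echo "$out"
```

With GNU sed this prints x ab:c, so the entry ab: is lost. For data like this, a field-aware approach that compares whole entries is the safer choice.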

Upvotes: 0
