Reputation: 95
I'm trying to remove duplicate comma-separated entries within each field of every row.
Input:
rs10993127 9:94266397-94266397,9:94266397-94266397 intron_variant,intron_variant,non_coding_transcript_variant ZNF169,ZNF169
rs11533012 9:94267817-94267817,9:94267817-94267817 intron_variant,intron_variant,non_coding_transcript_variant ZNF169,ZNF169
Desired output:
rs10993127 9:94266397-94266397 intron_variant,non_coding_transcript_variant ZNF169
rs11533012 9:94267817-94267817 intron_variant,non_coding_transcript_variant ZNF169
My code:
awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s%s",$i,FS)}{printf("\n")}'
Thank you!
Upvotes: 0
Views: 92
Reputation: 3470
One-liner alternative, assuming only the first comma-separated entry of each field needs to be kept:
awk '{output="";for(f=1;f<=NF;f++){split($f,a,",");output=output" "a[1]}print output}'
Output:
rs10993127 9:94266397-94266397 intron_variant ZNF169
rs11533012 9:94267817-94267817 intron_variant ZNF169
Known issue: the output has a leading space before the first field.
Upvotes: -1
Reputation: 26481
The method below does not assume that duplicates are consecutive:
awk '{ for(i=1;i<=NF;++i) {
         n=split($i,a,",");
         for(j=1;j<=n;++j) {
           s = s (a[j] in b ? "" : (s ? "," : "") a[j])
           b[a[j]]
         }
         $i=s; s=""; delete b
       }}1' file
Which returns the output:
rs10993127 9:94266397-94266397 intron_variant,non_coding_transcript_variant ZNF169
rs11533012 9:94267817-94267817 intron_variant,non_coding_transcript_variant ZNF169
The idea is to rebuild each field. Each field is split into its comma-separated entries using split, and the entries are stored in the array a. When rebuilding the field, we check whether an entry a[j] has already been added to the new value s of the field. This check is done by testing whether the current entry already exists as a key of the associative array b (a[j] in b). After each field, s is reset and b is deleted, so duplicates are only tracked within a single field.
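To illustrate the claim about non-consecutive duplicates, here is a minimal sketch of the same approach wrapped in a shell function. The name dedup_fields and the sample line are made up for illustration and are not part of the answer. The sketch uses split("", b) instead of delete b to clear the array, on the assumption that some older awk implementations do not accept delete on a whole array.

```shell
# dedup_fields is a hypothetical wrapper name, not from the answer above.
# It reads lines on stdin and removes repeated comma-separated entries
# inside each whitespace-separated field, keeping the first occurrence.
dedup_fields() {
  awk '{
    for (i = 1; i <= NF; ++i) {
      n = split($i, a, ",")            # entries of the current field
      s = ""                           # rebuilt value of the field
      for (j = 1; j <= n; ++j) {
        if (!(a[j] in b)) {            # entry not seen yet in this field
          s = s (s == "" ? "" : ",") a[j]
          b[a[j]]                      # referencing the key creates it
        }
      }
      $i = s                           # assigning a field rebuilds $0 with OFS
      split("", b)                     # clear the seen-set before the next field
    }
  } 1'
}

# Made-up sample line with duplicates that are not adjacent.
printf 'id1 x,y,x,z g1,g2,g1\n' | dedup_fields
# prints: id1 x,y,z g1,g2
```

Because assigning to $i rebuilds the record, the output fields are joined with a single space (the default OFS), whatever whitespace separated them in the input.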
Upvotes: 2
Reputation: 23667
With GNU sed and other implementations that support \b:
$ sed -E 's/\b([^,]+),\1\b/\1/g' ip.txt
rs10993127 9:94266397-94266397 intron_variant,non_coding_transcript_variant ZNF169
rs11533012 9:94267817-94267817 intron_variant,non_coding_transcript_variant ZNF169
([^,]+) matches one or more non-comma characters.
,\1 matches a comma followed by the text that was captured with ([^,]+).
\1 is also used in the replacement.
Word boundaries are needed to avoid partial matches, for example:
$ echo 'a bc,bcd 123,23' | sed -E 's/([^,]+),\1/\1/g'
a bcd 123
$ echo 'a bc,bcd 123,23' | sed -E 's/\b([^,]+),\1\b/\1/g'
a bc,bcd 123,23
If the column content can start or end with non-word characters such as : then \b no longer marks the edge of an entry, and the above solution will not work when partial matches are possible.
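For that case, one possible workaround is to compare whole entries as strings instead of relying on word boundaries. The following is only a sketch under that assumption: dedup_adjacent is a hypothetical name, the sample line is made up, and like the sed command it only removes duplicates that are next to each other.

```shell
# dedup_adjacent is a hypothetical helper name, not from the answer above.
# It collapses adjacent identical comma-separated entries in each field,
# comparing complete entries so that non-word characters do not matter.
dedup_adjacent() {
  awk '{
    for (i = 1; i <= NF; ++i) {
      n = split($i, a, ",")
      s = a[1]
      for (j = 2; j <= n; ++j)
        # Appending "" forces a string comparison, so 1 and 1.0 stay distinct.
        if ((a[j] "") != (a[j-1] "")) s = s "," a[j]
      $i = s
    }
  } 1'
}

# "a:" and "a:b" are different entries and are both kept;
# ":x,:x" is a real duplicate and is collapsed.
printf 'k a:,a:b :x,:x\n' | dedup_adjacent
# prints: k a:,a:b :x
```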
Upvotes: 0