Reputation: 321
I have a input data like:
chr17 41243232 41243373 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23
chr13 32954112 32954208 BRCA2_ex24
And I need to check for duplicates rows $2
and $3
row, if is duplicated, I need to merged into one line and $4
column print as comma separated.
Output:
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24
Is there any AWK solution to easy process this kind a data? I would appreciate explained solution. Input and output are tab-separated formats. NOTE: First, second and third fields are allays equal.
My try was:
awk -v OFS="\t" '{i=$2 FS $1 FS $3 FS $4} {a[i]=!a[i]?$4:a[i] "," $4} END {for (l in a) {print l,a[l]}}' infile
Thank you for any ideas.
Upvotes: 1
Views: 97
Reputation: 203089
$ cat tst.awk
{
curr = $2 FS $3
if (curr == prev) {
buf = buf "," $NF
}
else {
if (NR>1) {
print buf
}
buf = $0
}
prev = curr
}
END { print buf }
$ awk -f tst.awk file
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24
The differences between this and @JamesBrown's solution are:
Upvotes: 1
Reputation: 23667
if perl
is okay:
$ cat ip.txt
chr17 41243232 41243373 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23
chr13 32954112 32954208 BRCA2_ex24
$ perl -ale '$k = join "\t",@F[0..2]; $h{$k} .= $h{$k} ? ",$F[3]" : $F[3]; END{ print "$_\t$h{$_}" foreach (keys %h) }' ip.txt
chr17 41243639 41243811 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24
-ale
split input line on spaces and save to @F
array, strip newlines from input line and add newline for print statements$k = join "\t",@F[0..2]
key to work with - first 3 elements joined by tab
$h{$k} .= $h{$k} ? ",$F[3]" : $F[3]
append values to hash variable, add ,
depending on existing value is empty or notEND{ print "$_\t$h{$_}" foreach (keys %h) }
after all lines are processed, print key and value separated by tab
. Order of keys is randomAlternate way by extracting key, value using regex:
$ perl -nle '($k,$v)=/^(.*?)\s+(\S+)$/; $h{$k} .= $h{$k} ? ",$v" : $v; END{print "$_\t$h{$_}" foreach (keys %h) }' ip.txt
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24
chr17 41243639 41243811 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
Upvotes: 1
Reputation: 9571
Just replace the first assignment with
i=$1 FS $2 FS $3
And possibly filter the output through sed
to replace spaces with tabs:
... | sed 's/ / /g'
space---^ ^--- TAB
Output:
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24
chr17 41243639 41243811 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243471 41243644 BRCA1_ex11
Upvotes: 1
Reputation: 37394
$ cat script.awk
{
a[$2 OFS $3] = $1 # store $1, last instance
b[$2 OFS $3] = b[$2 FS $3] $4 "," # append the $4s
}
END {
for (i in a) { # order is awk default
sub(/,$/, "", b[i]) # remove trailing ","
print a[i], i, b[i] # print
}
}
Run:
$ awk -f script.awk infile
chr17 41243471 41243644 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24
Upvotes: 2