Reputation: 149
I have a tab-separated fileA that looks like this:
seqnameAa_len_240 seqnameBa_len_247
seqnameAb_len_881 seqnameBb_len_719
seqnameAc_len_736,seqnameAd_len_640 seqnameBc_len_489
seqnameAe_len_241 seqnameBd_len_302,seqnameBe_len_465
seqnameAf_len_436,seqnameAf_len_620 seqnameBf_len_452,seqnameBg_len_435
Sequences on the left are from one dataset and sequences on the right are from another. Each row represents one group of similar sequences. In some cases, a group contains more than one sequence from one or both datasets, shown as several comma-separated sequences in one column.
For each row, I would like to find the largest value for each of the two datasets, giving the following output:
240 247
881 719
736 489
241 465
620 452
I thought about making a for loop over all the rows and, for each row, replacing the commas with newlines, removing all the text to keep just the numbers, and selecting the largest value per column with awk. But with my current bash/awk knowledge that would have to be done column by column, and since there is no fixed number of comma-separated entries per cell, I am not sure how to do that.
Is there a simpler way of getting the above output from fileA?
Upvotes: 0
Views: 61
Reputation: 203324
$ cat tst.awk
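BEGIN { FS=OFS="\t" }
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        # split the tab-separated field into its comma-separated subfields
        split($fldNr,fldArr,/,/)
        for (sfNr=1; sfNr in fldArr; sfNr++) {
            sub(/.*_/,"",fldArr[sfNr])   # strip everything up to the last "_"
            val = fldArr[sfNr] + 0       # sub() leaves a string; force a number
            max = ( (sfNr==1)||(val>max) ? val : max)
        }
        $fldNr = max                     # replace the field with its maximum
    }
    print
}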
$ awk -f tst.awk file
240 247
881 719
736 489
241 465
620 452
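A note on the comparison: after sub() the value is a plain string in awk, so two such values are compared character by character unless one side is forced numeric with + 0. All lengths in the sample have three digits, so this does not show up there. A minimal sketch with made-up names and values (99 and 100 are not from fileA, and compare_demo is just an illustrative function name) shows the difference:

```shell
# Hypothetical input: one field holding two names whose lengths differ in digit count.
compare_demo() {
    printf 'x_len_99,y_len_100\n' | awk '{
        n = split($0, a, ",")
        for (i = 1; i <= n; i++) sub(/.*_/, "", a[i])  # a[i] is now a string
        s = (a[1] > a[2]) ? a[1] : a[2]                # string comparison
        m = (a[1]+0 > a[2]+0) ? a[1] : a[2]            # numeric comparison
        print "string max: " s
        print "numeric max: " m
    }'
}
compare_demo
```

The string comparison picks 99 because "9" sorts after "1", while the numeric one picks 100.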
Upvotes: 1
Reputation: 44023
I'd use some gawk trickery to achieve this without manual splitting:
gawk -F , -v RS='[\t\n]' '{ m = 0; for(i = 1; i <= NF; ++i) { sub(/.*_/, "", $i); if($i + 0 > m) { m = $i + 0 } } printf "%s%s", m, RT }' fileA
The trick is to use tabs and newlines as record separators, so that a record is no longer a line but what would otherwise be a field (such as seqnameAf_len_436,seqnameAf_len_620), and the fields $1, $2 and so forth are the comma-delimited subfields (because of -F ,). Then
{
    m = 0
    for(i = 1; i <= NF; ++i) {    # walk through the (comma-delimited) fields
        sub(/.*_/, "", $i)        # isolate the number
        if($i + 0 > m) {          # find the maximum (+ 0 forces a numeric
            m = $i + 0            # comparison; sub() leaves a string)
        }
    }
    printf "%s%s", m, RT          # and print it with the same record terminator
                                  # that was in the input (tab or newline)
}
Both the use of a regex as record separator and RT are gawk-specific.
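To see how the records and fields end up being split, here is a small sketch on a made-up two-column line (the names a_len_1 etc. and the function show_records are only for illustration). It needs gawk, so it just prints a message and returns if gawk is not installed:

```shell
# Hypothetical input: "a_len_1,b_len_2<TAB>c_len_3<NEWLINE>"
show_records() {
    if ! command -v gawk >/dev/null 2>&1; then
        echo "gawk not found, skipping" >&2
        return 0
    fi
    printf 'a_len_1,b_len_2\tc_len_3\n' |
    gawk -F , -v RS='[\t\n]' '{
        # each tab- or newline-terminated chunk is one record;
        # RT holds the separator that actually ended it
        printf "record %d: NF=%d RT=%s\n", NR, NF, (RT == "\t" ? "tab" : "newline")
    }'
}
show_records
```

With gawk available, the first record has two comma-delimited fields and ends at the tab, and the second has one field and ends at the newline.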
Upvotes: 0
Reputation: 246774
perl -MList::Util=max -lane '
print max($F[0] =~ /\d+/g), "\t", max($F[1] =~ /\d+/g)
' fileA
Upvotes: 0