EKarl

Reputation: 149

bash/awk: Getting largest value per cell

I have a tab-separated fileA that looks like this:

seqnameAa_len_240                     seqnameBa_len_247
seqnameAb_len_881                     seqnameBb_len_719
seqnameAc_len_736,seqnameAd_len_640   seqnameBc_len_489
seqnameAe_len_241                     seqnameBd_len_302,seqnameBe_len_465
seqnameAf_len_436,seqnameAf_len_620   seqnameBf_len_452,seqnameBg_len_435

Sequences on the left are from one dataset and sequences on the right are from another. Each row represents one group of similar sequences. In some cases, more than one sequence from one or both datasets belongs to the same group, shown as several comma-separated sequences in one column.

For each row, I would like to find the largest value for each of the two datasets, giving the following output:

240    247
881    719
736    489
241    465
620    452

I thought about making a for loop over all the rows and, for each row, replacing the commas with newlines, removing all the text to keep just the numbers, and selecting the largest value per column with awk. But with my current bash/awk knowledge that would have to be done column-wise, and since there is no fixed number of comma-separated entries per cell, I am not sure how to do that.

Is there a simpler way of getting the above output from fileA?

Upvotes: 0

Views: 61

Answers (3)

Ed Morton

Reputation: 203324

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        split($fldNr,fldArr,/,/)
        for (sfNr=1; sfNr in fldArr; sfNr++) {
            sub(/.*_/,"",fldArr[sfNr])
            val = fldArr[sfNr] + 0    # force a numeric, not string, comparison
            max = ( (sfNr==1)||(val>max) ? val : max)
        }
        $fldNr = max
    }
    print
}

$ awk -f tst.awk fileA
240     247
881     719
736     489
241     465
620     452

Upvotes: 1

Wintermute

Reputation: 44023

I'd use some gawk trickery to achieve this without manual splitting:

gawk -F , -v RS='[\t\n]' '{ m = 0; for(i = 1; i <= NF; ++i) { sub(/.*_/, "", $i); if($i + 0 > m) { m = $i + 0 } } printf "%s%s", m, RT }' fileA

The trick is to use tabs and newlines as record separators, so that a record is no longer a line but what would otherwise be a field (such as seqnameAf_len_436,seqnameAf_len_620). Because of -F , the fields $1, $2 and so forth are then the comma-delimited subfields. Spelled out, the script is:

{
  m = 0
  for(i = 1; i <= NF; ++i) { # walk through the (comma-delimited) fields
    sub(/.*_/, "", $i)       # isolate the number
    if($i + 0 > m) {         # find the maximum (+ 0 forces a numeric comparison)
      m = $i + 0
    }
  }
  printf "%s%s", m, RT       # and print it with the same record terminator
                             # that was in the input (tab or newline)
}

Both the use of regexes as record separator and RT are gawk-specific.
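
If gawk is not available, the same per-cell maximum can be sketched with plain split(), which should work in any POSIX awk. This is only a rough sketch: the function name max_per_cell is made up for illustration, and it assumes every entry ends in _<number>, as in the question's sample.

```shell
# max_per_cell: hypothetical helper name, not taken from the answers above.
# Reads tab-separated input (stdin or file arguments) and prints, for each
# cell, the largest number found after the last underscore of its entries.
max_per_cell() {
    awk '
        BEGIN { FS = OFS = "\t" }
        {
            for (i = 1; i <= NF; i++) {
                n = split($i, parts, ",")     # comma-separated entries in this cell
                max = ""
                for (j = 1; j <= n; j++) {
                    val = parts[j]
                    sub(/.*_/, "", val)       # keep only the text after the last "_"
                    val += 0                  # force a numeric comparison
                    if (j == 1 || val > max) max = val
                }
                $i = max
            }
            print
        }
    ' "$@"
}

# Example with numbers of different widths:
printf 'seqA_len_9,seqB_len_1000\tseqC_len_302,seqD_len_465\n' | max_per_cell
```

The example line prints 1000 and 465 separated by a tab. For the file in the question it would be called as max_per_cell fileA.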

Upvotes: 0

glenn jackman

Reputation: 246774

perl -MList::Util=max -lane '
    print max($F[0] =~ /\d+/g), "\t", max($F[1] =~ /\d+/g)
' fileA

Upvotes: 0
