Frog

Reputation: 39

How to subtract the values of a specific row from all the other rows' values?

My current working file looks like this:

ID   Time   A_in   Time  B_in  Time  C_in
Ax   0.1    10     0.1   15    0.1   45  
By   0.2    12     0.2   35    0.2   30  
Cz   0.3    20     0.3   20    0.3   15  
Fr   0.4    35     0.4   15    0.4   05  
Exp  0.5    10     0.5   25    0.5   10

My columns of interest are those with an "_in" header. In those columns, I want to subtract the value of the row that starts with ID "Exp" from all the other row values, keeping the absolute difference. Let's consider the A_in column, where the "Exp" row value is 10: subtracting it from the other elements gives |10-10| = 0 for Ax, |12-10| = 2 for By, and so on.

My amateur code is like this (I know it is silly)

# This part grabs all the values of the "Exp" row from the "_in" columns
Exp=$( awk 'BEGIN{OFS="\t";
            PROCINFO["sorted_in"] = "@val_num_asc"}
    FNR==1 { for (n=2;n<=NF;n++) { if ($n ~ /_in$/) cols[$n]=n; }}
    /^Exp/ {
           for (c in cols){
           shift = $cols[c]
           printf "%s ", shift
           }
       }

        ' File.txt | paste -sd " ")
Exp_array=($Exp)

z=1
for i in "${Exp_array[@]}"
do
z=$(echo 2+$z | bc -l)
Exp_point=$i
awk -v d="$Exp_point" -v loop="$z" '
            BEGIN{OFS="\t";
            PROCINFO["sorted_in"] = "@val_num_asc"}
            function abs(x) {return x<0?-x:x}
            FNR==1 { for (n=2;n<=NF;n++) { if ($n ~ /_in$/) cols[$n]=n; }}
        FNR>1{
            $loop=abs($loop-d); print
            }
         ' File.txt
done

My first desired outcome is this:

ID   Time   A_in   Time  B_in  Time  C_in
Ax   0.1    0.0    0.1   10    0.1   35  
By   0.2    02     0.2   10    0.2   20  
Cz   0.3    10     0.3   05    0.3   05  
Fr   0.4    25     0.4   10    0.4   05  
Exp  0.5    0.0    0.5   0.0   0.5  0.0

Now, from each "_in" column, I want to find the IDs corresponding to the 2 smallest values (when values tie, all matching IDs are kept, which is why the columns below have different lengths). So my second desired outcome is:

A_in   B_in  C_in
Ax     Cz    Cz 
By     Exp   Fr 
Exp          Exp

Upvotes: 0

Views: 491

Answers (3)

Ed Morton

Reputation: 203229

Please just ask 1 question at a time. Here's how to do the first thing you asked about:

$ cat tst.awk
BEGIN   { OFS="\t" }
NR==FNR { if ($1=="Exp") split($0,exps); next }   # 1st pass: save the "Exp" row
FNR==1  { $1=$1; print; next }                    # 2nd pass: reformat and print the header
{
    # odd-numbered fields become the absolute difference from the "Exp" row
    # (this turns the non-numeric ID column into 0, as shown in the output
    # below); even-numbered (Time) fields pass through unchanged
    for (i=1; i<=NF; i++) {
        val = ( (i-1) % 2 ? $i : exps[i] - $i )
        printf "%s%s", (val < 0 ? -val : val), (i<NF ? OFS : ORS)
    }
}

$ awk -f tst.awk file file
ID      Time    A_in    Time    B_in    Time    C_in
0       0.1     0       0.1     10      0.1     35
0       0.2     2       0.2     10      0.2     20
0       0.3     10      0.3     5       0.3     5
0       0.4     25      0.4     10      0.4     5
0       0.5     0       0.5     0       0.5     0

The above will work efficiently and robustly using any awk in any shell on every UNIX box.

If, after reading this, re-reading the previous awk answers you've received, and looking up the awk man page, you still need help with the 2nd thing you asked about, then ask a new standalone question just about that.

Upvotes: 0

KamilCuk

Reputation: 140960

After some fun and an hour or two I wrote this abomination:

cat <<EOF >file
ID   Time   A_in   Time  B_in  Time  C_in
Ax   0.1    10     0.1   15    0.1   45  
By   0.2    12     0.2   35    0.2   30  
Cz   0.3    20     0.3   20    0.3   15  
Fr   0.4    35     0.4   15    0.4   05  
Exp  0.5    10     0.5   25    0.5   10
EOF
# fix stackoverflow formatting
# input file should be separated with tabs
<file tr -s ' ' | tr ' ' '\t' > file2
mv file2 inputfile

# read headers to an array
IFS=$'\t' read -r -a hdrs < <(head -n1 inputfile)

# exp line read into an array
IFS=$'\t' read -r -a exps < <(grep -m1 $'^Exp\t' inputfile)

# column count
colcnt="${#hdrs[@]}"
if [ "$colcnt" -eq 0 ]; then 
    echo >&2 "ERROR - must be at least one column"
    exit 1
fi

# numbers of those columns whose headers have the _in suffix
incolnums=$(
    paste <(
        printf "%s\n" "${hdrs[@]}"
    ) <(
        # puff, the numbers will start from zero because bash indexes arrays from zero,
        # but `cut` indexes fields from 1, so just keep in mind it's from 0
        seq 0 $((colcnt - 1))
    ) |
    grep $'_in\t' |
    cut -f2
)

# read the input file
{
    # preserve header line
    IFS= read -r hdrline
    ( IFS=$'\t'; printf "%s\n" "$hdrline" )

    # ok. read the file field by field
    # I think we could awk here
    while IFS=$'\t' read -r -a vals; do

        # for each column number with _in suffix
        while IFS= read -r incolnum; do

            # update the column value
            # I use bc for float calculations
            vals[$incolnum]=$(bc <<-EOF
                define abs(i) {
                    if (i < 0) return (-i)
                    return (i)
                }
                scale=2
                abs(${vals[$incolnum]} - ${exps[$incolnum]})
EOF
            )

        done <<<"$incolnums"

        # output the line
        ( IFS=$'\t'; printf "%s\n" "${vals[*]}" )

    done

} < inputfile > MyFirstDesiredOutcomeIsThis.txt

# ok so, first part done

{
    # output headers names with _in suffix
    printf "%s\n" "${hdrs[@]}" | 
    grep '_in$' |
    tr '\n' '\t' |
    # fix tr's trailing tab: turn it back into a newline
    sed 's/\t$/\n/'

    # puff
    # output the IDs corresponding to the 2 smallest values of the specified column
    # @arg: $1 column number
    tmpf() {
        # remove header line
        <MyFirstDesiredOutcomeIsThis.txt tail -n+2 |
        # extract only this column
        cut -f$(($1 + 1)) |
        # unique numeric sort and extract two smallest values
        sort -n -u | head -n2 |
        # now, well, extract the id's that match the numbers
        # append numbers with tab (to match the separator)
        # suffix numbers with dollar (to match end of line)
        sed 's/^/\t/; s/$/$/;' |
        # how good is grep at buffering(!)
        grep -f /dev/stdin <(
            <MyFirstDesiredOutcomeIsThis.txt tail -n+2 |
            cut -f1,$(($1 + 1))
        ) |
        # extract numbers only
        cut -f1
    }

    # the following is something like foldr $'\t' $(tmpf ...) for each $incolnums
    # we need to buffer here, we are joining the output column-wise
    output=""
    while IFS= read -r incolnum; do
        output=$(<<<$output paste - <(tmpf "$incolnum"))
    done <<<"$incolnums"

    # because we start with an empty $output, paste inserts leading tabs
    # remove them ... and finally output $output
    <<<"$output" cut -f2-

}  > MySecondDesiredOutcomeIs.txt

# fix formatting to post it on stackoverflow
# files have tabs, and column will output them with space
# which is just enough
echo '==> MyFirstDesiredOutcomeIsThis.txt <=='
column -t -s$'\t' MyFirstDesiredOutcomeIsThis.txt
echo
echo '==> MySecondDesiredOutcomeIs.txt <=='
column -t -s$'\t' MySecondDesiredOutcomeIs.txt

The script will output:

==> MyFirstDesiredOutcomeIsThis.txt <==
ID   Time  A_in  Time  B_in  Time  C_in
Ax   0.1   0     0.1   10    0.1   35
By   0.2   2     0.2   10    0.2   20
Cz   0.3   10    0.3   5     0.3   5
Fr   0.4   25    0.4   10    0.4   5
Exp  0.5   0     0.5   0     0.5   0

==> MySecondDesiredOutcomeIs.txt <==
A_in  B_in  C_in
Ax    Cz    Cz
By    Exp   Fr
Exp         Exp

Written and tested at tutorialspoint.

I use bash and coreutils/moreutils to manipulate the file. First I identify the numbers of the columns whose headers end with the _in suffix. Then I buffer the values stored in the Exp line.

Then I just read the file line by line, field by field, and for each field in a column whose header ends with the _in suffix, I subtract the Exp-line value from the field value. I think this part should be the slowest (I use a plain while IFS=$'\t' read -r -a vals loop), but some smart awk scripting could speed this process up, as sketched below. This generates your "first desired outcome", as you called it.
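For reference, here is a minimal sketch of that awk speed-up (my addition, not part of the original script). It assumes the tab-separated inputfile produced above and reads it twice, like Ed Morton's answer:

awk 'BEGIN { FS=OFS="\t" }
     # 1st read: remember the Exp row
     NR==FNR { if ($1=="Exp") split($0, exps); next }
     # 2nd read: note which columns end in _in, print the header
     FNR==1  { for (i=1; i<=NF; i++) if ($i ~ /_in$/) incol[i]=1; print; next }
     # replace each _in field with its absolute difference from the Exp row
     {
         for (i in incol) { v = $i - exps[i]; $i = (v < 0 ? -v : v) }
         print
     }' inputfile inputfile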

Then I need to output only the header names ending with the _in suffix. Then, for each column number that ends with the _in suffix, I need to identify the 2 smallest values in the column. I use a plain sort -n -u | head -n2. Then it gets a little tricky: I need to extract the IDs that have one of those 2 smallest values in such a column. This is a job for grep -f. I prepare the proper regexes in the input using sed and let grep -f /dev/stdin do the filtering job; a tiny standalone demo of the trick follows.
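To see that sed + grep -f /dev/stdin trick in isolation, here is a self-contained demo with made-up values (my illustration; the tmpf function above does the same thing against the real columns):

printf 'Ax\t0\nBy\t2\nCz\t10\n' > demo.txt
# turn the two smallest values into anchored patterns: "<tab>0$" and "<tab>2$"
printf '0\n2\n' |
sed 's/^/\t/; s/$/$/' |
# keep the rows whose value field matches one of the patterns,
# then keep the IDs only -- prints "Ax" and "By", one per line
grep -f /dev/stdin demo.txt |
cut -f1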

Upvotes: 1

choroba

Reputation: 241808

Perl to the rescue!

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

@ARGV = (@ARGV[0, 0]);  # Read the input file twice.

my @header = split ' ', <>;                          # first line: the header
my @in = grep $header[$_] =~ /_in$/, 0 .. $#header;  # indices of the *_in columns
$_ = <> until eof;                                   # skip to the last line,
my @exp = split;                                     # which is the Exp row

my @min;
<>;                     # second read: skip the header line
while (<>) {
    my @F = split;
    for my $i (@in) {
        $F[$i] = abs($F[$i] - $exp[$i]);
        # track the two smallest [value, ID] pairs per column
        # (the final Exp line itself is skipped via "unless eof")
        @{ $min[$i] }[0, 1]
            = sort { $a->[0] <=> $b->[0] }
                   [$F[$i], $F[0]], grep defined, @{ $min[$i] // [] }
            unless eof;
    }
    say join "\t", @F;
}

print "\n";
say join "\t", @header[@in];
for my $index (0, 1) {
    for my $i (@in) {
        next unless $header[$i] =~ /_in$/;
        print $min[$i][$index][1], "\t";
    }
    print "\n";
}

It reads the file twice. In the first read, it just remembers the first line as the @header array and the last line as the @exp array.

In the second read, it subtracts the corresponding exp value from each _in column. It also stores the two smallest numbers (together with their IDs) in the @min array, at the position corresponding to the column position.

Formatting the numbers (i.e. 0.0 instead of 0 and 02 instead of 2) is left as an exercise for the reader. Same with redirecting the output to several different files.
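For reference, if the script is saved as, say, subtract.pl (the name is my placeholder), it takes the input file as its only argument; both tables go to standard output, separated by a blank line:

$ perl subtract.pl File.txt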

Upvotes: 2
