Reputation: 39
My current working file is like this
ID Time A_in Time B_in Time C_in
Ax 0.1 10 0.1 15 0.1 45
By 0.2 12 0.2 35 0.2 30
Cz 0.3 20 0.3 20 0.3 15
Fr 0.4 35 0.4 15 0.4 05
Exp 0.5 10 0.5 25 0.5 10
My columns of interest are those with an "_in" header. In those columns, I want to subtract the value in the row whose ID is "Exp" from every other row's value, keeping the absolute difference. Let's consider the A_in column, where the "Exp" row value is 10: I want to subtract 10 from all the other elements of that column.
My amateur code is like this (I know it is silly)
# This part grabs all the values in the "Exp" row
Exp=$( awk 'BEGIN{OFS="\t";
PROCINFO["sorted_in"] = "@val_num_asc"}
FNR==1 { for (n=2;n<=NF;n++) { if ($n ~ /_in$/) cols[$n]=n; }}  # suffix must match the actual headers
$1=="Exp" {
for (c in cols){
shift = $cols[c]
printf "%s ", shift
}
}
' File.txt | paste -sd " ")
Exp_array=($Exp)
z=1
for i in "${Exp_array[@]}"
do
z=$(echo 2+$z | bc -l)
Exp_point=$i
awk -v d="$Exp_point" -v loop="$z" '
BEGIN{OFS="\t";
PROCINFO["sorted_in"] = "@val_num_asc"}
function abs(x) {return x<0?-x:x}
FNR==1 { for (n=2;n<=NF;n++) { if ($n ~ /_in$/) cols[$n]=n; }}
FNR>1{
$loop=abs($loop-d); print
}
' File.txt
done
My first desired outcome is this:
ID Time A_in Time B_in Time C_in
Ax 0.1 0.0 0.1 10 0.1 35
By 0.2 02 0.2 10 0.2 20
Cz 0.3 10 0.3 05 0.3 05
Fr 0.4 25 0.4 10 0.4 05
Exp 0.5 0.0 0.5 0.0 0.5 0.0
Now, from each "_in" column, I want to find the IDs of the 2 smallest values. So my second desired outcome is:
A_in B_in C_in
Ax Cz Cz
By Exp Fr
Exp Exp
Upvotes: 0
Views: 491
Reputation: 203229
Please just ask 1 question at a time. Here's how to do the first thing you asked about:
$ cat tst.awk
BEGIN { OFS="\t" }
NR==FNR { if ($1=="Exp") split($0,exps); next }
FNR==1 { $1=$1; print; next }
{
    for (i=1; i<=NF; i++) {
        val = ( (i == 1) || ((i-1) % 2) ? $i : exps[i] - $i )
        printf "%s%s", (val < 0 ? -val : val), (i<NF ? OFS : ORS)
    }
}
$ awk -f tst.awk file file
ID   Time  A_in  Time  B_in  Time  C_in
Ax   0.1   0     0.1   10    0.1   35
By   0.2   2     0.2   10    0.2   20
Cz   0.3   10    0.3   5     0.3   5
Fr   0.4   25    0.4   10    0.4   5
Exp  0.5   0     0.5   0     0.5   0
The above will work efficiently and robustly using any awk in any shell on every UNIX box.
If, after reading this, re-reading the previous awk answers you've received, and looking up the awk man page, you still need help with the 2nd thing you asked about, then ask a new, standalone question just about that.
Upvotes: 0
Reputation: 140960
After some fun and an hour or two I wrote this abomination:
cat <<EOF >file
ID Time A_in Time B_in Time C_in
Ax 0.1 10 0.1 15 0.1 45
By 0.2 12 0.2 35 0.2 30
Cz 0.3 20 0.3 20 0.3 15
Fr 0.4 35 0.4 15 0.4 05
Exp 0.5 10 0.5 25 0.5 10
EOF
# fix stackoverflow formatting
# input file should be separated with tabs
<file tr -s ' ' | tr ' ' '\t' > file2
mv file2 inputfile
# read headers to an array
IFS=$'\t' read -r -a hdrs < <(head -n1 inputfile)
# exp line read into an array
IFS=$'\t' read -r -a exps < <(grep -m1 $'^Exp\t' inputfile)
# column count
colcnt="${#hdrs[@]}"
if [ "$colcnt" -eq 0 ]; then
echo >&2 "ERROR - must be at least one column"
exit 1
fi
# numbers of those columns which headers have _in suffix
incolnums=$(
paste <(
printf "%s\n" "${hdrs[@]}"
) <(
# puff, the numbers will start from zero cause bash indexes arrays from zero
# but `cut` indexes fields from 1, so.. just keep in mind it's from 0
seq 0 $((colcnt - 1))
) |
grep $'_in\t' |
cut -f2
)
# read the input file
{
# preserve header line
IFS= read -r hdrline
( IFS=$'\t'; printf "%s\n" "$hdrline" )
# ok. read the file field by field
# I think we could awk here
while IFS=$'\t' read -r -a vals; do
# for each column number with _in suffix
while IFS= read -r incolnum; do
# update the column value
# I use bc for float calculations
vals[$incolnum]=$(bc <<-EOF
define abs(i) {
if (i < 0) return (-i)
return (i)
}
scale=2
abs(${vals[$incolnum]} - ${exps[$incolnum]})
EOF
)
done <<<"$incolnums"
# output the line
( IFS=$'\t'; printf "%s\n" "${vals[*]}" )
done
} < inputfile > MyFirstDesiredOutcomeIsThis.txt
# ok so, first part done
{
# output headers names with _in suffix
printf "%s\n" "${hdrs[@]}" |
grep '_in$' |
tr '\n' '\t' |
# omg, fix tr, so stupid
sed 's/\t$/\n/'
# puff
# output the corresponding ID of 2 smallest values of the specified column number
# @arg: $1 column number
tmpf() {
# remove header line
<MyFirstDesiredOutcomeIsThis.txt tail -n+2 |
# extract only this column
cut -f$(($1 + 1)) |
# unique numeric sort and extract two smallest values
sort -n -u | head -n2 |
# now, well, extract the id's that match the numbers
# append numbers with tab (to match the separator)
# suffix numbers with dollar (to match end of line)
sed 's/^/\t/; s/$/$/;' |
# how good is grep at buffering(!)
grep -f /dev/stdin <(
<MyFirstDesiredOutcomeIsThis.txt tail -n+2 |
cut -f1,$(($1 + 1))
) |
# extract numbers only
cut -f1
}
# the following is something like foldr $'\t' $(tmpf ...) for each $incolnums
# we need to buffer here, we are joining the output column-wise
output=""
while IFS= read -r incolnum; do
output=$(<<<$output paste - <(tmpf "$incolnum"))
done <<<"$incolnums"
# because we start with empty $output, paste inserts leading tabs
# remove them ... and finally output $output
<<<"$output" cut -f2-
} > MySecondDesiredOutcomeIs.txt
# fix formatting to post it on stackoverflow
# files have tabs, and column will output them with space
# which is just enough
echo '==> MyFirstDesiredOutcomeIsThis.txt <=='
column -t -s$'\t' MyFirstDesiredOutcomeIsThis.txt
echo
echo '==> MySecondDesiredOutcomeIs.txt <=='
column -t -s$'\t' MySecondDesiredOutcomeIs.txt
The script will output:
==> MyFirstDesiredOutcomeIsThis.txt <==
ID Time A_in Time B_in Time C_in
Ax 0.1 0 0.1 10 0.1 35
By 0.2 2 0.2 10 0.2 20
Cz 0.3 10 0.3 5 0.3 5
Fr 0.4 25 0.4 10 0.4 5
Exp 0.5 0 0.5 0 0.5 0
==> MySecondDesiredOutcomeIs.txt <==
A_in B_in C_in
Ax Cz Cz
By Exp Fr
Exp Exp
Written and tested at tutorialspoint.
I use bash and core-/more-utils to manipulate the file. First I identify the numbers of the columns whose headers end with the _in suffix. Then I buffer the values stored in the Exp line.
Then I just read the file line by line, field by field, and for each field in a column whose header ends with the _in suffix, I subtract the Exp line's value from the field value. I think this part should be the slowest (I use a plain while IFS=$'\t' read -r -a vals loop), but some smart awk scripting could speed this process up. This generates your "first desired outcome", as you called it.
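For illustration, that read loop could indeed be collapsed into a single awk program along these lines (an editor's sketch, not part of the original answer; it assumes tab-separated input and generates the sample file inline so the snippet stands alone):

```shell
# Build the tab-separated sample input so the snippet is self-contained.
printf 'ID\tTime\tA_in\tTime\tB_in\tTime\tC_in\n' >  inputfile
printf 'Ax\t0.1\t10\t0.1\t15\t0.1\t45\n'          >> inputfile
printf 'By\t0.2\t12\t0.2\t35\t0.2\t30\n'          >> inputfile
printf 'Cz\t0.3\t20\t0.3\t20\t0.3\t15\n'          >> inputfile
printf 'Fr\t0.4\t35\t0.4\t15\t0.4\t05\n'          >> inputfile
printf 'Exp\t0.5\t10\t0.5\t25\t0.5\t10\n'         >> inputfile

# Read the file twice: pass 1 (NR==FNR) remembers the Exp row; pass 2
# subtracts it, as an absolute difference, from every _in column.
awk -F'\t' -v OFS='\t' '
    function abs(x) { return x < 0 ? -x : x }
    NR == FNR { if ($1 == "Exp") split($0, exps, FS); next }
    FNR == 1  { for (i = 1; i <= NF; i++) if ($i ~ /_in$/) incol[i] = 1; print; next }
    { for (i in incol) $i = abs($i - exps[i]); print }
' inputfile inputfile
```

This prints the same table as MyFirstDesiredOutcomeIsThis.txt above (modulo 05 vs 5, as in the script's own output).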
Then I need to output only the header names ending with the _in suffix. Then, for each such column, I need to identify the 2 smallest values in it; I use a plain sort -n -u | head -n2. Then it gets a little tricky: I need to extract the IDs that have one of the corresponding 2 smallest values in such a column. This is a job for grep -f. I prepare proper regexes in the input using sed and let grep -f /dev/stdin do the filtering job.
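To make that sed + grep -f /dev/stdin step easier to follow, here is a stripped-down illustration of just the trick (toy data, not the script's real intermediate files):

```shell
# Given ID<TAB>value pairs and the two smallest values, recover the IDs.
printf 'Ax\t0\nBy\t2\nCz\t10\nFr\t25\nExp\t0\n' > pairs

tab=$(printf '\t')
printf '0\n2\n' |                 # the 2 smallest values, e.g. from sort -n -u | head -n2
    sed "s/^/$tab/; s/\$/\$/" |   # "0" -> "<TAB>0$": anchor to the value column
    grep -f /dev/stdin pairs |    # keep only rows whose value matches a pattern
    cut -f1                       # print the IDs: Ax, By, Exp
```

The leading tab and trailing $ are what keep a value like 10 from being matched by the 0 pattern.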
Upvotes: 1
Reputation: 241808
Perl to the rescue!
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
@ARGV = (@ARGV[0, 0]); # Read the input file twice.
my @header = split ' ', <>;
my @in = grep $header[$_] =~ /_in$/, 0 .. $#header;
$_ = <> until eof;
my @exp = split;
my @min;
<>;
while (<>) {
    my @F = split;
    for my $i (@in) {
        $F[$i] = abs($F[$i] - $exp[$i]);
        @{ $min[$i] }[0, 1]
            = sort { $a->[0] <=> $b->[0] }
              [$F[$i], $F[0]], grep defined, @{ $min[$i] // [] }
            unless eof;
    }
    say join "\t", @F;
}
print "\n";
say join "\t", @header[@in];
for my $index (0, 1) {
    for my $i (@in) {
        next unless $header[$i] =~ /_in$/;
        print $min[$i][$index][1], "\t";
    }
    print "\n";
}
It reads the file twice. In the first read, it just remembers the first line as the @header array and the last line as the @exp array.
In the second read, it subtracts the corresponding exp value from each _in column. It also stores the two smallest numbers in the @min array at the position corresponding to the column position.
Formatting the numbers (i.e. 0.0 instead of 0 and 02 instead of 2) is left as an exercise for the reader. Same with redirecting the output to several different files.
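For the formatting left as an exercise, printf-style format strings cover both cases; a quick shell illustration (an editor's addition, not part of the answer):

```shell
# Zero-padded formats matching the question's desired output:
printf '%03.1f\n' 0   # one decimal place, zero-padded to width 3 -> 0.0
printf '%02d\n' 2     # zero-padded integer of width 2            -> 02
```

The same %03.1f and %02d specifiers work in awk's and Perl's printf/sprintf as well.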
Upvotes: 2