Reputation: 31
Here's my input:
chr1 58962 -0.042053 -22.525086 -20.817409 -19.525688
chr1 58989 -0.014479 -14.459352 -12.824315 -11.744024
chr1 59155 -0.062963 -13.810858 -12.749009 -12.102778
chr1 59256 -0.014105 -7.371202 -9.117587 -11.525907
I'm looking for a way, in bash, to get the index of the maximum value in each row. I don't want to take the first two columns into account.
I could do it very simply in R:
library(data.table) # fread() comes from data.table
data=fread(myfile)
maxindex=apply(data[,3:6],1,which.max)
The output is an array containing the index for each row; this is the kind of output I want in the end. In this case:
maxindex= 1 1 1 1
Unfortunately the whole file is 32 GB (a big table containing 300000 rows and 8183 columns), so R can't handle it, even after I subset the original file. I've read that bash isn't made to work row by row, but is there still a way to do what I want to do?
Upvotes: 2
Views: 1211
Reputation: 11
If you want the script written with basic bash operations, you could do something like this:
#!/bin/bash

# Function to find the index of the max value of a one-dimensional array
findMax()
{
    [[ -z $2 ]] && return  # Exit early if the line is empty or has a single field
    declare -a pararr=($@) # Insert the input into an array we can work with

    # Basic brute-force algorithm to find the highest value in the array;
    # indices 0 and 1 hold the first two columns, so the scan starts at index 2
    maxInd=2
    for (( i = 3; i < $#; i++ )); do
        (( $(echo "${pararr[$i]} > ${pararr[$maxInd]}" | bc) )) && maxInd=$i
    done
    echo -n " $(( maxInd - 2 ))" # Report a 1-based index among the data columns
}

echo -n "Maxindex:"
# Feed findMax row by row from the input file (the script's last argument)
while read -r line; do
    findMax $line
done < "${!#}"
echo # Append a newline at the end
This script takes a file that is formatted like your example and searches for the max index row by row. However, each row in the file must be separated by a newline as your example shows, otherwise some wonky stuff may happen. You can of course extend the script to deal with other formats if you wish.
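For example, assuming the script is saved as findmax.sh and the four sample rows are in input.txt (both file names are just for illustration), a run might look like this:
chmod +x findmax.sh
./findmax.sh input.txt
# Expected output for the sample rows:
# Maxindex: 1 1 1 1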
However, if you want to do this operation on very large files, I think the solutions provided by the others here will be much better suited. I don't really know much about the overhead of bash since I use C/C++ for most performance-critical applications, but I'm guessing it's not very efficient.
(( $(echo "${pararr[$i]} > ${pararr[$maxInd]}" | bc) )) && maxInd=$i
This part of the script is really ugly, but I don't know of any better way to do floating-point comparisons in bash. What we are doing here is comparing the value at our current position in the row with the largest value we have found so far. So this:
echo "${pararr[$i]} > ${pararr[$maxInd]}
Might expand to something like this
0.356 > 1.567
We then pipe it into bc, which does the floating-point comparison for us and prints 1 if the expression is true, 0 otherwise. If the value at our current position is greater than the greatest value we have found so far, we set maxInd to that index. Hope this helps.
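To see the bc trick in isolation, a quick sketch with made-up numbers:
echo "0.356 > 1.567" | bc # prints 0: the test is false, maxInd is left alone
echo "-7.37 > -9.12" | bc # prints 1: the test is true, maxInd gets updated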
Upvotes: 0
Reputation: 92854
Use the following awk solution; it will be faster than the perl approach on "big" files:
awk '{ m=$3; p=1; for(i=4;i<=NF;i++) {
if ($i>m) { m=$i; p=i-2 } } printf "%d ",p }' file > max_indices
m=$3; p=1
- initial maximum value (the 3rd field value) and its initial index (the 1st data column)
for(i=4;i<=NF;i++)
- iterating through the remaining fields
if ($i>m) { m=$i; p=i-2 }
- capturing the maximal value and its index; i-2 maps field number i to a 1-based index among the data columns, as the example run below shows
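As a quick check, with the four sample rows from the question saved in file and the redirection dropped, the one-liner should print the indices directly:
awk '{ m=$3; p=1; for(i=4;i<=NF;i++) {
if ($i>m) { m=$i; p=i-2 } } printf "%d ",p }' file
# output: 1 1 1 1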
Upvotes: 2
Reputation: 241828
Perl solution:
perl -ane '$r = 2;
    for my $i (3 .. $#F) {
        $r = $i if $F[$i] > $F[$r];
    }
    print $r - 1, " ";
' < input-file > output-file
-n
processes the input line by line
-a
splits each line on whitespace into the @F array
$r
stores the index of the maximum (set to 2 before processing each line)
Upvotes: 0