Reputation: 1907
I work with huge gene expression files; each column represents one sample and each row represents the expression of one specific probe (the same probes are used for every sample). For example,
        Sample1  Sample2  ...  SampleM
Probe1
Probe2
...
ProbeN
I can have 43000+ probes and >50 samples. Although I could technically use a 2D array, this would not be efficient once I get files with even more samples. Hence, I was thinking about making multiple passes over the same file (a new column each time), applying the algorithm to each column, and printing the results to a separate file.
I tried a rewind function to start over, but the program doesn't follow the same instructions on the later passes.
for (i = ARGC; i > ARGIND; i--)
    ARGV[i] = ARGV[i-1]
ARGC++
ARGV[ARGIND+1] = FILENAME
nextfile
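For reference, a requeue of this kind needs a guard so the file isn't re-added forever. A minimal, portable sketch (the sample file, its two columns, and the hard-coded row and column counts are illustrative assumptions):

```shell
# Build a tiny two-column sample file (made-up data).
cat > data.tsv <<'EOF'
a1 b1
a2 b2
EOF

# Each pass prints one column; on the last row of a pass, the file is
# appended to ARGV again unless every column has been handled.
awk '
FNR == 1 { col++ }                    # FNR resets on each pass
{ print $col }
FNR == 2 && col < 2 {                 # 2 rows, 2 columns in this sample
    ARGV[ARGC++] = FILENAME           # requeue the file for the next column
}
' data.tsv
```

This prints column 1 (a1, a2) followed by column 2 (b1, b2); the guard `col < 2` is what stops the rewinding.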
Do you have any idea?
Thank you!
Upvotes: 0
Views: 546
Reputation: 2514
I was beaten to the punch, but since I'd already worked this out, here's an example similar to Paul Hicks's that will append the contents of each column to a file named after the column number.
#!/bin/bash
fieldCnt=$(head -n1 "$1" | awk '{print NF}')
cnt=1
while [ "$cnt" -le "$fieldCnt" ]; do
    awk 'out==""{out=FILENAME"."v} {print $v >> out} END{close(out)}' v="$cnt" "$1"
    cnt=$((cnt+1))
done
If the data filename was data, then it would make a data.1, data.2, and so on up to the number of columns in the file. You'd call it like myscript data. You could add probe work to the body of the awk in the loop (or, less messy, put that into a file and use awk -f awkfile v=$cnt $1 as in Paul Hicks's example).
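A quick self-contained run of the loop above, with a made-up two-column file, shows the per-column files it creates:

```shell
# Made-up sample data; real files would have 43000+ rows.
cat > data <<'EOF'
p1s1 p1s2
p2s1 p2s2
EOF
rm -f data.1 data.2                   # >> appends, so start clean

fieldCnt=$(head -n1 data | awk '{print NF}')
cnt=1
while [ "$cnt" -le "$fieldCnt" ]; do
    awk 'out==""{out=FILENAME"."v} {print $v >> out} END{close(out)}' v="$cnt" data
    cnt=$((cnt+1))
done

cat data.1    # column 1: p1s1, p2s1
cat data.2    # column 2: p1s2, p2s2
```

Each awk pass reads the whole file once and writes only its own column, so memory use stays flat no matter how many samples the file has.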
Upvotes: 0
Reputation: 14009
From a memory-use point of view, this sounds like a job for pipes and shell scripts. If your awk script takes its input from stdin, writes its output to stdout, and takes the column number as a parameter, you can achieve what you want quite easily. It would also allow you to work in a loop or in a single command-line with several pipes.
cat gene-file.in | awk -f yourscript.awk -v col=1 | awk -f yourscript.awk -v col=2 | awk -f yourscript.awk -v col=3 > gene-file.out
... or ...
#!/bin/bash
cp gene-file.in gene-file.tmp.1
for (( col = 1 ; col <= 10 ; col++ )) ; do
    awk -f yourscript.awk -v col=$col gene-file.tmp.1 > gene-file.tmp.2
    mv gene-file.tmp.2 gene-file.tmp.1
done
mv gene-file.tmp.1 gene-file.out
Or any number of alternative ways of accomplishing the same thing. This approach would be slower, due to more file writes, but writing a file 50 times or more isn't huge; your disk cache will cope well.
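For the pipeline version to chain, yourscript.awk has to reprint every line after processing its column. A hypothetical stand-in (the doubling is made up; any per-column computation works):

```shell
# Hypothetical one-column filter: transforms column "col" (here it just
# doubles it) and reprints the whole line so the next stage sees all columns.
cat > yourscript.awk <<'EOF'
{ $col = $col * 2; print }
EOF

printf '1 2\n3 4\n' |
    awk -f yourscript.awk -v col=1 |
    awk -f yourscript.awk -v col=2
```

Each stage rewrites one column and forwards the rest untouched, which is what lets a single pipeline handle several columns in sequence.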
Upvotes: 1