Johnathan

Reputation: 1907

How to do multiple passes with awk?

I work with huge files (gene expression files): each column represents one sample and each row represents the expression of one specific probe (the same probes are used for every sample). For example,

Sample1

Probe1
Probe2
...
ProbeN

I can have 43000+ probes and more than 50 samples. Although I could technically use a 2D array, that would not scale once I get files with even more samples. Hence, I was thinking of making multiple passes over the same file (a new column each time), applying the algorithm to each column, and printing the result to a separate file.
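For illustration, one such pass over a single column might look like this (the filenames, the toy data, and the fixed column number are made up; the real algorithm would replace the plain print):

```shell
# Toy sketch of one pass: pull column 2 of a whitespace-separated
# matrix into its own file, one awk invocation per column.
printf 'S1 S2\n10 20\n30 40\n' > expr.tmp   # tiny stand-in matrix
awk -v col=2 '{ print $col }' expr.tmp > expr.col2
```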

I tried a rewind function to start over, but the program doesn't follow the same instructions on the second pass.

for (i = ARGC; i > ARGIND; i--)    # shift the remaining arguments up
    ARGV[i] = ARGV[i-1]

ARGC++                             # tell gawk there is one more file
ARGV[ARGIND+1] = FILENAME          # re-queue the current file

nextfile                           # and move on to it

Do you have any idea?

Thank you!

Upvotes: 0

Views: 546

Answers (2)

n0741337

Reputation: 2514

I was beaten to the punch, but since I'd already worked this out, here's an example, similar to Paul Hicks's, that will append the contents of each column to a file named after the column number.

#!/bin/bash

fieldCnt=$(head -n1 "$1" | awk '{print NF}')
cnt=1
while [ "$cnt" -le "$fieldCnt" ]
do
    awk 'out==""{out=FILENAME"."v} {print $v >> out} END{close(out)}' v="$cnt" "$1"
    cnt=$((cnt+1))
done

If the data filename were data, it would create data.1, data.2, and so on up to the number of columns in the file. You'd call it like myscript data. You could add the probe work to the body of the awk in the loop (or, less messily, put it into a file and use awk -f awkfile v=$cnt $1, as in Paul Hicks's example).
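As a sketch of that last idea, an awkfile that computes a per-column mean (the mean is just a placeholder for whatever the real probe work is, and colmean.awk is a made-up name) could look like:

```shell
# Hypothetical awkfile: mean of column v, skipping a one-line header.
# Called from the loop as: awk -f colmean.awk v=$cnt "$1"
cat > colmean.awk <<'EOF'
NR > 1 { sum += $v; n++ }                              # skip the header row
END    { if (n) printf "col %d mean %.2f\n", v, sum / n }
EOF
printf 'S1 S2\n10 20\n30 40\n' > data
awk -f colmean.awk v=1 data    # prints: col 1 mean 20.00
```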

Upvotes: 0

Paul Hicks

Reputation: 14009

From a memory-use point of view, this sounds like a job for pipes and shell scripts. If your awk script takes its input from stdin, writes its output to stdout, and takes the column number as a parameter, you can achieve what you want quite easily. It would also allow you to work in a loop or on a single command line with several pipes.

cat gene-file.in | awk -f yourscript.awk -v col=1 | awk -f yourscript.awk -v col=2 | awk -f yourscript.awk -v col=3 > gene-file.out

.. or ..

#!/bin/bash
cp gene-file.in gene-file.tmp.1
for (( col = 1 ; col <= 10 ; col++ )) ; do
  awk -f yourscript.awk -v col=$col gene-file.tmp.1 > gene-file.tmp.2
  mv gene-file.tmp.2 gene-file.tmp.1
done
mv gene-file.tmp.1 gene-file.out

Or any number of alternative ways of accomplishing the same thing. This way of doing things would be slower, due to the extra file writes, but writing a file 50 or so times isn't huge; your disk cache will cope well.
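A yourscript.awk fitting that contract might look like the sketch below; the doubling is only a stand-in for the real per-probe algorithm, and everything here reads stdin and writes stdout so it can sit anywhere in the pipeline:

```shell
# Hypothetical yourscript.awk: transform only the requested column
# (here it just doubles the value) and pass every row through.
cat > yourscript.awk <<'EOF'
NR == 1 { print; next }       # pass the header straight through
{ $col = $col * 2; print }    # rewrite just column col, print the row
EOF
printf 'S1 S2\n1 2\n3 4\n' | awk -f yourscript.awk -v col=1
```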

Upvotes: 1
