Chris Shaw

Reputation: 19

Combining columns of data from files matching a string in bash

I have an unknown number of input files that all match a search string, let's say *.dat, and all have 2 columns of data and an equal number of rows. In bash I need to take the 2nd column of each file and append it as a new column in a single merged file.

Eg:

>>cat File1.dat
1   A
2   B
3   C
>>cat File2.dat
4   D
5   E
6   F
>>cat combined.dat
A   D
B   E
C   F

Here is the code I have tried; the approach I have gone for is to loop and append:

for filename in $(ls *.dat); do paste combined.dat <(awk '{print $2}' $filename) >> combined.dat; done

The output format can be anything so long as it's tab-delimited, and the key is that it must work on any number of input files, up to roughly 100, where the number isn't known in advance.

Upvotes: 0

Views: 39

Answers (2)

Socowi

Reputation: 27205

Awk

Since you already use awk, you could do the whole job in awk:

rm -f combined.dat
awk 'FNR<NR{d="\t"} {a[FNR]=a[FNR] d $2} END{for(i=1;i<=FNR;i++) print a[i]}' *.dat > combined.dat
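For example, recreating the question's sample files in a scratch directory (the setup lines are just for demonstration), the one-liner produces exactly the requested merged file:

```shell
# demo setup; File1.dat/File2.dat mirror the question's example data
cd "$(mktemp -d)"
printf '1\tA\n2\tB\n3\tC\n' > File1.dat
printf '4\tD\n5\tE\n6\tF\n' > File2.dat

rm -f combined.dat
# d is empty while reading the first file, a tab for every later file,
# so each file's 2nd column is appended as a new tab-separated column
awk 'FNR<NR{d="\t"} {a[FNR]=a[FNR] d $2} END{for(i=1;i<=FNR;i++) print a[i]}' *.dat > combined.dat
cat combined.dat
# A	D
# B	E
# C	F
```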

"Classic" solution by repeated paste

You can repeatedly paste combined.dat and the next found file. The only tricky part is getting the first paste right, where combined.dat does not exist or is empty. You could use an if, but that would be boring. Here we use a trick: paste acts like cat when used with only one argument. With arrays we can conveniently supply the optional extra argument. We also use sponge from moreutils to make sure that combined.dat is not mangled by concurrent reads and writes; if you don't want to install sponge, you have to use a temporary file or variables instead.

rm -f combined.dat
p=()
for f in *.dat; do
  awk '{print $2}' "$f" | paste "${p[@]}" - | sponge combined.dat
  p=(combined.dat)
done
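If you'd rather not install sponge, the same loop works with an explicit temporary file; this is a sketch of that variant (the mktemp call and the sample-data setup are illustrations, not part of the original answer):

```shell
# demo setup; File1.dat/File2.dat mirror the question's example data
cd "$(mktemp -d)"
printf '1\tA\n2\tB\n3\tC\n' > File1.dat
printf '4\tD\n5\tE\n6\tF\n' > File2.dat

rm -f combined.dat
tmp=$(mktemp)
p=()
for f in *.dat; do
  # build the next version in the temp file, then move it into place,
  # so paste never reads and writes combined.dat at the same time
  awk '{print $2}' "$f" | paste "${p[@]}" - > "$tmp"
  mv "$tmp" combined.dat
  p=(combined.dat)
done
cat combined.dat
# A	D
# B	E
# C	F
```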

Hacky solution using a single paste

Alternatively, you could build a bash command and execute it with eval. No worries, eval is safe here as printf %q ensures correct quoting.

rm -f combined.dat
eval "paste $(printf "<(awk '{printf \$2}' %q) " *.dat) > combined.dat"

Upvotes: 2

Caílin

Reputation: 51

A short draft; the way the newlines and tabs are inserted, in particular, could be optimized:

#!/bin/bash
# number of rows, taken from the first matching file (all files have equal rows)
nrLines=$(wc -l < "$(ls *dat | head -1)" | xargs)
i=1
while [ "${i}" -le "${nrLines}" ]; do
    for file in *dat; do
        # print the 2nd field of row i without a newline, then a tab
        awk -v line="${i}" 'NR==line {printf $2}' "${file}" >> consolidatedreport.txt
        echo -en "\t" >> consolidatedreport.txt
    done
    i=$((i+1))
    echo "" >> consolidatedreport.txt
done

Be careful that, depending on how you output data to your new file and how you iterate over your existing files, you might end up iterating over your newly created file. So be sure to either use a different extension than *dat if you iterate over all files with that ending (I used txt in the example) or place the resulting file in a subfolder.

Upvotes: 0
