Reputation: 2157
I have a folder with around 1000 tab-delimited text files. Half of the files are called sampleX.features.tab and the other half are called sampleX.scores.tab.
The "sampleX" part is different for each pair of files, so there is:
sample1.features.tab
sample1.scores.tab
sample2.features.tab
sample2.scores.tab
sample3.features.tab
sample3.scores.tab
All files have the same number of lines.
From each .features.tab file I want to extract some columns:
cut -f1,5,9,10,19,20
From each .scores.tab file I want to extract two columns:
cut -f1,7
Then I want to combine all these columns in a new file called "sampleX.final.tab" (so sample1.final.tab, sample2.final.tab, ...),
and that's where I'm stuck. How can I pipe these things together in Linux?
Upvotes: 0
Views: 153
Reputation: 687
One way is to redirect the output of cut into files:
cut -f1,5,9,10,19,20 sample1.features.tab > features1
cut -f1,7 sample1.scores.tab > scores1
and then paste them together with:
paste features1 scores1
To do this for 1000s of files, I'd write a script that loops over the file names.
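A minimal sketch of such a loop, assuming every sampleX.features.tab has a matching sampleX.scores.tab in the same folder. The function name merge_samples and the use of mktemp for the intermediate files are my own choices for illustration; the output name sampleX.final.tab is taken from the question.

```shell
#!/usr/bin/env bash
# merge_samples is a hypothetical helper name, not part of any tool.
# It builds <prefix>.final.tab for every <prefix>.features.tab in a directory.
merge_samples() {
    local dir=${1:-.} features prefix scores tmp_f tmp_s
    for features in "$dir"/*.features.tab; do
        [ -e "$features" ] || continue       # the glob matched nothing
        prefix=${features%.features.tab}     # e.g. ./sample1
        scores=$prefix.scores.tab
        if [ ! -e "$scores" ]; then
            echo "missing $scores, skipping" >&2
            continue
        fi
        tmp_f=$(mktemp)
        tmp_s=$(mktemp)
        cut -f1,5,9,10,19,20 "$features" > "$tmp_f"
        cut -f1,7 "$scores" > "$tmp_s"
        paste "$tmp_f" "$tmp_s" > "$prefix.final.tab"
        rm -f "$tmp_f" "$tmp_s"
    done
}
```

Calling merge_samples /path/to/folder (or merge_samples with no argument for the current directory) should leave one .final.tab file next to each pair of input files.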
Update: The solution above is probably the easiest to remember (it's somewhat intuitive). However, if the columns from different files need to be combined on the fly (for example when plotting with gnuplot), the answer by user liborn works, namely
paste <( cut -f... file1 ) <( cut -f... file2 )
to write to stdout, or
paste <( cut -f... file1 ) <( cut -f... file2 ) > newfile
to write to newfile.
Upvotes: 2
Reputation: 2724
You're looking for process substitution. In Bash you do:
paste \
    <( cut -f1,5,9,10,19,20 sample1.features.tab ) \
    <( cut -f1,7 sample1.scores.tab ) \
    > sample1.out
To do this for your whole directory, you'll probably want something like this (GNU parallel needs to be installed):
ls *.scores.tab |
cut -f1 -d. |
parallel "paste <( cut -f1,5,9,10,19,20 {}.features.tab ) <( cut -f1,7 {}.scores.tab ) > {}.out"
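If GNU parallel is not available, the same process substitution can be wrapped in a plain Bash loop. This is only a sequential sketch: the function name paste_all is hypothetical, it assumes it is run from the directory that holds the files, and it keeps the .out suffix used above.

```shell
#!/usr/bin/env bash
# paste_all is a hypothetical name. Requires bash because of <( ... ).
# Runs in the current directory and writes <prefix>.out for each pair.
paste_all() {
    local scores prefix
    for scores in *.scores.tab; do
        [ -e "$scores" ] || continue             # the glob matched nothing
        prefix=${scores%.scores.tab}             # e.g. sample1
        [ -e "$prefix.features.tab" ] || continue
        paste <(cut -f1,5,9,10,19,20 "$prefix.features.tab") \
              <(cut -f1,7 "$scores") > "$prefix.out"
    done
}
```

The loop handles one pair at a time, so it will be slower than parallel on a multi-core machine, but it has no extra dependency.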
Upvotes: 1
Reputation: 4283
Here's an awk script to do this (note that each pair of files needs to fit in memory):
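# test.awk
#
# Usage: awk -f test.awk sampleX.features.tab sampleX.scores.tab
BEGIN {
    FS = "\t"    # the input is tab-delimited, so split on tabs only
}
{
    # pick the columns for this file type, joined with tabs
    if (FILENAME ~ /\.scores\.tab$/) {
        cols = $1 "\t" $7
    } else {
        cols = $1 "\t" $5 "\t" $9 "\t" $10 "\t" $19 "\t" $20
    }
    # append to whatever was already collected for this line number
    arr[FNR] = (FNR in arr) ? arr[FNR] "\t" cols : cols
}
END {
    for (i = 1; i <= FNR; i++) {
        print arr[i]
    }
}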
Then simply loop over your files:
# merge.sh
#
for i in {1..1000}
do
    features="sample$i.features.tab"
    scores="sample$i.scores.tab"
    final="sample$i.final.tab"
    # quote the names so unusual characters cannot split the arguments
    awk -f test.awk "$features" "$scores" > "$final"
done
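After merging this many files, a quick sanity check on the results can be worthwhile. The sketch below assumes the merged files are tab-separated and should contain exactly 8 columns (6 from the features file plus 2 from the scores file); check_final is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_final is a hypothetical helper: it succeeds only if every line of the
# given file has exactly 8 tab-separated fields.
check_final() {
    awk -F'\t' 'NF != 8 { bad = 1; exit } END { exit bad + 0 }' "$1"
}
```

It could be run over all results with a loop such as: for f in sample*.final.tab; do check_final "$f" || echo "unexpected column count: $f" >&2; done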
Upvotes: 0