user1987607

Reputation: 2157

Merging columns from multiple files in Linux

I have a folder with around 1000 tab-delimited text files. Half of the files are named sampleX.features.tab and the other half are named sampleX.scores.tab.

The "sampleX" part is different for each pair of files, so there is:

sample1.features.tab
sample1.scores.tab
sample2.features.tab
sample2.scores.tab
sample3.features.tab
sample3.scores.tab

All files have the same number of lines.

From each .features.tab file I want to extract some columns:

cut -f1,5,9,10,19,20

From each .scores.tab file I want to extract two columns:

cut -f1,7

Then I want to combine all these columns in a new file called "sampleX.final.tab" (so sample1.final.tab, sample2.final.tab, ...).

That's where I'm stuck. How can I pipe these things together in Linux?

Upvotes: 0

Views: 153

Answers (3)

One way is to redirect the output of cut into files:

cut -f1,5,9,10,19,20 sample1.features.tab > features1
cut -f1,7 sample1.scores.tab > scores1

and then paste them together:

paste features1 scores1

To do this for thousands of files, I'd write a script that loops over the file names.
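A minimal sketch of such a loop, assuming every sampleX.features.tab has a matching sampleX.scores.tab in the same directory. The function name merge_with_tempfiles and the use of a scratch directory from mktemp are illustrative choices, not part of the commands above.

```shell
#!/usr/bin/env bash
# Sketch: apply the cut-then-paste steps to every sample in a directory.
# merge_with_tempfiles is a hypothetical helper name used for illustration.
merge_with_tempfiles() {
  local dir="${1:-.}" features sample tmpdir
  tmpdir="$(mktemp -d)" || return 1
  for features in "$dir"/*.features.tab; do
    [ -e "$features" ] || continue            # the glob matched nothing
    sample="${features%.features.tab}"        # e.g. ./sample1
    cut -f1,5,9,10,19,20 "$features"          > "$tmpdir/features"
    cut -f1,7            "$sample.scores.tab" > "$tmpdir/scores"
    paste "$tmpdir/features" "$tmpdir/scores" > "$sample.final.tab"
  done
  rm -rf "$tmpdir"                            # remove the scratch files
}

merge_with_tempfiles .
```

Looping over the glob instead of a fixed range of numbers means the script does not need to know how the samples are numbered.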

Update: The solution above is probably the easiest to remember (it's somewhat intuitive). However, if the columns from different files need to be combined on the fly (for example, when plotting with gnuplot), the answer by user liborm works, namely

paste <( cut -f... file1 ) <( cut -f... file2 )

to stdout or

paste <( cut -f... file1 ) <( cut -f... file2 ) > newfile

to newfile.

Upvotes: 2

liborm

Reputation: 2724

You're looking for process substitution. In Bash you do:

paste \
  <( cut -f1,5,9,10,19,20 sample1.features.tab )\
  <( cut -f1,7 sample1.scores.tab )\
> sample1.out

To do this for your whole directory, you'll probably want something like this (it requires GNU parallel):

  ls *.scores.tab | 
    cut -f1 -d. | 
    parallel "paste <( cut -f1,5,9,10,19,20 {}.features.tab ) <( cut -f1,7 {}.scores.tab ) > {}.out"
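If GNU parallel is not installed, a plain Bash loop can do the same thing sequentially. This is only a sketch: the function name merge_all is hypothetical, and the output is written to sampleX.final.tab (the name asked for in the question) rather than sampleX.out. It must be run with Bash, since the <( ... ) syntax is not available in plain sh.

```shell
#!/usr/bin/env bash
# Sketch: sequential equivalent of the parallel one-liner.
# Requires Bash for the <( ... ) process substitution.
# merge_all is a hypothetical helper name used for illustration.
merge_all() {
  local dir="${1:-.}" scores sample
  for scores in "$dir"/*.scores.tab; do
    [ -e "$scores" ] || continue              # the glob matched nothing
    sample="${scores%.scores.tab}"            # strip the suffix, keep the path
    paste \
      <( cut -f1,5,9,10,19,20 "$sample.features.tab" ) \
      <( cut -f1,7 "$scores" ) \
      > "$sample.final.tab"
  done
}

merge_all .
```

Using a glob instead of parsing the output of ls also keeps the loop safe for file names that contain spaces.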

Upvotes: 1

BeyelerStudios

Reputation: 4283

Here's an awk script to do this (note that each pair of files needs to fit in memory):

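# test.awk
#
# Collects the selected columns of both input files per line number,
# then prints the merged, tab-separated lines.

BEGIN {
  FS = "\t"   # the input files are tab-delimited
}

{
  if (FILENAME ~ /\.scores\.tab$/) {
    cols = $1 "\t" $7
  } else {
    cols = $1 "\t" $5 "\t" $9 "\t" $10 "\t" $19 "\t" $20
  }
  # append to whatever was already stored for this line number
  arr[FNR] = (FNR in arr) ? (arr[FNR] "\t" cols) : cols
}

END {
  # FNR still holds the line count of the last file read
  for (i = 1; i <= FNR; i++) {
    print arr[i]
  }
}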

Then simply loop over your files:

# merge.sh
#

for features in *.features.tab
do

  sample="${features%.features.tab}"
  scores="$sample.scores.tab"
  final="$sample.final.tab"

  awk -f test.awk "$features" "$scores" > "$final"
done

Upvotes: 0
