Nonopov

Reputation: 1

How to concatenate many files using their basenames?

I study genetic data from 288 fish samples (Fish_one, Fish_two, ...). I have four files per fish, each with a different suffix, e.g. for sample_name Fish_one:

file 1 = "Fish_one.1.fq.gz"
file 2 = "Fish_one.2.fq.gz"
file 3 = "Fish_one.rem.1.fq.gz"
file 4 = "Fish_one.rem.2.fq.gz"

I would like to apply the following concatenate instructions to all my samples, using maybe a text file containing a list of all the sample_name, that would be provided to a loop?

cp sample_name.1.fq.gz sample_name.fq.gz 
cat sample_name.2.fq.gz >> sample_name.fq.gz
cat sample_name.rem.1.fq.gz >> sample_name.fq.gz
cat sample_name.rem.2.fq.gz >> sample_name.fq.gz

In the end, I would have only one file per sample, ideally in a different folder. I would be very grateful to receive a bit of help on this one, even though I'm sure the answer is quite simple for a non-novice!

Many thanks,

Noé

Upvotes: 0

Views: 128

Answers (1)

John Bollinger

Reputation: 180113

I would like to apply the following concatenate instructions to all my samples, using maybe a text file containing a list of all the sample_name, that would be provided to a loop?

In the first place, the name of the cat command is a mnemonic for "concatenate". It accepts multiple command-line arguments naming sources to concatenate together to the standard output, which is exactly what you want to do. It is poor form to use a cp and three cats where a single cat would do.
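As a minimal sketch of that point, the four commands from the question collapse into one cat per sample. The function name concat_sample is a hypothetical helper introduced only for illustration, and the sketch assumes the four input files sit in the current directory:

```shell
# Concatenate the four per-sample files, in order, into <sample>.fq.gz.
# concat_sample is a hypothetical helper name, not part of any tool.
concat_sample() {
  sample=$1
  cat "$sample.1.fq.gz" "$sample.2.fq.gz" \
      "$sample.rem.1.fq.gz" "$sample.rem.2.fq.gz" > "$sample.fq.gz"
}

# Example call, assuming the four Fish_one files are in the current directory:
# concat_sample Fish_one
```

The order of the arguments determines the order of the data in the output, so it matches the cp-then-append sequence in the question.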

In the second place, although you certainly could use a file of name stems to drive the operation you describe, it's likely that you don't need to go to the trouble to create or maintain such a file. Globbing will probably do the job satisfactorily. As long as there aren't any name stems that need to be excluded, then, I'd probably go with something like this:

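# Requires a shell with brace expansion (e.g. bash).
# other_dir must already be set to an existing output directory.
for f in *.rem.1.fq.gz; do
  stem=${f%.rem.1.fq.gz}
  cat "$stem".{1,2,rem.1,rem.2}.fq.gz > "${other_dir}/${stem}.fq.gz"
done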

That recognizes the groups present in the current working directory by the members whose names end with .rem.1.fq.gz. It extracts the common name stem from that member's name, then concatenates the four members to the correspondingly-named output file in the directory identified by ${other_dir}. It relies on brace expansion to form the arguments to cat, so as to minimize code and (IMO) improve clarity.
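If some name stems do need to be excluded, the list-driven approach the question asks about is a small variation. The following is only a sketch under stated assumptions: the function name concat_from_list, the list file with one sample name per line, and the output directory argument are all illustrative and not taken from the question. It spells out the four file names instead of using brace expansion so that it also runs in a plain POSIX sh.

```shell
# concat_from_list LIST OUT_DIR
# Reads one sample name per line from LIST and writes OUT_DIR/<name>.fq.gz.
# The function name and its arguments are illustrative assumptions.
concat_from_list() {
  list=$1
  out_dir=$2
  mkdir -p "$out_dir" || return 1
  # The "|| [ -n ... ]" part handles a last line without a trailing newline.
  while IFS= read -r stem || [ -n "$stem" ]; do
    [ -n "$stem" ] || continue            # skip blank lines
    if [ ! -e "$stem.1.fq.gz" ]; then     # only the first input is checked
      echo "skipping $stem: $stem.1.fq.gz not found" >&2
      continue
    fi
    cat "$stem.1.fq.gz" "$stem.2.fq.gz" \
        "$stem.rem.1.fq.gz" "$stem.rem.2.fq.gz" > "$out_dir/$stem.fq.gz"
  done < "$list"
}
```

A call such as `concat_from_list samples.txt merged` would then process only the samples named in samples.txt, where both names are placeholders.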

Upvotes: 1
