Chrys

Reputation: 313

Concatenate files with cat from a Python script

I have a folder full of files whose names look like this:

"Code1_B1_1.1.fq.gz"
"Code1_B1_2.2.fq.gz"
"Code1_B2_1.1.fq.gz"
"Code1_B2_2.2.fq.gz"
...
"Code5_B1_1.1.fq.gz"
"Code5_B1_2.2.fq.gz"
"Code5_B2_1.1.fq.gz"
...
...

etc.

These are DNA sequences. I want to concatenate these files according to the Code number AND the extension. Thus, for example, my files "Code1_B1_1.1.fq.gz" and "Code1_B2_1.1.fq.gz" will be merged into a single "Code1_both_1.1.fq.gz".

Using bash (as a novice), I found out how to list the files I need to concatenate, for example:

ls | grep -E "Code1.*.1.1.fq.gz"

But how can I concatenate them afterwards? I wanted to simply use the cat command and save the output into a new file, but how do I retrieve the files I was able to list with ls?

Also, ultimately, I would like to perform the whole thing from a Python script that would automatically merge all my files according to my two criteria (Code and extension) :)

Thank you in advance for your help!

Chrys

Upvotes: 1

Views: 256

Answers (2)

Pranam Bhat

Reputation: 65

List all the files, grep for the ones you want, and store the result in a file:

ls -ltra | egrep -e 'Code1_B1_1.1.fq.gz|Code1_B1_2.2.fq.gz|Code1_B2_1.1.fq.gz|Code1_B2_2.2.fq.gz' > filename

Or move them into a zip archive (note that -m deletes the original files once they are archived):

ls | zip -@m filename.zip

Upvotes: 0

Charles Duffy

Reputation: 295383

ls output is for human use, not programmatic consumption; see Why you shouldn't parse the output of ls.

Instead, use a glob expression to form a list of filenames:

zcat Code1*1.1.fq.gz >outfile

...or...

gunzip -c Code1*1.1.fq.gz >outfile

If you need to quote parts of this name for some reason, you can do that so long as you don't quote the * (or any other glob-expression metacharacter):

gunzip -c "Code1"*"1.1.fq.gz"
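For the Python part of the question, the grouping by Code and extension could be sketched roughly as below. This is only an illustration, not a tested pipeline: the helper names (group_files, merge_groups) and the NAME_RE pattern are hypothetical and assume every input file follows the Code<N>_B<M>_<suffix> scheme shown in the question. Unlike the zcat and gunzip -c commands above, which write decompressed output, the sketch copies the compressed bytes unchanged; a gzip file may consist of several concatenated members, so the result is still a valid .gz file.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme, taken from the question: Code<N>_B<M>_<suffix>,
# e.g. "Code1_B2_1.1.fq.gz" -> code "Code1", suffix "1.1.fq.gz".
# Merged outputs ("Code1_both_1.1.fq.gz") deliberately do not match.
NAME_RE = re.compile(r"^(Code\d+)_B\d+_(\d+\.\d+\.fq\.gz)$")


def group_files(folder):
    """Map (code, suffix) -> sorted list of matching paths in folder."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        match = NAME_RE.match(path.name)
        if match:
            groups[match.groups()].append(path)
    return groups


def merge_groups(folder):
    """Concatenate each group into <code>_both_<suffix>; return the outputs."""
    outputs = []
    for (code, suffix), paths in sorted(group_files(folder).items()):
        target = Path(folder) / f"{code}_both_{suffix}"
        with open(target, "wb") as out:
            for path in paths:
                # Copy the compressed bytes as-is; no decompression needed.
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        outputs.append(target)
    return outputs
```

Calling merge_groups("path/to/folder") would then write one merged file per (Code, extension) pair next to the inputs. Matching the full code with a regular expression also keeps Code1 separate from names such as Code10, which a bare Code1* glob would pick up as well.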

Note that glob expressions are a bit different from regular expressions. In a regex, . is a special character that matches any single character, so grep -E "Code1.*.1.1.fq.gz" would also match Code1AB1C1DfqEgz as a valid name, since each and every . in the expression is treated that way. In globs, . is not special, and * means zero-or-more-of-anything (as opposed to the regex meaning, zero-or-more-of-the-last-thing).
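The difference can be checked from Python with the standard re and fnmatch modules. The snippet below is a small illustration only: the list of names and the escaped strict_regex variant are made up for the demonstration and are not part of the original commands.

```python
import fnmatch
import re

names = ["Code1_B1_1.1.fq.gz", "Code1AB1C1DfqEgz", "Code1_B1_2.2.fq.gz"]

# The regex from the question: every unescaped "." matches any character.
loose_regex = re.compile(r"Code1.*.1.1.fq.gz")
# Illustrative variant with the dots escaped so they are literal.
strict_regex = re.compile(r"Code1.*1\.1\.fq\.gz")
# The glob from the answer: "." is literal, "*" matches anything.
glob_pattern = "Code1*1.1.fq.gz"

loose_hits = [n for n in names if loose_regex.search(n)]
strict_hits = [n for n in names if strict_regex.search(n)]
glob_hits = [n for n in names if fnmatch.fnmatchcase(n, glob_pattern)]

print(loose_hits)   # ['Code1_B1_1.1.fq.gz', 'Code1AB1C1DfqEgz']
print(strict_hits)  # ['Code1_B1_1.1.fq.gz']
print(glob_hits)    # ['Code1_B1_1.1.fq.gz']
```

Only the unescaped regex accepts the bogus name; the glob and the escaped regex agree with each other.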

Upvotes: 1
