Steve Robbins

Reputation: 31

Running GNU parallel from input file while modifying colsep input

I'm a novice with GNU parallel and I'm only semi-knowledgeable about bash in general so I would really appreciate some advice.

I want to read line by line through an input file containing a file path in the first column and the path to a second file in the second column, and for each line use the columns as input in a command. However, I need to replace part of the file name in column one to make my command work.

The file would look like this, two file paths separated by tabs:

path_to_file/filename1_combined_R1_001.bam \t path_to_file/filename1.fna
path_to_file/filename2_combined_R1_001.bam \t path_to_file/filename2.fna

What I need to do is remove the string "_R1_001.bam" from column one and replace it with my own string (e.g. _R1_fastq) to invoke a script called removeM. FYI, I'm not sure if I'm using --colsep correctly. The command is as follows:

parallel -j10 --colsep '\t' input_file.tsv removeM -1 {1}_R1.fastq -2 {1}_R2.fastq -i {2}  -f CoralRemoved_{1}_R1.fastq -r CoralRemoved_{}_R2.fastq

As far as I can tell I could use basename removal (something like {1.} ) but I can't figure out how to remove more than just the extension (.bam).

Thank you in advance.

Upvotes: 1

Views: 1001

Answers (3)

Ole Tange

Reputation: 33740

This does not answer the full question, so treat it as a comment.

Version 20170322 introduced dynamic replacement strings, which might be useful here.

A dynamic replacement string is a --rpl definition that takes an argument. The argument is grabbed with () in the replacement string and used in the code to run as $$1 (and $$2, $$3 ... if there are more ()-groups). Here are a few examples that each correspond to a Bash parameter expansion:

# Bash ${a:-myval}                                     
--rpl '{:-([^}]+?)} $_ ||= $$1',
# Bash ${a:2}                                                                      
--rpl '{:(\d+?)} substr($_,0,$$1) = ""',
# Bash ${a:2:3}                                                                    
--rpl '{:(\d+?):(\d+?)} $_ = substr($_,$$1,$$2);',
# Bash ${a#bc}                                                                     
--rpl '{#([^#][^}]*?)} s/^$$1//;',
# Bash ${a%def}                                                                    
--rpl '{%([^}]+?)} s/$$1$//;',
# Bash ${a/def/ghi} ${a/def/}                                                      
--rpl '{/([^}]+?)/([^}]*?)} s/$$1/$$2/;',
# Bash ${a^a}                                                                      
--rpl '{^([^}]+?)} s/^($$1)/uc($1)/e;',
# Bash ${a^^a}                                                                     
--rpl '{^^([^}]+?)} s/($$1)/uc($1)/eg;',
# Bash ${a,A}                                                                      
--rpl '{,([^}]+?)} s/^($$1)/lc($1)/e;',
# Bash ${a,,A}                                                                     
--rpl '{,,([^}]+?)} s/($$1)/lc($1)/eg;',

By the way, these are all enabled if you use --plus.
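For comparison, here is a minimal sketch of the plain shell expansions that some of the comments above refer to. The file name is just the sample from the question, and the variable names are made up for illustration. Note one difference: the shell treats the pattern as a glob, while the --rpl definitions above treat it as a regexp.

```shell
# Hypothetical file name, taken from the question's sample input.
a=filename1_combined_R1_001.bam

stem=${a%_R1_001.bam}        # remove a suffix, like {%_R1_001.bam}
rest=${a#filename1_}         # remove a prefix, like {#filename1_}
unset missing
fallback=${missing:-myval}   # use a default if unset or empty, like {:-myval}

echo "$stem"               # filename1_combined
echo "$rest"               # combined_R1_001.bam
echo "$fallback"           # myval
echo "${stem}_R1.fastq"    # filename1_combined_R1.fastq
```

Only the POSIX forms are shown here, so this should also work in shells other than Bash.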

So to remove a string (or more accurately: a regexp) from the end you can use:

$ parallel --plus echo {%_R1_001.bam} ::: MyOrganism_R1_001.bam
MyOrganism

Or to replace a string:

$ parallel --plus echo {/_R1_001.bam/_R1.fastq.gz} ::: MyOrganism_R1_001.bam
MyOrganism_R1.fastq.gz

Or you could define your own replacement string that takes the number of trailing parts (each starting with . or _) to remove:

$ parallel --rpl '{_(\d+)} s/([_.][^_.]*){$$1}$//' \
   echo {_1} {_2} {_3} ::: filename2_combined_R1_001.bam
filename2_combined_R1_001 filename2_combined_R1 filename2_combined

You could then have this --rpl definition in your ~/.parallel/config.

Upvotes: 0

Steve Robbins

Reputation: 31

I ended up figuring this out myself. I used --colsep to split each line into fields and then a regex to replace the string. The 1 right after the opening {= selects the first field, and the Perl expression between {= and =} does the string replacement.

parallel -j10 --colsep '\t' -a $2 removeM -1 bamToFastq_{=1s/_R1_001.bam//=}_R1.fastq.gz -2 bamToFastq_{=1s/_R1_001.bam//=}_R2.fastq.gz -i {2} -f CoralRemoved_bamToFastq_{1}_R1.fastq -r CoralRemoved_bamToFastq_{1}_R2.fastq
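A quick way to check a command like this before running it for real is --dry-run, which prints the generated commands instead of executing them. The following is only a sketch under some assumptions: the sample line is made up to match the question's layout, the removeM options are cut down to three, and the regex is written with an escaped dot and a $ anchor so it only matches the literal suffix at the end of the field.

```shell
# Hypothetical one-line sample input (tab-separated), mirroring the question.
tsv=$(mktemp)
printf 'path_to_file/filename1_combined_R1_001.bam\tpath_to_file/filename1.fna\n' > "$tsv"

cmds=""
# Only attempt this if GNU parallel is installed.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # {=1 ... =} applies the Perl substitution to column 1 only;
  # {2} is column 2, unchanged. Nothing is executed because of --dry-run.
  cmds=$(parallel --dry-run --colsep '\t' -a "$tsv" \
    removeM -1 '{=1 s/_R1_001\.bam$// =}_R1.fastq' \
            -2 '{=1 s/_R1_001\.bam$// =}_R2.fastq' -i {2})
fi
printf '%s\n' "$cmds"
rm -f "$tsv"
```

The single quotes around the {= =} parts keep the shell from touching the backslash and the $ before GNU parallel sees them.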

Upvotes: 2

Mark Setchell

Reputation: 207798

I am having a hard time understanding the exact command you want to run, but I think you can probably alter the file, with sed, as you feed it into GNU Parallel like this:

sed 's/_R1_001.bam/_R1_fastq/' input_file.tsv | parallel -j10 --colsep '\t' removeM ...

Note that this will not permanently alter your file input_file.tsv. Instead, it modifies the contents on the fly as it passes them to GNU Parallel.

Note also that you can see what it is doing if you just run:

sed 's/_R1_001.bam/_R1_fastq/' input_file.tsv
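As a small self-contained sketch of that preview step, assuming a made-up one-line file in the same tab-separated layout as the question (the temporary file and variable names are only for illustration):

```shell
# Hypothetical sample input with the same layout as the question.
tmp=$(mktemp)
printf 'path_to_file/filename1_combined_R1_001.bam\tpath_to_file/filename1.fna\n' > "$tmp"

# Rewrite the suffix and keep only the first column for inspection.
# The dot is escaped here so that it matches a literal "." only.
preview=$(sed 's/_R1_001\.bam/_R1_fastq/' "$tmp" | cut -f1)
echo "$preview"    # path_to_file/filename1_combined_R1_fastq

rm -f "$tmp"
```

The original file on disk is left untouched; only the stream going through the pipe is changed.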

Upvotes: 0
