Reputation: 31
I'm a novice with GNU parallel and I'm only semi-knowledgeable about bash in general so I would really appreciate some advice.
I want to read line by line through an input file containing a file path in the first column and the path to a second file in the second column, and for each line use the columns as input in a command. However, I need to replace part of the file name in column one to make my command work.
The file would look like this, two file paths separated by tabs:
path_to_file/filename1_combined_R1_001.bam \t path_to_file/filename1.fna
path_to_file/filename2_combined_R1_001.bam \t path_to_file/filename2.fna
What I need to do is remove the string "_R1_001.bam" from column one and replace it with my own string (e.g. _R1_fastq) to invoke a script called removeM. FYI, I'm not sure if I'm using --colsep correctly. The command is as follows:
parallel -j10 --colsep '\t' input_file.tsv removeM -1 {1}_R1.fastq -2 {1}_R2.fastq -i {2} -f CoralRemoved_{1}_R1.fastq -r CoralRemoved_{}_R2.fastq
As far as I can tell I could use basename removal (something like {1.} ) but I can't figure out how to remove more than just the extension (.bam).
Thank you in advance.
Upvotes: 1
Views: 1001
Reputation: 33740
This does not answer the full question, so treat it as a comment.
Version 20170322 introduced dynamic replacement strings, which might be useful here.
A dynamic replacement string is a --rpl definition that takes an argument. The argument is grabbed with () in the replacement string and used in the code to run as $$1 (and $$2, $$3 ... if there are more ()-groups). Here are a few examples that each correspond to a Bash parameter expansion:
# Bash ${a:-myval}
--rpl '{:-([^}]+?)} $_ ||= $$1',
# Bash ${a:2}
--rpl '{:(\d+?)} substr($_,0,$$1) = ""',
# Bash ${a:2:3}
--rpl '{:(\d+?):(\d+?)} $_ = substr($_,$$1,$$2);',
# Bash ${a#bc}
--rpl '{#([^#][^}]*?)} s/^$$1//;',
# Bash ${a%def}
--rpl '{%([^}]+?)} s/$$1$//;',
# Bash ${a/def/ghi} ${a/def/}
--rpl '{/([^}]+?)/([^}]*?)} s/$$1/$$2/;',
# Bash ${a^a}
--rpl '{^([^}]+?)} s/^($$1)/uc($1)/e;',
# Bash ${a^^a}
--rpl '{^^([^}]+?)} s/($$1)/uc($1)/eg;',
# Bash ${a,A}
--rpl '{,([^}]+?)} s/^($$1)/lc($1)/e;',
# Bash ${a,,A}
--rpl '{,,([^}]+?)} s/($$1)/lc($1)/eg;',
These are, by the way, enabled if you use --plus.
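For comparison, here are some of the Bash parameter expansions that the definitions above mirror, tried directly in a shell. This is only a minimal sketch: the variable names are made up for illustration and the value is the example file name from the question. Note that Bash matches glob patterns here, whereas the --rpl definitions use Perl regexps.

```shell
#!/usr/bin/env bash
# Illustrative variable; the value is the example file name from the question.
a='filename1_combined_R1_001.bam'

# ${a%pattern}: remove the shortest match of pattern from the end
stripped=${a%_R1_001.bam}

# ${a/pattern/replacement}: replace the first match of pattern
replaced=${a/_R1_001.bam/_R1.fastq}

# ${a:2:3}: substring of length 3 starting at offset 2
sub=${a:2:3}

echo "$stripped"   # filename1_combined
echo "$replaced"   # filename1_combined_R1.fastq
echo "$sub"        # len
```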
So to remove a string (or more accurately: a regexp) from the end you can use:
$ parallel --plus echo {%_R1_001.bam} ::: MyOrganism_R1_001.bam
MyOrganism
Or to replace a string:
$ parallel --plus echo {/_R1_001.bam/_R1.fastq.gz} ::: MyOrganism_R1_001.bam
MyOrganism_R1.fastq.gz
Or you could make your own where you expressed how many .'s or _'s you wanted to remove:
$ parallel --rpl '{_(\d+)} s/([_.][^_.]*){$$1}$//' \
echo {_1} {_2} {_3} ::: filename2_combined_R1_001.bam
filename2_combined_R1_001 filename2_combined_R1 filename2_combined
You could then have this --rpl definition in your ~/.parallel/config.
Upvotes: 0
Reputation: 31
I ended up figuring this out for myself. I used --colsep to split each line into fields and then a regex to replace the string. The 1 right after the opening {= selects the first field, while the Perl expression between {= and =} does the string replacement.
parallel -j10 --colsep '\t' -a $2 removeM -1 bamToFastq_{=1s/_R1_001.bam//=}_R1.fastq.gz -2 bamToFastq_{=1s/_R1_001.bam//=}_R2.fastq.gz -i {2} -f CoralRemoved_bamToFastq_{1}_R1.fastq -r CoralRemoved_bamToFastq_{1}_R2.fastq
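To check what that kind of substitution produces for each input line without running anything, a plain-shell loop can mimic it. The sketch below is not GNU Parallel: it only splits each line on the tab, strips the suffix from the first field, and prints a simplified removeM command line. The sample path and the reduced argument list are assumptions for illustration.

```shell
# Sample input in the same two-column, tab-separated layout (paths are made up)
tsv=$(mktemp)
printf 'dir/s1_combined_R1_001.bam\tdir/s1.fna\n' > "$tsv"

tab=$(printf '\t')   # a literal tab character to split fields on

cmds=$(
  while IFS="$tab" read -r bam fna; do
    # Strip the unwanted suffix from field 1, like the {=1 ... =} expression
    prefix=${bam%_R1_001.bam}
    # Print the command instead of running it (simplified argument list)
    echo "removeM -1 ${prefix}_R1.fastq -2 ${prefix}_R2.fastq -i ${fna}"
  done < "$tsv"
)

printf '%s\n' "$cmds"
rm -f "$tsv"
```

If the printed lines look right, the same prefix logic can go back into the parallel call.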
Upvotes: 2
Reputation: 207798
I am having a hard time understanding the exact command you want to run, but I think you can probably alter the file with sed as you feed it into GNU Parallel, like this:
sed 's/_R1_001.bam/_R1_fastq/' input_file.tsv | parallel -j10 --colsep '\t' removeM ...
Note that this will not permanently alter your file input_file.tsv; instead, it modifies the contents on the fly as they are passed to GNU Parallel.
Note also that you can see what it is doing if you just run:
sed 's/_R1_001.bam/_R1_fastq/' input_file.tsv
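As a small self-contained check of that behaviour, the sketch below builds a one-line sample file (the path is made up), runs the same kind of sed substitution on it, and then shows that the file on disk is unchanged. The dot is escaped here so that it only matches a literal ".".

```shell
# Build a small sample file like the one in the question (paths are made up)
tmp=$(mktemp)
printf 'dir/sample1_combined_R1_001.bam\tdir/sample1.fna\n' > "$tmp"

# sed writes the edited line to stdout; this is what GNU Parallel would read
edited=$(sed 's/_R1_001\.bam/_R1_fastq/' "$tmp")

# The file on disk still has the original text
original=$(cat "$tmp")

printf '%s\n' "$edited"
printf '%s\n' "$original"
rm -f "$tmp"
```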
Upvotes: 0