Reputation: 37
I have over 100 files in a directory, named in the format xxx_1_sequence.fastq.gz and xxx_2_sequence.fastq.gz.
The goal is to create a tab-separated file with 3 columns in this format:
xxx ---> xxx_1_sequence.fastq.gz ---> xxx_2_sequence.fastq.gz
where ---> is a tab.
I was thinking of creating a for loop or maybe using string manipulation in order to achieve this. My knowledge is rudimentary at this stage, so any help would be much appreciated.
Upvotes: 0
Views: 238
Reputation: 37
Thank you for the help, guys. I was thrown into a coding position a week ago with no prior experience and have been struggling.
I ended up with this:
printf "%s\n" *_1_sequence.fastq.gz | sort | sed 's/\(.*\)_1_sequence.fastq.gz/\1\t\1_1_sequence.fastq.gz\t\1_2_sequence.fastq.gz/' > NULLARBORformat.tab
and it does the job perfectly!
Upvotes: 0
Reputation: 22032
Would you please try the following:
suffix="sequence.fastq.gz"
for f1 in *_1_"$suffix"; do # every first-of-pair file
    f=${f1%_1_"$suffix"} # strip _1_sequence.fastq.gz to get the sample name
    if [[ -f $f1 && -f ${f}_2_$suffix ]]; then
        # check the existence of both files just in case
        printf "%s\t%s\t%s\n" "$f" "$f1" "${f}_2_$suffix"
    fi
done
Upvotes: 1
Reputation: 6058
If your files are in a directory called files:
paste -d '\t' \
<(printf "%s\n" files/*_1_sequence.fastq.gz | sort) \
<(printf "%s\n" files/*_2_sequence.fastq.gz | sort) \
| sed 's/\(.*\)_1_sequence.fastq.gz/\1\t\1_1_sequence.fastq.gz/' \
> out.tsv
Explanation:
printf "%s\n" prints every argument on its own line. So:
printf "%s\n" files/*_1_sequence.fastq.gz | sort
prints a sorted list of the first type of files (the second column in your output). The same applies symmetrically to *_2_sequence.fastq.gz (the third column).
(We probably don't need the sort part, because the shell already expands globs in sorted order, but it makes the intention explicit.)
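To see the printf behavior in isolation, here is a minimal sketch. The two file names are made-up placeholders passed as plain arguments, so no real files are needed:

```shell
# printf reuses its format string until every argument has been consumed,
# so each argument ends up on its own line.
listing=$(printf '%s\n' b_1_sequence.fastq.gz a_1_sequence.fastq.gz)
printf '%s\n' "$listing"

# Piping through sort orders the lines explicitly.
sorted=$(printf '%s\n' "$listing" | sort)
printf '%s\n' "$sorted"
```

The first listing keeps the order of the arguments. The second one starts with the a_ name.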
The syntax <(some shell command) runs some shell command, connects its output to a temporary file-like path (a pipe, which shows up under /dev/fd on Linux), and passes that path as an argument. You can see the temporary path like so:
$ echo <(echo a) <(echo b)
/dev/fd/63 /dev/fd/62
So we are passing 2 (temporary) files to paste. If each input file has N lines, then paste outputs N lines, where line number K is the concatenation of line K of each of the files, in order.
For example, if line 4 of the first file is hello and line 4 of the second file is world, paste will have hello\tworld as line 4 of the output. TAB is already paste's default delimiter, but instead of trusting the default, we're setting it explicitly with -d '\t'.
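The line-by-line pairing can be tried on its own. This is only a sketch: it uses two small ordinary files in a temporary directory (hypothetical names first and second) in place of the process substitutions, which should behave the same way as far as paste is concerned:

```shell
# Two small hypothetical input files standing in for the two sorted listings.
tmpdir=$(mktemp -d)
printf '%s\n' a_1_sequence.fastq.gz b_1_sequence.fastq.gz > "$tmpdir/first"
printf '%s\n' a_2_sequence.fastq.gz b_2_sequence.fastq.gz > "$tmpdir/second"

# Line K of the output is line K of each file, joined by a tab.
pasted=$(paste -d '\t' "$tmpdir/first" "$tmpdir/second")
printf '%s\n' "$pasted"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

Each output line holds one _1_ name and the matching _2_ name, separated by a tab.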
That gives us the last 2 columns of our tab-separated-values file, but the first column is the * part of *_1_sequence.fastq.gz, which is where sed comes in.
We tell sed to replace \(.*\)_1_sequence.fastq.gz with \1\t\1_1_sequence.fastq.gz. .* will match anything, and \(some-pattern\) tells sed to remember the text that matched the pattern.
The text captured by the first pair of parentheses in sed's regex can be read back in the replacement as \1, which is why we have \1_1_sequence.fastq.gz in the replacement pattern.
But now we can also use \1 to create the first column of our tsv, which is why we have \1\t.
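Here is a small sketch of the substitution applied to a single made-up line. Two things differ from the command above and are my own choices rather than part of the answer: the dots in the pattern are escaped so they match only literal dots, and a literal tab held in a variable (tab, a hypothetical name) replaces \t, because \t in a sed replacement is a GNU extension and may not work with other sed implementations:

```shell
# A literal tab character, so the example does not rely on GNU sed's \t.
tab=$(printf '\t')

# One line as paste would produce it: the _1_ name, a tab, the _2_ name.
line="xxx_1_sequence.fastq.gz${tab}xxx_2_sequence.fastq.gz"

# \(.*\) captures everything before _1_sequence.fastq.gz; \1 reuses it twice,
# once as the new first column and once to rebuild the original file name.
result=$(printf '%s\n' "$line" |
    sed "s/\(.*\)_1_sequence\.fastq\.gz/\1${tab}\1_1_sequence.fastq.gz/")
printf '%s\n' "$result"
```

The result has three tab-separated columns: xxx, xxx_1_sequence.fastq.gz and xxx_2_sequence.fastq.gz.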
Upvotes: 0