thewhitestferret

Reputation: 37

Creating a 3-column TAB file from the names of files in a directory

I have over 100 files in a directory, named in the format xxx_1_sequence.fastq.gz and xxx_2_sequence.fastq.gz.

The goal is to create a TAB file with 3 columns in this format:

xxx ---> xxx_1_sequence.fastq.gz ---> xxx_2_sequence.fastq.gz

where ---> is a tab.

I was thinking of creating a for loop or maybe using string manipulation in order to achieve this. My knowledge is rudimentary at this stage, so any help would be much appreciated.

Upvotes: 0

Views: 238

Answers (3)

thewhitestferret

Reputation: 37

Thank you for the help, guys. I was thrown into a coding position a week ago with no prior experience and have been struggling.

I ended up with this:

printf "%s\n" *_1_sequence.fastq.gz | sort | sed 's/\(.*\)_1_sequence.fastq.gz/\1\t\1_1_sequence.fastq.gz\t\1_2_sequence.fastq.gz/' > NULLARBORformat.tab

and it does the job perfectly!
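
One thing the one-liner above does not do is look at the _2_ files: the third column is built purely from the name of the _1_ file. A minimal sketch of a sanity check, assuming the table was written in the same directory as the fastq files; check_pairs is a hypothetical helper name, not part of any tool:

```shell
# check_pairs (hypothetical helper): print every row of a 3-column TAB file
# whose third-column file does not exist in the current directory.
check_pairs() {
    tab=$(printf '\t')                      # literal TAB used as field separator
    while IFS="$tab" read -r name first second; do
        [ -f "$second" ] || printf 'missing pair for %s: %s\n' "$name" "$second"
    done < "$1"
}

# Usage, run from the directory that holds the fastq files:
# check_pairs NULLARBORformat.tab
```

If it prints nothing, every row points at a second file that really exists.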

Upvotes: 0

tshiono

Reputation: 22032

Would you please try the following:

suffix="sequence.fastq.gz"
for first in *_1_"$suffix"; do  # loop over the first file of each pair
    f=${first%_1_"$suffix"}     # strip the suffix to get the xxx prefix
    if [[ -f ${f}_1_$suffix && -f ${f}_2_$suffix ]]; then
                                # check the existence of the files just in case
        printf "%s\t%s\t%s\n" "$f" "${f}_1_$suffix" "${f}_2_$suffix"
    fi
done

Upvotes: 1

root

Reputation: 6058

If your files are in a directory called files:

paste -d '\t' \
    <(printf "%s\n" files/*_1_sequence.fastq.gz | sort) \
    <(printf "%s\n" files/*_2_sequence.fastq.gz | sort) \
    | sed 's/\(.*\)_1_sequence.fastq.gz/\1\t\1_1_sequence.fastq.gz/' \
    > out.tsv

Explanation:

printf "%s\n" prints each argument on its own line. So:

printf "%s\n" files/*_1_sequence.fastq.gz | sort

prints a sorted list of the first type of files (the second column in your output). And of course it's symmetrical with *_2_sequence.fastq.gz (the third column).

(We don't strictly need the sort part, because the shell already expands globs in sorted order, but it helps clarify the intention.)
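
A quick way to convince yourself that the glob already comes out sorted is to compare it with and without sort in a scratch directory. This is only a sketch; the a/b/c sample names are made up for the demonstration:

```shell
tmpdir=$(mktemp -d)    # scratch directory, removed again at the end

# Create the files in a deliberately shuffled order.
touch "$tmpdir"/b_1_sequence.fastq.gz \
      "$tmpdir"/a_1_sequence.fastq.gz \
      "$tmpdir"/c_1_sequence.fastq.gz

# Same glob, once as the shell expands it and once piped through sort.
unsorted=$(cd "$tmpdir" && printf '%s\n' *_1_sequence.fastq.gz)
sorted=$(cd "$tmpdir" && printf '%s\n' *_1_sequence.fastq.gz | sort)

[ "$unsorted" = "$sorted" ] && echo "glob output is already sorted"
rm -r "$tmpdir"
```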

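The syntax <(some shell command) is called process substitution: bash runs some shell command, connects its output to a file descriptor (or a named pipe), and passes the path of that descriptor as an argument, so the receiving program can read it like a temporary input file. You can see the path like so: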

$ echo <(echo a) <(echo b)
/dev/fd/63 /dev/fd/62

So we are passing 2 (temporary) files to paste. If each input file has N lines, then paste outputs N lines, where line number K is a concatenation of line K of each of the files, in order.

For example, if line 4 of the first file is hello and line 4 of the second file is world, paste will have hello\tworld as line 4 of the output. TAB is already paste's default delimiter, but instead of relying on the default, we're setting it explicitly with -d '\t'.
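
The line-by-line pairing is easy to try with two throwaway inputs. A small sketch, assuming bash (process substitution is not available in plain sh); the words are arbitrary sample data:

```shell
#!/usr/bin/env bash
# Each <(...) acts as one input file for paste; line K of the first is
# joined to line K of the second with a TAB between them.
joined=$(paste -d '\t' <(printf '%s\n' hello foo) <(printf '%s\n' world bar))
printf '%s\n' "$joined"
```

This should print two lines, hello and world on the first and foo and bar on the second, each pair separated by a TAB.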

That gives us the last 2 columns of our tab-separated-values file, but the first column is the * part of *_1_sequence.fastq.gz, which is where sed comes in.

We tell sed to replace \(.*\)_1_sequence.fastq.gz with \1\t\1_1_sequence.fastq.gz. .* will match anything, and \(some-pattern\) tells sed to remember the text that matched the pattern.

The text matched by the first pair of parentheses in sed's regex can be read back in the replacement pattern as \1, which is why we have \1_1_sequence.fastq.gz in the replacement pattern.

But now we can also use \1 to create the first column of our tsv, which is why we have \1\t.
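
To see the substitution on its own, here is a sketch that feeds sed a single made-up line (sampleA is a hypothetical name). Two deliberate differences from the command above: the dots are escaped so they match literal dots only, and a literal TAB is passed in through a shell variable, because \t in the replacement is a GNU sed extension and other sed implementations may print a plain t instead.

```shell
tab=$(printf '\t')    # a literal TAB character

# One input line shaped like the paste output: file 1, TAB, file 2.
line=$(printf '%s\t%s\n' sampleA_1_sequence.fastq.gz sampleA_2_sequence.fastq.gz |
    sed "s/\(.*\)_1_sequence\.fastq\.gz/\1${tab}\1_1_sequence.fastq.gz/")

printf '%s\n' "$line"
```

The captured prefix ends up as a new first column, and the rest of the line (the TAB and the second file name) passes through untouched.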

Upvotes: 0
