Reputation: 43

Convert sequence list to fasta for multiple files

I have thousands of files, which are a list of sequence names followed by their sequence, one individual per line, something like this:

L.abdalai.LJAMM.14363.SanMartindeLosAndes        CCCTAAGAATAATTTGTT
L.carlosgarini.LJAMM.14070.LagunadelMaule        CCCTAAGAAT-ATTTGTT
L.cf.silvai.DD.038.Sarco                         CCCTAAGAAT-ATTTGTT

And I want to change them to fasta format, so looking something like:

>L.abdalai.LJAMM.14363.SanMartindeLosAndes       
CCCTAAGAATAATTTGTTCAGAAAAGATATTTAATTATAT
>L.carlosgarini.LJAMM.14070.LagunadelMaule
CCCTAAGAAT-ATTTGTTCAGAAAAGATATTTAATTATAT
>L.cf.silvai.DD.038.Sarco
CCCTAAGAAT-ATTTGTTCAGAAAAGATATTTAATTATAT

I work on a Mac.
Thanks!

Upvotes: 2

Answers (3)

Tyl

Reputation: 5252

I believe you simplied your sample input, thus different from your expected output.
If not so, and my solutions not work, please comment under my answer to let me know.

So with awk, you can do it like this:

awk -v OFS="\n" '$1=">" $1' file
>L.abdalai.LJAMM.14363.SanMartindeLosAndes
CCCTAAGAATAATTTGTT
>L.carlosgarini.LJAMM.14070.LagunadelMaule
CCCTAAGAAT-ATTTGTT
>L.cf.silvai.DD.038.Sarco
CCCTAAGAAT-ATTTGTT

If you want to change inplace, please install GNU gawk, and use gawk -i inplace ....
And if you want the line endings be Carriages, add/change to -v ORS="\r" -v OFS="\r"

However, you can also, and maybe it's better to do it with sed:

sed -e 's/\([^[:space:]]*\)[[:space:]]*\([^[:space:]]*\)/>\1\n\2/' file

Add -i'' like this: sed -i'' -e ... to change file inplace.

Upvotes: 2

stack0114106

Reputation: 8711

Using Perl

perl -pe 's/^/</;s/(\S+)\s+(\S+)/$1\n$2CAGAAAAGATATTTAATTATAT/g ' file

with your inputs

$ cat damien.txt
L.abdalai.LJAMM.14363.SanMartindeLosAndes        CCCTAAGAATAATTTGTT
L.carlosgarini.LJAMM.14070.LagunadelMaule        CCCTAAGAAT-ATTTGTT
L.cf.silvai.DD.038.Sarco                         CCCTAAGAAT-ATTTGTT

$ perl -pe 's/^/</;s/(\S+)\s+(\S+)/$1\n$2CAGAAAAGATATTTAATTATAT/g ' damien.txt
<L.abdalai.LJAMM.14363.SanMartindeLosAndes
CCCTAAGAATAATTTGTTCAGAAAAGATATTTAATTATAT
<L.carlosgarini.LJAMM.14070.LagunadelMaule
CCCTAAGAAT-ATTTGTTCAGAAAAGATATTTAATTATAT
<L.cf.silvai.DD.038.Sarco
CCCTAAGAAT-ATTTGTTCAGAAAAGATATTTAATTATAT

$

Upvotes: 2

RavinderSingh13

Reputation: 133518

Could you please try following(created and tested based on your samples, since I don't have mac to didn't test on it).

awk '/^L\./{print ">"$1 ORS $2 "CAGAAAAGATATTTAATTATAT"}'  Input_file

Output will be as follows. If needed you could take it to a output_file by appending > output_file to above command too.

>L.abdalai.LJAMM.14363.SanMartindeLosAndes
CCCTAAGAATAATTTGTTCAGAAAAGATATTTAATTATAT
>L.carlosgarini.LJAMM.14070.LagunadelMaule
CCCTAAGAAT-ATTTGTTCAGAAAAGATATTTAATTATAT
>L.cf.silvai.DD.038.Sarco
CCCTAAGAAT-ATTTGTTCAGAAAAGATATTTAATTATAT

Upvotes: 1

Convert sequence list to fasta for multiple files

Answers (3)

Related Questions