user1038055

Reputation: 79

concatenating multiple files

I have multiple files, and in each file is the following:

>HM001
ATGCT...
>HM002
ATGTC...
>HM003
ATGCC...

That is, each file contains one gene sequence for each of the species HM001 to HM050. I would like to concatenate all these files so that I end up with a single file containing the genome for species HM001 to HM050:

>HM001
ATGCT...ATGAA...ATGTT
>HM002
ATGTC...ATGCT...ATGCT
>HM003
ATGCC...ATGC...ATGAT

The ellipses are not actually required in the final file. I suppose cat should be used, but I'm not sure how. Any ideas would be appreciated.

Upvotes: 3

Views: 177

Answers (5)

glenn jackman

Reputation: 246764

Another awk implementation:

awk '
    # treat each ">name" line as the key; getline pulls in the next
    # line (the sequence), which is appended onto that key
    {key = $0; getline; value[key] = value[key] $0}
    # print each key followed by its concatenated sequence
    END {for (key in value) {print key; print value[key]}}
' file ...

Now, this will probably not output the keys in sorted order: awk array keys are inherently unordered. To ensure sorted output, use gawk and its asorti() function:

awk '
    {key = $0; getline; val[key] = val[key] $0}
    END {
        # asorti (a gawk extension) puts the indices of val, sorted, into keys
        n = asorti(val, keys)
        for (i = 1; i <= n; i++) {print keys[i]; print val[keys[i]]}
    }
' file ...

Upvotes: 0

whereswalden

Reputation: 4959

The simplest way I can think of is to use cat. For example (assuming you're on a *nix-type system):

cat file1 file2 file3 > outfile

Upvotes: 0

Anthony Horne

Reputation: 2522

What about appending them using echo, along these lines?

find . -type f -exec bash -c 'echo "append this" >> "$0"' {} \;

Source: https://stackoverflow.com/a/15604608/1662973

I would do it using "type", but that is MS-DOS. The above should work for you.

Upvotes: 0

r2evans

Reputation: 160417

Might I suggest converting your group of files into a CSV? It's almost exactly what you're suggesting, and is easily incorporated into just about any application for processing (e.g., Excel, R, Python).

Up front, I'll assume that all species names and gene sequences are simply alphanumeric, with no spaces or quote-like characters. I'm also assuming access to sed, sort, and uniq, which are all standard on *nix and Mac OS X, and easily available on Windows via MSYS or Cygwin, to name two.

First, generate an array of file names and species. I'm assuming the files are named file1, file2, etc. Just adjust the first line accordingly; it's just a glob, not an executed command.

FILES=(file*)
SPECIES=($(sed -ne 's/^>//gp' file* | sort | uniq))

This gives us one line per species, sorted, with no repeats. This ensures that our columns are independent and the set is complete.

Next, create a CSV header row with named columns, dumping it into a CSV file named csvfile:

echo -n "\"Species\"" > csvfile
for fn in ${FILES[@]} ; do echo -n ",\"${fn}\"" ; done >> csvfile
echo >> csvfile

Now iterate through each gene sequence and extract it from all files:

for sp in ${SPECIES[@]} ; do
    echo -n "\"${sp}\""
    for fn in ${FILES[@]}; do
        # the range />sp/,/^/ covers the header line plus the line after it;
        # the inner /^[^>]/p then prints only the sequence line
        ANS=$(sed -ne '/>'${sp}'/,/^/ { /^[^>]/p }' ${fn})
        echo -n ",\"${ANS}\""
    done
    echo
done >> csvfile

This works but is inefficient for larger data sets (i.e., large numbers of files and/or species). Better implementations (e.g., Python, Ruby, Perl, even R) would read each file once, forming an internally maintained matrix, dictionary, or associative array, and write out the CSV in one chunk.
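For reference, a minimal one-pass sketch of that idea in Python (assuming, as above, files named file1, file2, etc., one sequence line per ">" header, and no commas or quotes in the data):

import csv
import glob

files = sorted(glob.glob("file*"))   # same glob assumption as the shell version
table = {}                           # species -> {filename: sequence}

for fn in files:
    with open(fn) as fh:
        lines = [line.strip() for line in fh if line.strip()]
    # pair each ">species" header with the sequence line that follows it
    for header, seq in zip(lines[0::2], lines[1::2]):
        table.setdefault(header.lstrip(">"), {})[fn] = seq

with open("csvfile", "w", newline="") as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow(["Species"] + files)
    for species in sorted(table):
        writer.writerow([species] + [table[species].get(fn, "") for fn in files])

Each file is read exactly once, and a species missing from some file simply yields an empty cell.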

Upvotes: 0

jaypal singh

Reputation: 77085

Data parsing and formatting will be a lot easier with awk. With RS set to ">", each record is a species name followed by its sequence, so $1 is the name and $2 is the sequence; the ternary appends each new sequence to that species' entry. Try this:

awk -v RS=">" 'FNR>1{a[$1]=a[$1]?a[$1] FS $2:$2}END{for(x in a) print RS x ORS a[x]}' f1 f2 f3

For files like:

==> f1 <==
>HM001
ATGCT...
>HM002
ATGTC...
>HM003
ATGCC...

==> f2 <==
>HM001
ATGDD...
>HM002
ATGDD...
>HM003
ATGDD...

==> f3 <==
>HM001
ATGEE...
>HM002
ATGEE...
>HM003
ATGEE...

awk -v RS=">" 'FNR>1{a[$1]=a[$1]?a[$1] FS $2:$2}END{for(x in a) print RS x ORS a[x]}' f1 f2 f3
>HM001
ATGCT... ATGDD... ATGEE...
>HM002
ATGTC... ATGDD... ATGEE...
>HM003
ATGCC... ATGDD... ATGEE...

Upvotes: 3
