Reputation: 79
I have multiple files, and in each file is the following:
>HM001
ATGCT...
>HM002
ATGTC...
>HM003
ATGCC...
That is, each file contains one gene sequence for species HM001 to HM050. I would like to concatenate all these files, so I have a single file that contains the genome for species HM001 to HM050:
>HM001
ATGCT...ATGAA...ATGTT
>HM002
ATGTC...ATGCT...ATGCT
>HM003
ATGCC...ATGC...ATGAT
The ellipses are not actually required in the final file. I suppose cat should be used, but I'm not sure how. Any ideas would be appreciated.
Upvotes: 3
Views: 177
Reputation: 246764
Another awk implementation:
awk '
{key=$0; getline; value[key] = value[key] $0}
END {for (key in value) {print key; print value[key]}}
' file ...
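A quick way to sanity-check the one-liner is to run it on two tiny sample files (the names f1/f2 and the short sequences here are invented for the demo):

```shell
# Two throwaway sample files (names and contents are illustrative only)
printf '>HM001\nATGCT\n>HM002\nATGTC\n' > f1
printf '>HM001\nATGAA\n>HM002\nATGCC\n' > f2

# Each record is a two-line pair: getline pulls in the sequence line
# that follows each ">..." header, and sequences are concatenated
# per header with no separator
awk '
{key=$0; getline; value[key] = value[key] $0}
END {for (key in value) {print key; print value[key]}}
' f1 f2
```

Note that the order of `for (key in value)` is unspecified, which is exactly what the sorted gawk variant addresses.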
Now, this will probably not output the keys in sorted order: awk array keys are inherently unsorted. To ensure sorted output, use gawk and its asorti() function:
awk '
{key=$0; getline; val[key] = val[key] $0}
END {
n = asorti(val, keys)
for (i=1; i<=n; i++) {print keys[i]; print val[keys[i]]}
}
' file ...
Upvotes: 0
Reputation: 4959
The simplest way I can think of is to use cat. For example (assuming you're on a *nix-type system):
cat file1 file2 file3 > outfile
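A minimal sketch (file names assumed): note that cat appends the files verbatim, so a species header that appears in several files will appear that many times in the output; merging sequences under a single header still has to happen afterwards.

```shell
# Illustrative files; cat simply appends them end to end
printf '>HM001\nATGCT\n' > file1
printf '>HM001\nATGAA\n' > file2
cat file1 file2 > outfile
cat outfile
# >HM001
# ATGCT
# >HM001
# ATGAA
```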
Upvotes: 0
Reputation: 2522
What about appending them using echo - along these lines?:
find . -type f -exec bash -c 'echo "append this" >> "$0"' {} \;
Source: https://stackoverflow.com/a/15604608/1662973
On MSDOS I would do it with "type", but the above should work for you.
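A minimal sketch of that find/echo pattern, using a throwaway directory and a made-up marker line:

```shell
# Hypothetical demo directory and file name
mkdir -p demo && cd demo
printf '>HM001\nATGCT\n' > file1
# Append one line to every regular file under the current directory
find . -type f -exec bash -c 'echo ">end" >> "$0"' {} \;
tail -n1 file1   # >end
```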
Upvotes: 0
Reputation: 160417
Might I suggest converting your group of files into a CSV? It's almost exactly what you're suggesting, and is easily incorporated into just about any application for processing (e.g., Excel, R, python).
Up front, I'll assume that all species and gene sequences are simply alpha-numeric, with no spaces or quote-like characters. I'm also assuming access to sed, sort, and uniq, which are all standard in *nix and MacOSX, and easily accessible for windows via msys or cygwin, to name two.
First, generate an array of file names and species. I'm assuming the files are named file1, file2, etc. Just adjust the first line accordingly; it's just a glob, not an executed command.
FILES=(file*)
SPECIES=($(sed -ne 's/^>//gp' file* | sort | uniq))
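To see what those two lines produce, here is a hedged check with two sample files (names and sequences invented for the demo; bash is required for the arrays):

```shell
# Invented sample data matching the assumed file1/file2 naming
printf '>HM001\nATGCT\n>HM002\nATGTC\n' > file1
printf '>HM001\nATGAA\n>HM002\nATGCC\n' > file2
FILES=(file*)                                        # a glob, not a command
SPECIES=($(sed -ne 's/^>//gp' file* | sort | uniq))  # strip ">", dedupe
echo "${FILES[@]}"    # file1 file2
echo "${SPECIES[@]}"  # HM001 HM002
```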
This gives us one line per species, sorted, with no repeats. This ensures that our columns are independent and the set is complete.
Next, create a CSV header row with named columns, dumping it into a CSV file named csvfile:
echo -n "\"Species\"" > csvfile
for fn in ${FILES[@]} ; do echo -n ",\"${fn}\"" ; done >> csvfile
echo >> csvfile
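With two files, the header row comes out like this (file names assumed):

```shell
FILES=(file1 file2)                     # assumed file names
echo -n "\"Species\"" > csvfile
for fn in ${FILES[@]} ; do echo -n ",\"${fn}\"" ; done >> csvfile
echo >> csvfile
cat csvfile   # "Species","file1","file2"
```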
Now iterate through each gene sequence and extract it from all files:
for sp in ${SPECIES[@]} ; do
echo -n "\"${sp}\""
for fn in ${FILES[@]}; do
ANS=$(sed -ne '/>'${sp}'/,/^/ { /^[^>]/p }' ${fn})
echo -n ",\"${ANS}\""
done
echo
done >> csvfile
This works but is inefficient for larger data sets (i.e., large numbers of files and/or species). Better implementations (e.g., python, ruby, perl, even R) would read each file once, forming an internally-maintained matrix, dictionary, or associative array, and write out the CSV in one chunk.
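As a sketch of that single-pass idea using awk (the file names and sequences are invented, and the row order from `for (sp in row)` is unspecified):

```shell
# Sample input files (illustrative)
printf '>HM001\nATGCT\n>HM002\nATGTC\n' > file1
printf '>HM001\nATGAA\n>HM002\nATGCC\n' > file2

# One pass over all files: collect sequences per species, then emit CSV
awk '
/^>/ {sp=substr($0,2); next}            # header line: remember species
     {row[sp] = row[sp] ",\"" $0 "\""}  # sequence line: append as CSV cell
END  {
  printf "\"Species\""
  for (i=1; i<ARGC; i++) printf ",\"%s\"", ARGV[i]
  print ""
  for (sp in row) print "\"" sp "\"" row[sp]
}' file1 file2 > csvfile
```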
Upvotes: 0
Reputation: 77085
Data parsing and formatting will be a lot easier with awk. Try this:
awk -v RS=">" 'FNR>1{a[$1]=a[$1]?a[$1] FS $2:$2}END{for(x in a) print RS x ORS a[x]}' f1 f2 f3
For files like:
==> f1 <==
>HM001
ATGCT...
>HM002
ATGTC...
>HM003
ATGCC...
==> f2 <==
>HM001
ATGDD...
>HM002
ATGDD...
>HM003
ATGDD...
==> f3 <==
>HM001
ATGEE...
>HM002
ATGEE...
>HM003
ATGEE...
awk -v RS=">" 'FNR>1{a[$1]=a[$1]?a[$1] FS $2:$2}END{for(x in a) print RS x ORS a[x]}' f1 f2 f3
>HM001
ATGCT... ATGDD... ATGEE...
>HM002
ATGTC... ATGDD... ATGEE...
>HM003
ATGCC... ATGDD... ATGEE...
Upvotes: 3