Reputation: 1235
Using the Mac command line, I want to grep multiple files and save the output to different columns.
Here is an example of one of my files: ncbi.nlm.nih.gov/nuccore/NZ_AP013294 (click 'Send to', choose 'File', and save the format as 'full GenBank file'). I have ~2000 of these. I just want the LOCUS line (this is unique to each file and there is always exactly one of them) and then every occurrence of tRNA-, e.g. tRNA-Glu (tRNA- appears throughout the document; there could be anywhere from zero to ~200 of these).
This works when I do it on one of the files:
grep -e "LOCUS" -e "tRNA-" *.gbff > output.txt
but I would like the output.txt file to contain columns, one for each *.gbff file:
LOCUS XXXX LOCUS XXXX
tRNA-Glu tRNA-Asn
tRNA-Ser tRNA-Ile
tRNA-Glu
... etc.
Upvotes: 1
Views: 1605
Reputation: 203532
Your question is unclear, but given one possible interpretation of your requirements, this might be what you want:
awk '
/LOCUS/ {
    locus[++numCols] = $0
}
/tRNA/ {
    rnas[++rowNrs[numCols],numCols] = $0
    numRows = (rowNrs[numCols] > numRows ? rowNrs[numCols] : numRows)
}
END {
    for (colNr=1; colNr<=numCols; colNr++) {
        printf "%s%s", locus[colNr], (colNr<numCols ? OFS : ORS)
    }
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%s", rnas[rowNr,colNr], (colNr<numCols ? OFS : ORS)
        }
    }
}
' *.txt
Untested, of course, since you didn't provide sample input/output we could test against.
If that's not what you want, then whatever it is you DO want will be equally trivial, but you'll have to tell/show us what that is in your question first to get a solution for it.
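For reference, the padding behaviour this awk implements (shorter columns filled with blanks so every row has one cell per file) can be sketched in standard Python; the accession names and tRNA values below are hypothetical:

```python
from itertools import zip_longest

# Hypothetical per-file results: LOCUS accession -> ragged list of tRNA hits
columns = {
    "NZ_AAAA01": ["tRNA-Glu", "tRNA-Ser", "tRNA-Glu"],
    "NZ_BBBB01": ["tRNA-Asn", "tRNA-Ile"],
}

# Header row: one accession per column
print("\t".join(columns))

# Body: zip_longest pads the shorter columns with empty strings
for row in zip_longest(*columns.values(), fillvalue=""):
    print("\t".join(row))
```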
Upvotes: 1
Reputation: 1253
This type of processing is really beyond the capabilities of grep. You're better off creating a small customized script that does exactly what you want. I would use python and pandas to accomplish the task. Here is an implementation, based on my current understanding of your description.
import re
import sys

import pandas as pd

# Regular expressions; adjust as necessary
regex_locus = re.compile(r'LOCUS\s+(\S+)')
regex_trna = re.compile(r'"(tRNA-.*)"')

# Aggregate per-file results: one column per LOCUS
columns = {}
total_files = len(sys.argv[1:])

# Iterate over the files given on the command line
for index, file in enumerate(sys.argv[1:]):
    print("Processing {:4d}/{:4d}: {}".format(index + 1, total_files, file))
    # Per file variables.
    locus = None
    trna = []
    # Open the file for reading
    with open(file) as fd:
        # Find LOCUS value
        for line in fd:
            match = regex_locus.search(line)
            if match:
                locus = match.group(1)
                # Done with this section
                break
        # Find tRNA values
        for line in fd:
            match = regex_trna.search(line)
            if match:
                trna.append(match.group(1))
    # Done with this file
    if locus is not None:
        # Convert the list to a pandas Series to account for differing lengths.
        columns[locus] = pd.Series(trna)

# Build the DataFrame in one step: the union of indexes keeps every column whole.
# (Assigning Series to an existing DataFrame one at a time would silently
# truncate any column longer than the first.)
df = pd.DataFrame(columns)

# Write out the final DataFrame as csv
df.to_csv('output.csv', index=False)
# And as text...
with open('output.txt', 'w') as fd:
    df.to_string(fd, index=False)
The script writes two files, output.csv and output.txt. The output format is basically free since the data is in a DataFrame. If you save the script as build_table.py, you would invoke it as: python build_table.py *.gbff
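The pd.Series conversion is the key trick here: pandas aligns Series of different lengths on their index and fills the gaps with NaN, so ragged columns come out padded rather than raising an error. A minimal sketch, with hypothetical accession names:

```python
import pandas as pd

# Hypothetical accessions with different numbers of tRNA hits
df = pd.DataFrame({
    "NZ_AAAA01": pd.Series(["tRNA-Glu", "tRNA-Ser", "tRNA-Glu"]),
    "NZ_BBBB01": pd.Series(["tRNA-Asn", "tRNA-Ile"]),
})
# The shorter column is padded with NaN in its final row
print(df)
```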
Upvotes: 1
Reputation: 1207
Use awk's pattern matching to find the single LOCUS and any tRNA mentions:
- Get the locus accession for free with awk's field splitting, as the second field.
- Split out the tRNA products (if any) enclosed in quotes and associate them with the locus accession.
- Keep track of the maximum number of tRNAs for any locus; this will be the number of rows output (plus the header).
After all GenBank records in all files given on the command line are processed:
- Output each locus accession as the header, then for every possible row output any tRNA product associated with it beneath its locus accession.
script:
#! /usr/local/bin/gawk -f

/^LOCUS /{ RefSeq = $2 }

/^ +\/product="tRNA-/{
    split($1, tRNA, "\"")
    head[RefSeq]++
    LOCUS[RefSeq, head[RefSeq]] = tRNA[2]
    if (head[RefSeq] > rows)
        rows = head[RefSeq]
}

END {
    # to ensure constant order
    cols = asorti(head)
    for (i = 1; i <= cols; i++)
        header = header "|" head[i]
    print substr(header, 2)
    for (r = 1; r <= rows; r++) {
        row = ""
        for (c = 1; c <= cols; c++)
            row = row "|" LOCUS[head[c], r]
        print substr(row, 2)
    }
}
Directory note: the .gbff files I grabbed have multiple GenBank records per file.
ls -l
-rw-r--r-- 1 tomc tomc 1724943 Mar 31 23:21 bacteria.24.genomic.gbff
-rw-r--r-- 1 tomc tomc 3216313 Mar 31 23:53 bacteria.431.genomic.gbff
-rwxr-xr-x 1 tomc tomc 487 Mar 31 23:51 locus_trna_transpose.awk
Result (note: using a pipe as the field separator to make the columns easier to see):
./locus_trna_transpose.awk bacteria*.gbff
NZ_AWIQ01000161|NZ_AWIQ01000167|NZ_AWIQ01000183|NZ_AWKN01000226|NZ_AWKN01000319|NZ_AWLJ01000031|NZ_AWLJ01000043|NZ_AWLJ01000050|NZ_JORA01000025|NZ_JORA01000026|NZ_JORA01000030|NZ_JORA01000031|NZ_JORA01000032|NZ_JORA01000034|NZ_JORA01000036|NZ_JORA01000037
tRNA-binding|tRNA-Arg|tRNA-Met|tRNA-Asn|tRNA-binding|tRNA-Val|tRNA-Leu|tRNA-binding|tRNA-Pro|tRNA-Val|tRNA-Arg|tRNA-Arg|tRNA-Met|tRNA-Phe|tRNA-Met|tRNA-Met
|||||tRNA-Val|||tRNA-Arg|tRNA-Val|tRNA-Arg|tRNA-Arg|tRNA-Met||tRNA-binding|tRNA-Leu
||||||||tRNA-Ala|tRNA-Val||tRNA-Ser|tRNA-Met|||
||||||||tRNA-Ala|tRNA-Val|||tRNA-Gly|||
||||||||tRNA-Ala|tRNA-Lys|||tRNA-modifying|||
||||||||tRNA-Ala|||||||
For those playing along, these input files are grabbed as being smallish from
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/
Upvotes: 2
Reputation: 207465
Updated Answer
I think my original approach is too resource intensive for the number of files you have, so here is a less intensive version:
#!/bin/bash

# Don't barf if there are no matching files, and allow upper and lower case
shopt -s nullglob nocaseglob

# Loop through all ".txt" files whose names start with "seq"
j=0
for f in seq*txt; do
    # Generate output file name
    printf -v out "Z%05d.tmp" $j
    grep -E "LOCUS\s*NZ_AP013294|tRNA" "$f" > "$out"
    ((j=j+1))
done

# Paste all output files together in columns
paste Z*tmp > result.txt

# Remove temp files
rm Z*tmp
I only named the temporary files beginning with "Z" so that they sort to the end in your file browser.
So, save the above in a file called "go" in your HOME directory, then make it executable with:
chmod +x $HOME/go
Then you can run it in any directory by doing:
cd some/place/where/your/files/are
$HOME/go
Original Answer
I think this gives what you want:
parallel 'grep "term" {} > {#}.tmp' ::: *txt ; paste *tmp ; rm *tmp
Sample Output
term1 term1
term3 term2
term3
That says... "Look for term in all files whose names end in txt, saving the output from the first in 1.tmp, the second in 2.tmp and so on. At the end, paste all the temporary files together in columns and remove them."
You need to know that, with GNU Parallel, {} stands for "the current parameter/file", {#} stands for "the current job number", and ::: separates the parameters from the commands to be applied.
You can also do it without temporary files. Prefix each line of grep output with the GNU Parallel job number, then pipe the lot through Perl to strip off the job number N and replace it with 8*(N-1) spaces:
parallel --tag-string {#} 'grep "term" {}' ::: *txt | perl -plne 's/^(\d+)\s*/" " x (8*($1-1))/e'
term1
term3
term1
term2
term3
Perl is shipped with all Macs.
For anyone who is interested, this is a variant of the Schwartzian Transform, or the Lisp decorate-sort-undecorate idiom.
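The decorate/undecorate step can be sketched in Python, assuming input lines tagged with a job number the way `parallel --tag-string {#}` emits them (the term values below are hypothetical):

```python
import re

def undecorate(line):
    """Strip the leading job number N and indent by 8*(N-1) spaces instead."""
    n, text = re.match(r"^(\d+)\s*(.*)$", line).groups()
    return " " * (8 * (int(n) - 1)) + text

# Lines as 'parallel --tag-string {#} ...' would emit them
tagged = ["1\tterm1", "1\tterm3", "2\tterm1", "2\tterm2", "2\tterm3"]

# Job 1 lands in column 0, job 2 in column 8, and so on
for line in tagged:
    print(undecorate(line))
```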
I install GNU Parallel with homebrew:
brew install parallel
Though it can also be installed very simply without homebrew since it is only a Perl script and all Macs come with Perl:
(wget -O - pi.dk/3 || curl pi.dk/3/ ) | bash
Upvotes: 2