Reputation: 1235
Using the Mac command line, I want to grep multiple files and save the output to different columns.
Here is an example of one of my files: ncbi.nlm.nih.gov/nuccore/NZ_AP013294 (click 'Send to', choose 'File', and save the format as 'full GenBank file'). I have ~2000 of these. I just want the LOCUS line (this is unique to each file and there is always exactly one of them) and then every occurrence of tRNA-, e.g. tRNA-Glu (tRNA- appears throughout the document; there could be anywhere from zero to ~200 of these).
This works when I do it on one of the files:
grep -e "LOCUS" -e "tRNA-" *.gbff > output.txt
but I would like the output.txt file to contain columns, one for each *.gbff file:
LOCUS XXXX LOCUS XXXX
tRNA-Glu tRNA-Asn
tRNA-Ser tRNA-Ile
tRNA-Glu
... etc.
Upvotes: 1
Views: 1605
Reputation: 203532
Your question is unclear, but given one possible interpretation of your requirements, this might be what you want:
awk '
/LOCUS/ {
    locus[++numCols] = $0
}
/tRNA/ {
    rnas[++rowNrs[numCols],numCols] = $0
    numRows = (rowNrs[numCols] > numRows ? rowNrs[numCols] : numRows)
}
END {
    for (colNr=1; colNr<=numCols; colNr++) {
        printf "%s%s", locus[colNr], (colNr<numCols ? OFS : ORS)
    }
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%s", rnas[rowNr,colNr], (colNr<numCols ? OFS : ORS)
        }
    }
}
' *.txt
Untested, of course, since you didn't provide sample input/output we could test against.
If that's not what you want, then whatever it is you DO want will be equally trivial, but you'll have to tell/show us what that is in your question first to get a solution for it.
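For reference, the padding behaviour this awk implements (shorter columns filled with blanks so every row has one cell per file) can be sketched in standard Python; the accession names and tRNA values below are hypothetical:

```python
from itertools import zip_longest

# Hypothetical per-file results: LOCUS accession -> ragged list of tRNA hits
columns = {
    "NZ_AAAA01": ["tRNA-Glu", "tRNA-Ser", "tRNA-Glu"],
    "NZ_BBBB01": ["tRNA-Asn", "tRNA-Ile"],
}

# Header row: one accession per column
print("\t".join(columns))

# Body: zip_longest pads the shorter columns with empty strings
for row in zip_longest(*columns.values(), fillvalue=""):
    print("\t".join(row))
```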
Upvotes: 1
Reputation: 1253
This type of processing is really beyond the capabilities of grep. You're better off creating a small customized script that does exactly what you want. I would use python and pandas to accomplish the task. Here is an implementation, based on my current understanding of your description.
import re
import sys

import pandas as pd

# Regular expressions; adjust as necessary
regex_locus = re.compile(r'LOCUS\s+(\S+)')
regex_trna = re.compile(r'"(tRNA-.*)"')

# Aggregate per-file results: one column per LOCUS
columns = {}
total_files = len(sys.argv[1:])

# Iterate over the files given on the command line
for index, file in enumerate(sys.argv[1:]):
    print("Processing {:4d}/{:4d}: {}".format(index + 1, total_files, file))
    # Per file variables.
    locus = None
    trna = []
    # Open the file for reading
    with open(file) as fd:
        # Find LOCUS value
        for line in fd:
            match = regex_locus.search(line)
            if match:
                locus = match.group(1)
                # Done with this section
                break
        # Find tRNA values
        for line in fd:
            match = regex_trna.search(line)
            if match:
                trna.append(match.group(1))
    # Done with this file
    if locus is not None:
        # Convert the list to a pandas Series to account for differing lengths.
        columns[locus] = pd.Series(trna)

# Build the DataFrame in one step: the union of indexes keeps every column whole.
# (Assigning Series to an existing DataFrame one at a time would silently
# truncate any column longer than the first.)
df = pd.DataFrame(columns)

# Write out the final DataFrame as csv
df.to_csv('output.csv', index=False)
# And as text...
with open('output.txt', 'w') as fd:
    df.to_string(fd, index=False)
The script writes two files, output.csv and output.txt. The output format is basically free since the data is in a DataFrame. If you save the script as build_table.py, you would invoke it as: python build_table.py *.gbff
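The pd.Series conversion is the key trick here: pandas aligns Series of different lengths on their index and fills the gaps with NaN, so ragged columns come out padded rather than raising an error. A minimal sketch, with hypothetical accession names:

```python
import pandas as pd

# Hypothetical accessions with different numbers of tRNA hits
df = pd.DataFrame({
    "NZ_AAAA01": pd.Series(["tRNA-Glu", "tRNA-Ser", "tRNA-Glu"]),
    "NZ_BBBB01": pd.Series(["tRNA-Asn", "tRNA-Ile"]),
})
# The shorter column is padded with NaN in its final row
print(df)
```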
Upvotes: 1
Reputation: 1207
Use awk's pattern matching to find the single LOCUS and any tRNA mentions:
- Get the locus accession for free with awk's field splitting, as the second field.
- Split out the tRNA products (if any) enclosed in quotes and associate them with the locus accession.
- Keep track of the maximum number of tRNAs for any locus; this will be the number of rows output (plus the header).
After all GenBank records in all files given on the command line are processed:
- Output each locus accession as the header, then for every possible row output any tRNA product associated with it beneath its locus accession.
script:
#! /usr/local/bin/gawk -f

/^LOCUS /{ RefSeq = $2 }

/^ +\/product="tRNA-/{
    split($1, tRNA, "\"")
    head[RefSeq]++
    LOCUS[RefSeq, head[RefSeq]] = tRNA[2]
    if (head[RefSeq] > rows)
        rows = head[RefSeq]
}

END {
    # to ensure constant order
    cols = asorti(head)
    for (i = 1; i <= cols; i++)
        header = header "|" head[i]
    print substr(header, 2)
    for (r = 1; r <= rows; r++) {
        row = ""
        for (c = 1; c <= cols; c++)
            row = row "|" LOCUS[head[c], r]
        print substr(row, 2)
    }
}
Directory note: the .gbff files I grabbed have multiple GenBank records per file.
ls -l
-rw-r--r-- 1 tomc tomc 1724943 Mar 31 23:21 bacteria.24.genomic.gbff
-rw-r--r-- 1 tomc tomc 3216313 Mar 31 23:53 bacteria.431.genomic.gbff
-rwxr-xr-x 1 tomc tomc 487 Mar 31 23:51 locus_trna_transpose.awk
Result (note: using a pipe as the field separator to make the columns easier to see):
./locus_trna_transpose.awk bacteria*.gbff
NZ_AWIQ01000161|NZ_AWIQ01000167|NZ_AWIQ01000183|NZ_AWKN01000226|NZ_AWKN01000319|NZ_AWLJ01000031|NZ_AWLJ01000043|NZ_AWLJ01000050|NZ_JORA01000025|NZ_JORA01000026|NZ_JORA01000030|NZ_JORA01000031|NZ_JORA01000032|NZ_JORA01000034|NZ_JORA01000036|NZ_JORA01000037
tRNA-binding|tRNA-Arg|tRNA-Met|tRNA-Asn|tRNA-binding|tRNA-Val|tRNA-Leu|tRNA-binding|tRNA-Pro|tRNA-Val|tRNA-Arg|tRNA-Arg|tRNA-Met|tRNA-Phe|tRNA-Met|tRNA-Met
|||||tRNA-Val|||tRNA-Arg|tRNA-Val|tRNA-Arg|tRNA-Arg|tRNA-Met||tRNA-binding|tRNA-Leu
||||||||tRNA-Ala|tRNA-Val||tRNA-Ser|tRNA-Met|||
||||||||tRNA-Ala|tRNA-Val|||tRNA-Gly|||
||||||||tRNA-Ala|tRNA-Lys|||tRNA-modifying|||
||||||||tRNA-Ala|||||||
For those playing along, these input files are grabbed as being smallish from
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/
Upvotes: 2
Reputation: 207465
Updated Answer
I think my original approach is too resource intensive for the number of files you have, so here is a less intensive version:
#!/bin/bash

# Don't barf if there are no matching files, and allow upper and lower case
shopt -s nullglob nocaseglob

# Loop through all ".txt" files whose names start with "seq"
j=0
for f in seq*txt; do
    # Generate output file name
    printf -v out "Z%05d.tmp" $j
    grep -E "LOCUS\s*NZ_AP013294|tRNA" "$f" > "$out"
    ((j=j+1))
done

# Paste all output files together in columns
paste Z*tmp > result.txt

# Remove temp files
rm Z*tmp
I only named the temporary files beginning with "Z" so that they sort to the end in your file browser.
So, save the above in a file called "go" in your HOME directory, then make it executable with:
chmod +x $HOME/go
Then you can run it in any directory by doing:
cd some/place/where/your/files/are
$HOME/go
Original Answer
I think this gives what you want:
parallel 'grep "term" {} > {#}.tmp' ::: *txt ; paste *tmp ; rm *tmp
Sample Output
term1 term1
term3 term2
term3
That says... "Look for term in all files whose names end in txt, saving the output from the first in 1.tmp, the second in 2.tmp and so on. At the end, paste all the temporary files together in columns and remove them."
You need to know that, with GNU Parallel, {} stands for "the current parameter/file", {#} stands for "the current job number", and ::: separates the parameters from the commands to be applied.
You can also do it without temporary files. Prefix each line of grep output with the GNU Parallel job number, then pipe the lot through Perl to strip off the job number N and replace it with 8*(N-1) spaces:
parallel --tag-string {#} 'grep "term" {}' ::: *txt | perl -plne 's/^(\d+)\s*/" " x (8*($1-1))/e'
term1
term3
term1
term2
term3
Perl is shipped with all Macs.
For anyone who is interested, this is a variant of the Schwartzian Transform, or the Lisp decorate-sort-undecorate idiom.
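The decorate/undecorate step can be sketched in Python, assuming input lines tagged with a job number the way `parallel --tag-string {#}` emits them (the term values below are hypothetical):

```python
import re

def undecorate(line):
    """Strip the leading job number N and indent by 8*(N-1) spaces instead."""
    n, text = re.match(r"^(\d+)\s*(.*)$", line).groups()
    return " " * (8 * (int(n) - 1)) + text

# Lines as 'parallel --tag-string {#} ...' would emit them
tagged = ["1\tterm1", "1\tterm3", "2\tterm1", "2\tterm2", "2\tterm3"]

# Job 1 lands in column 0, job 2 in column 8, and so on
for line in tagged:
    print(undecorate(line))
```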
I install GNU Parallel with homebrew:
brew install parallel
Though it can also be installed very simply without homebrew since it is only a Perl script and all Macs come with Perl:
(wget -O - pi.dk/3 || curl pi.dk/3/ ) | bash
Upvotes: 2