letheasce
letheasce

Reputation: 15

Grep from file fails but grep with individual lines from the file works

I am trying to extract lines from file genome.gff that contain a line from file suspicious.txt. suspicious.txt was derived from genome.gff and every line should match.

Using grep on a single line from suspicious.txt works as expected:

grep 'gene10002' genome.gff
NC_007082.3 Gnomon  gene    1269632 1273520 .   +   .   ID=gene10002;Dbxref=BEEBASE:GB54789,GeneID:409846;Name=bur;gbkey=Gene;gene=bur;gene_biotype=protein_coding
NC_007082.3 Gnomon  mRNA    1269632 1273520 .   +   .   ID=rna21310;Parent=gene10002;Dbxref=GeneID:409846,Genbank:XM_393336.5,BEEBASE:GB54789;Name=XM_393336.5;gbkey=mRNA;gene=bur;product=burgundy;transcript_id=XM_393336.5

But every variation on using grep from a file that I've been able to think of or find online produces no output or an empty file:

grep -f suspicious.txt genome.gff
grep -F -f suspicious.txt genome.gff
while read line; do grep "$line" genome.gff; done<suspicious.txt
while read line; do grep '$line' genome.gff; done<suspicious.txt
while read line; do grep "${line}" genome.gff; done<suspicious.txt
cat suspicious.txt | while read line; do grep '$line' genome.gff; done
cat suspicious.txt | while read line; do grep '$line' genome.gff >> suspicious.gff; done
cat suspicious.txt | while read line; do grep -e "${line}" genome.gff >> suspicious.gff; done
cat "$(cat suspicious_bee_geneIDs_test.txt)" | while read line; do grep -e "${line}" genome.gff >> suspicious.gff; done

Running it as a script also produces an empty file:

#!/bin/bash
SUSP=$1
GFF=$2

while read -r line; do
        grep -e "${line}" $GFF >> suspicious_bee_genes.gff
done<$SUSP

This is what the files look like:

head genome.gff
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Amel_4.5
#!genome-build-accession NCBI_Assembly:GCF_000002195.4
##sequence-region NC_007070.3 1 29893408
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7460
NC_007070.3 RefSeq  region  1   29893408    .   +   .       ID=id0;Dbxref=taxon:7460;Name=LG1;gbkey=Src;genome=chromosome;linkage-    group=LG1;mol_type=genomic DNA;strain=DH4
NC_007070.3 Gnomon  gene    181 211962  .   -   .   ID=gene0;Dbxref=BEEBASE:GB42164,GeneID:726912;Name=cort;gbkey=Gene;gene=cort;gene_biotype=protein_coding
NC_007070.3 Gnomon  mRNA    181 71559   .   -   .   ID=rna0;Parent=gene0;Dbxref=GeneID:726912,Genbank:XM_006557348.1,BEEBASE:GB42164;Name=XM_006557348.1;gbkey=mRNA;gene=cort;product=cortex%2C transcript variant X2;transcript_id=XM_006557348.1

wc -l genome.gff
457742

head suspicious.txt
gene10002
gene1001
gene1003
gene10038
gene10048
gene10088
gene10132
gene10134
gene10181
gene10209

wc -l suspicious.txt
928

Does anyone know what's going wrong here?

Upvotes: 1

Views: 135

Answers (1)

janos
janos

Reputation: 124646

This can happen when the input file is in DOS format: each line will have a trailing CR character at the end, which will break the matching.

One way to check if this is the case is using hexdump, for example (just the first few lines):

$ hexdump -C suspicious.txt
00000000  67 65 6e 65 31 30 30 30  32 0d 0a 67 65 6e 65 31  |gene10002..gene1|
00000010  30 30 31 0d 0a 67 65 6e  65 31 30 30 33 0d 0a 67  |001..gene1003..g|
00000020  65 6e 65 31 30 30 33 38  0d 0a 67 65 6e 65 31 30  |ene10038..gene10|

In the ASCII representation at the right, notice the .. after each gene. These dots correspond to 0d and 0a. The 0d is the CR character.

Without the CR character, the output should look like this:

$ hexdump -C <(tr -d '\r' < suspicious.txt)
00000000  67 65 6e 65 31 30 30 30  32 0a 67 65 6e 65 31 30  |gene10002.gene10|
00000010  30 31 0a 67 65 6e 65 31  30 30 33 0a 67 65 6e 65  |01.gene1003.gene|
00000020  31 30 30 33 38 0a 67 65  6e 65 31 30 30 34 38 0a  |10038.gene10048.|

Just one . after each gene, corresponding to 0a, and no 0d.

Another way to see the DOS line endings in the vi editor. If you open the file with vi, the status line would show [dos], or you could run the ex command :set ff? to make it tell you the file format (the status line will say fileformat=dos).

You can remove the CR characters on the fly like this:

grep -f <(tr -d '\r' < suspicious.txt) genome.gff

Or you could remove in vi, by running the ex command :set ff=unix and then save the file. There are other command line tools too that can remove the DOS line ending.

Another possibility is that instead of a trailing CR character, you might have trailing whitespace. The output of hexdump -C should make that perfectly clear. After the trailing whitespace characters are removed, the grep -f should work as expected.

Upvotes: 2

Related Questions