Reputation: 15
I am trying to extract lines from file genome.gff that contain a line from file suspicious.txt. suspicious.txt was derived from genome.gff and every line should match.
Using grep on a single line from suspicious.txt works as expected:
grep 'gene10002' genome.gff
NC_007082.3 Gnomon gene 1269632 1273520 . + . ID=gene10002;Dbxref=BEEBASE:GB54789,GeneID:409846;Name=bur;gbkey=Gene;gene=bur;gene_biotype=protein_coding
NC_007082.3 Gnomon mRNA 1269632 1273520 . + . ID=rna21310;Parent=gene10002;Dbxref=GeneID:409846,Genbank:XM_393336.5,BEEBASE:GB54789;Name=XM_393336.5;gbkey=mRNA;gene=bur;product=burgundy;transcript_id=XM_393336.5
But every variation on using grep from a file that I've been able to think of or find online produces no output or an empty file:
grep -f suspicious.txt genome.gff
grep -F -f suspicious.txt genome.gff
while read line; do grep "$line" genome.gff; done<suspicious.txt
while read line; do grep '$line' genome.gff; done<suspicious.txt
while read line; do grep "${line}" genome.gff; done<suspicious.txt
cat suspicious.txt | while read line; do grep '$line' genome.gff; done
cat suspicious.txt | while read line; do grep '$line' genome.gff >> suspicious.gff; done
cat suspicious.txt | while read line; do grep -e "${line}" genome.gff >> suspicious.gff; done
cat "$(cat suspicious_bee_geneIDs_test.txt)" | while read line; do grep -e "${line}" genome.gff >> suspicious.gff; done
Running it as a script also produces an empty file:
#!/bin/bash
SUSP=$1
GFF=$2
while read -r line; do
grep -e "${line}" $GFF >> suspicious_bee_genes.gff
done<$SUSP
This is what the files look like:
head genome.gff
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Amel_4.5
#!genome-build-accession NCBI_Assembly:GCF_000002195.4
##sequence-region NC_007070.3 1 29893408
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7460
NC_007070.3 RefSeq region 1 29893408 . + . ID=id0;Dbxref=taxon:7460;Name=LG1;gbkey=Src;genome=chromosome;linkage- group=LG1;mol_type=genomic DNA;strain=DH4
NC_007070.3 Gnomon gene 181 211962 . - . ID=gene0;Dbxref=BEEBASE:GB42164,GeneID:726912;Name=cort;gbkey=Gene;gene=cort;gene_biotype=protein_coding
NC_007070.3 Gnomon mRNA 181 71559 . - . ID=rna0;Parent=gene0;Dbxref=GeneID:726912,Genbank:XM_006557348.1,BEEBASE:GB42164;Name=XM_006557348.1;gbkey=mRNA;gene=cort;product=cortex%2C transcript variant X2;transcript_id=XM_006557348.1
wc -l genome.gff
457742
head suspicious.txt
gene10002
gene1001
gene1003
gene10038
gene10048
gene10088
gene10132
gene10134
gene10181
gene10209
wc -l suspicious.txt
928
Does anyone know what's going wrong here?
Upvotes: 1
Views: 135
Reputation: 124646
This can happen when the input file is in DOS format: each line will have a trailing CR character at the end, which will break the matching.
One way to check if this is the case is using hexdump
, for example (just the first few lines):
$ hexdump -C suspicious.txt 00000000 67 65 6e 65 31 30 30 30 32 0d 0a 67 65 6e 65 31 |gene10002..gene1| 00000010 30 30 31 0d 0a 67 65 6e 65 31 30 30 33 0d 0a 67 |001..gene1003..g| 00000020 65 6e 65 31 30 30 33 38 0d 0a 67 65 6e 65 31 30 |ene10038..gene10|
In the ASCII representation at the right, notice the ..
after each gene. These dots correspond to 0d
and 0a
. The 0d
is the CR character.
Without the CR character, the output should look like this:
$ hexdump -C <(tr -d '\r' < suspicious.txt) 00000000 67 65 6e 65 31 30 30 30 32 0a 67 65 6e 65 31 30 |gene10002.gene10| 00000010 30 31 0a 67 65 6e 65 31 30 30 33 0a 67 65 6e 65 |01.gene1003.gene| 00000020 31 30 30 33 38 0a 67 65 6e 65 31 30 30 34 38 0a |10038.gene10048.|
Just one .
after each gene, corresponding to 0a
, and no 0d
.
Another way to see the DOS line endings in the vi
editor. If you open the file with vi
, the status line would show [dos]
, or you could run the ex command :set ff?
to make it tell you the file format (the status line will say fileformat=dos
).
You can remove the CR characters on the fly like this:
grep -f <(tr -d '\r' < suspicious.txt) genome.gff
Or you could remove in vi
, by running the ex command :set ff=unix
and then save the file. There are other command line tools too that can remove the DOS line ending.
Another possibility is that instead of a trailing CR character, you might have trailing whitespace. The output of hexdump -C
should make that perfectly clear. After the trailing whitespace characters are removed, the grep -f
should work as expected.
Upvotes: 2