icedcoffee

Reputation: 1015

For every line in a file, determine if a string is present within another file

I have a tab delimited text file (animals.txt) with five columns:

302947298 2340974238 0 0 cat
345098948 8345988989 0 0 dog
098982388 2098340923 0 0 fish
932840923 0923840988 0 0 parrot

I have another file, mess.txt.gz, which is compressed using GNU zip (.gz file). It basically looks like a massive string of letters:

sdihfoiahdfosparrotdhiafoihsdfoijaslkdogoieufoiweuf

Basically, for every line in the tab delimited text file, I want to see if any of the animal names are present within this .gz file.

Ideally, it would return something like this:

302947298 2340974238 0 0 cat no
345098948 8345988989 0 0 dog yes
098982388 2098340923 0 0 fish no
932840923 0923840988 0 0 parrot yes

At the moment I am doing the following:

gunzip -cd mess.txt.gz | grep cat
gunzip -cd mess.txt.gz | grep dog

To automate it, I've tried the following:

cat animals.txt | awk '{print $5}' > animal_names.txt

cat animal_names.txt | while read line 
do
   gunzip -cd mess.txt.gz | grep $line > output.txt
done

I've also tried:

cat animal_names.txt | while read line 
do
   if [ gunzip -cd mess.txt.gz | grep $line ]
   then
     echo "Yes"
   else
     echo "No"
   fi
   ; do
done > output.txt

What is the best way to do this in bash?

Upvotes: 3

Views: 1290

Answers (4)

Zilog80

Reputation: 2562

Many nice answers here, and a very good one from @tripleee.

Just adding the 'in memory' bash way:

#!/bin/bash
# Decompress the haystack once, then test every pattern against the in-memory string.
search() {
  local patterns="$2"
  local string
  string="$(gunzip -cd "$1")"
  while IFS= read -r line; do
    # Keep only the last tab-separated field (the animal name).
    local pattern="${line##*$'\t'}"
    local suffix="no"
    # If deleting the pattern changes the string, the pattern was present.
    [ "${string/"${pattern}"/}" != "${string}" ] && suffix="yes"
    echo "${line} ${suffix}"
  done < "${patterns}"
}
search mess.txt.gz animals.txt

The goal here is to limit I/O: one read of the gzipped mess.txt.gz, one read of animals.txt, and all the matching done in memory on strings.
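A quick check, assuming the sample animals.txt and mess.txt.gz contents from the question: running search mess.txt.gz animals.txt should print

302947298 2340974238 0 0 cat no
345098948 8345988989 0 0 dog yes
098982388 2098340923 0 0 fish no
932840923 0923840988 0 0 parrot yes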

Upvotes: 1

Enlico

Reputation: 28416

What about this?

gunzip -cd mess.txt.gz | grep "$(< animals.txt sed -e 's/.*\t//' | sed -z 's/\n/\\|/g;s/\\|$//')"

It is basically the version of your

gunzip -cd mess.txt.gz | grep dog

where, instead of dog, the regex dog\|cat\|whatever is generated from the file animals.txt.
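For instance, with the animals.txt from the question, the command substitution inside the double quotes expands to

cat\|dog\|fish\|parrot

so the whole pipeline reduces to gunzip -cd mess.txt.gz | grep 'cat\|dog\|fish\|parrot'. Note that this relies on GNU sed for the -z flag and the \t escape.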

My command should give you the same kind of output as the loop you show after

To automate it, I've tried the following:

that is, the matching lines from the compressed file, not the yes/no table you describe as ideal.

Upvotes: 1

tripleee

Reputation: 189457

You can pass all the search strings to zgrep -Ff - in one pass:

cut -f5 animals.txt |
zgrep -Ff - mess.txt.gz

The -F option says to look for literal strings, not regular expressions (avoids false positives if the input contains dots or other regex metacharacters, and besides, will be significantly faster) and -f - says to read the search patterns from standard input (i.e. from the pipe from cut).
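With the sample files from the question, this prints each line of the decompressed file that contains at least one of the names; here that is the single line, printed once (it contains both dog and parrot):

sdihfoiahdfosparrotdhiafoihsdfoijaslkdogoieufoiweuf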

If you want a list of the matched animals, add an -o option and a brief postprocessing step:

cut -f5 animals.txt |
zgrep -Ff - -o mess.txt.gz |
sort | uniq -c

You can replace | uniq -c with just -u if you don't care how many there were of each.
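With the question's sample data, the counting variant would print something like

      1 dog
      1 parrot

since -o emits each matched name on its own line and uniq -c then counts the occurrences.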

This works as intended on Linux with GNU grep, but macOS (and thus probably generally *BSD) grep -o only prints the first match in each input line when combined with -f -. If you need *BSD portability, I'd go with either of the other solutions here (currently there's one for sed and one for Awk).

Upvotes: 6

anubhava

Reputation: 785246

You may use this awk solution with gzcat:

awk 'BEGIN{FS=OFS="\t"} FNR==NR {s=s $0; next} {print $0, (index(s, $NF) > 0 ? "yes" : "no")}' <(gzcat mess.txt.gz) animals.txt

302947298  2340974238  0  0  cat     no
345098948  8345988989  0  0  dog     yes
098982388  2098340923  0  0  fish    no
932840923  0923840988  0  0  parrot  yes

A more readable form:

awk '
BEGIN {FS=OFS="\t"}      # tab-separated input and output
FNR == NR {              # first input: the decompressed mess file
   s = s $0              # accumulate its content in one string
   next
}
{                        # second input: animals.txt
   print $0, (index(s, $NF) > 0 ? "yes" : "no")   # last field a substring of s?
}
' <(gzcat mess.txt.gz) animals.txt
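gzcat is common on macOS/BSD but not installed everywhere; on most Linux systems zcat or gunzip -c does the same job. A minimal variant of the one-liner, assuming only that gunzip is available and changing nothing else:

awk 'BEGIN{FS=OFS="\t"} FNR==NR {s=s $0; next} {print $0, (index(s, $NF) > 0 ? "yes" : "no")}' <(gunzip -c mess.txt.gz) animals.txt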

Upvotes: 1
