Reputation: 1015
I have a tab delimited text file (animals.txt) with five columns:
302947298 2340974238 0 0 cat
345098948 8345988989 0 0 dog
098982388 2098340923 0 0 fish
932840923 0923840988 0 0 parrot
I have another file, mess.txt.gz, which is compressed using GNU zip (.gz file). It basically looks like a massive string of letters:
sdihfoiahdfosparrotdhiafoihsdfoijaslkdogoieufoiweuf
Basically, for every line in the tab-delimited text file, I want to check whether that line's animal name (column 5) occurs anywhere in this .gz file.
Ideally, it would return something like this:
302947298 2340974238 0 0 cat no
345098948 8345988989 0 0 dog yes
098982388 2098340923 0 0 fish no
932840923 0923840988 0 0 parrot yes
At the moment I am doing the following:
gunzip -cd mess.txt.gz | grep cat
gunzip -cd mess.txt.gz | grep dog
To automate it, I've tried the following:
cat animals.txt | awk '{print $5}' > animal_names.txt
cat animal_names.txt | while read line
do
gunzip -cd mess.txt.gz | grep $line > output.txt
done
I've also tried:
cat animal_names.txt | while read line
do
if [ gunzip -cd mess.txt.gz | grep $line ]
then
echo "Yes"
else
echo "No"
fi
; do
done > output.txt
What is the best way to do this in bash?
Upvotes: 3
Views: 1290
Reputation: 2562
Many nice answers here, and a very good one from @tripleee.
Just adding the "in memory" bash way:
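#!/bin/bash
search() {
    local patterns="$2"
    # Decompress once and keep the whole text in memory.
    local string="$(gunzip -cd "$1")"
    while IFS= read -r line; do
        # Strip everything up to the last tab, leaving the animal name.
        local pattern="${line/[^$'\t']*$'\t'/}"
        local suffix="no"
        # If deleting the name changes the text, the name occurs in it.
        [ "${string/${pattern}/}" != "${string}" ] && suffix="yes"
        echo "${line} ${suffix}"
    done < "${patterns}"
}
search mess.txt.gz animals.txt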
The goal here is to limit I/O: one read of the gzipped mess.txt.gz, one read of animals.txt, and all matching done in memory with bash string patterns.
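One caveat with the substitution test above: ${string/${pattern}/} treats the animal name as a glob pattern, so a name containing *, ? or [ may not be compared literally. Here is a minimal sketch of a literal variant, assuming tab-separated columns as in the question. The function name search_literal and its variable names are made up for illustration; the quoted pattern inside case is what forces a literal substring match.

```shell
# Hypothetical helper: search_literal MESS_GZ ANIMALS_TSV
# Reads the decompressed text once, then tests each name as a literal substring.
search_literal() {
    local haystack tab line name suffix
    haystack=$(gunzip -c "$1")      # whole decompressed text held in memory
    tab=$(printf '\t')
    while IFS= read -r line; do
        name=${line##*"$tab"}       # last tab-separated field
        case $haystack in
            *"$name"*) suffix=yes ;;   # quoted, so glob characters are literal
            *)         suffix=no ;;
        esac
        printf '%s\t%s\n' "$line" "$suffix"
    done < "$2"
}
```

It would be called as: search_literal mess.txt.gz animals.txt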
Upvotes: 1
Reputation: 28416
What about this?
gunzip -cd mess.txt.gz | grep "$(< animals.txt sed -e 's/.*\t//' | sed -z 's/\n/\\|/g;s/\\|$//')"
It is basically a version of your
gunzip -cd mess.txt.gz | grep dog
where, instead of dog, the regex dog\|cat\|whatever is generated from the file animals.txt.
My command should give you the same output as the attempt you show after "To automate it, I've tried the following:", which does not produce the result you refer to as ideal.
Upvotes: 1
Reputation: 189457
You can pass all the search strings to zgrep -Ff - in one pass:
cut -f5 animals.txt |
zgrep -Ff - mess.txt.gz
The -F option says to look for literal strings, not regular expressions (this avoids false positives if the input contains dots or other regex metacharacters, and it will also be significantly faster), and -f - says to read the search patterns from standard input (i.e. from the pipe from cut).
If you want a list of the matched animals, add an -o option and a brief postprocessing step:
cut -f5 animals.txt |
zgrep -Ff - -o mess.txt.gz |
sort | uniq -c
You can replace | uniq -c with just -u (that is, sort -u) if you don't care how many there were of each.
This works as intended on Linux with GNU grep, but macOS (and thus probably *BSD generally) grep -o only prints the first match in each input line when combined with -f -. If you need *BSD portability, I'd go with one of the other solutions here (currently there's one for sed and one for Awk).
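To get from the list of matched names to the yes/no column the question asks for, the matches can be fed back into a small Awk step. This is only a sketch under the same assumptions as above (GNU grep, tab-separated animals.txt with the name in column 5). The function name annotate and the Awk variable in_matches are my own; in_matches is set between the file operands so that an empty match list is still handled correctly.

```shell
# Hypothetical helper: annotate ANIMALS_TSV MESS_GZ
# Appends "yes" or "no" to each line depending on whether its name was matched.
annotate() {
    cut -f5 "$1" |
        zgrep -oFf - "$2" |     # one matched name per output line
        sort -u |
        awk -F'\t' -v OFS='\t' '
            in_matches { seen[$0] = 1; next }             # first input: matched names
            { print $0, (($5 in seen) ? "yes" : "no") }   # second input: animals file
        ' in_matches=1 - in_matches=0 "$1"
}
```

It would be called as: annotate animals.txt mess.txt.gz. The grep -o portability caveat mentioned above applies to this sketch too.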
Upvotes: 6
Reputation: 785246
You may use this awk solution with gzcat:
awk 'BEGIN{FS=OFS="\t"} FNR==NR {s=s $0; next} {print $0, (index(s, $NF) > 0 ? "yes" : "no")}' <(gzcat mess.txt.gz) animals.txt
302947298 2340974238 0 0 cat no
345098948 8345988989 0 0 dog yes
098982388 2098340923 0 0 fish no
932840923 0923840988 0 0 parrot yes
A more readable form:
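awk '
BEGIN {FS=OFS="\t"}
# First input (the decompressed text): append every line to s.
FNR == NR {
    s = s $0
    next
}
# Second input (animals.txt): index() returns 0 when the name is absent.
{
    print $0, (index(s, $NF) > 0 ? "yes" : "no")
}
' <(gzcat mess.txt.gz) animals.txt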
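If the decompressed file has many lines and you would rather not build one big string, a possible variation is to read animals.txt first and then stream the decompressed text line by line. This is a rough sketch, not a drop-in replacement: the name stream_annotate and the Awk variable reading_animals are made up here, gunzip -c stands in for gzcat, and unlike the version above it will not find a name that is split across a line break. If the file really is one single huge line, Awk still has to hold that line in memory.

```shell
# Hypothetical helper: stream_annotate ANIMALS_TSV MESS_GZ
# Remembers the animal lines, then scans the decompressed text one line at a time.
stream_annotate() {
    gunzip -c "$2" |
        awk -F'\t' -v OFS='\t' '
            reading_animals { line[++n] = $0; name[n] = $NF; next }
            {
                # Check only the names that have not been found yet.
                for (i = 1; i <= n; i++)
                    if (!found[i] && index($0, name[i]) > 0)
                        found[i] = 1
            }
            END {
                for (i = 1; i <= n; i++)
                    print line[i], (found[i] ? "yes" : "no")
            }
        ' reading_animals=1 "$1" reading_animals=0 -
}
```

It would be called as: stream_annotate animals.txt mess.txt.gz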
Upvotes: 1