Reputation: 28369

grep variables and give informative output

I want to see how many times each specific word occurs in a file, and in how many lines.

My dummy example looks like this:

cat words
blue
red 
green
yellow 

cat text
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT

I am doing this:

for i in $(cat words); do grep "$i" text | wc >> output; done

cat output
  2       2      51
  0       0       0
  1       1      26
  0       0       0

But what I actually want to get is:
1. The word that was used as a variable;
2. In how many lines (in addition to the total number of hits) the word was found.

The preferred output looks like this:

blue    3   2
red     0   0 
green   1   1
yellow  0   0

$1 - variable that was grep'ed
$2 - how many times variable was found in the text
$3 - in how many lines variable was found

I hope someone can help me do this with grep, awk, or sed, as they are fast enough for a large data set, but a Perl one-liner would help me too.

Edit

Tried this

   for i in $(cat words); do grep "$i" text > out_${i}; done && wc out*

and it kinda looks nice, but some of the words are longer than 300 characters, so I can't create a file named after the word.

Upvotes: 3

Answers (5)

Ed Morton

Reputation: 204229

awk '
NR==FNR { words[$0]; next }
{
   for (word in words) {
      count = gsub(word,word)
      if (count) {
         counts[word] += count
         lines[word]++
      }
   }
}
END { for (word in words) printf "%s %d %d\n", word, counts[word], lines[word] }
' words text
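
Note that gsub() treats each word as a regular expression, so a word containing characters such as `.` or `*` may match more than intended. Below is a minimal sketch of a literal-match variant that uses index() instead. The function name count_words is hypothetical, and the sketch assumes a non-empty word list with one word per line. It also keeps the words in their input order, since `for (word in words)` iterates in an unspecified order.

```shell
# count_words WORDS_FILE TEXT_FILE
# Hypothetical helper: prints "word hits lines" for each word, matched literally.
count_words() {
    awk '
    NR == FNR {                          # first file: the word list
        if ($0 != "" && !($0 in seen)) { # skip empty lines and duplicates
            seen[$0]
            order[++n] = $0              # remember input order
        }
        next
    }
    {                                    # second file: the text
        for (i = 1; i <= n; i++) {
            word = order[i]
            rest = $0
            count = 0
            # index() is a plain substring search, so no regex escaping is needed
            while ((pos = index(rest, word)) > 0) {
                count++
                rest = substr(rest, pos + length(word))
            }
            if (count) {
                hits[word] += count
                lines[word]++
            }
        }
    }
    END {
        for (i = 1; i <= n; i++)
            printf "%s %d %d\n", order[i], hits[order[i]], lines[order[i]]
    }
    ' "$1" "$2"
}
```

Usage would be `count_words words text`, optionally piped through `column -t`.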

Upvotes: 1

Kent

Reputation: 195209

An awk (gawk) one-liner could save you from the grep puzzle:

  awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text

The same code, formatted a bit:

awk 'NR==FNR{n[$0];l[$0];next;}
    {for(w in n){ s=$0;
        t=gsub(w,"#",s); 
        n[w]+=t;l[w]+=t>0?1:0;}
    }END{for(x in n)print x,n[x],l[x]}' words text

Test with your example:

kent$  awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
yellow  0 0
red  0 0
green 1 1
blue 3 2

If you want to format your output, you could just pipe the awk output to column -t

so it looks like:

yellow  0  0
red     0  0
green   1  1
blue    3  2

Upvotes: 1

amon

Reputation: 57640

Here is a similar Perl solution, but written as a complete script.

#!/usr/bin/perl

use 5.012;

die "USAGE: $0 wordlist.txt [text-to-search.txt]\n" unless @ARGV;

my $wordsfile = shift @ARGV;
my @wordlist = do {
    open my $words_fh, "<", $wordsfile or die "Can't open $wordsfile: $!";
    map {chomp; length() ? $_ : ()} <$words_fh>;
};

my %words;
while (<>) {
    for my $word (@wordlist) {
        my $cnt = 0;
        $cnt++ for /\Q$word\E/g;
        $words{$word}[0] += $cnt;
        $words{$word}[1] += 1&!! $cnt; # trick to force 1 or 0.
    }
}

# sorts output by frequency. remove `sort {...}` to get unsorted output.
for my $key (sort {$words{$b}->[0] <=> $words{$a}->[0] or $a cmp $b} keys %words) {
    say join "\t", $key, @{ $words{$key} };
}

Example output:

blue    3       2
green   1       1
red     0       0
yellow  0       0

Advantage over bash script: every file is only read once.

Upvotes: 3

Dave Sherohman

Reputation: 46207

This gets pretty ugly as a Perl one-liner (partly because it needs to get data from two files and only one can be sent on stdin, partly because of the requirement to count both the number of lines matched and the total number of matches), but here you go:

perl -E 'undef $|; open $w, "<", "words"; @w=<$w>; chomp @w; $r{$_}=[0,{}] for @w; my $re = join "|", @w; while(<>) { $l++; while (/($re)/g) { $r{$1}[0]++; $r{$1}[1]{$l}++; } }; say "$_\t$r{$_}[0]\t" . scalar keys %{$r{$_}[1]} for @w' < text

This requires perl 5.10 or later, but changing it to support 5.8 and earlier is trivial. (Change the -E to -e, change say to print, and add a \n at the end of each line of output.)

Output:

blue    3   2
red     0   0
green   1   1
yellow  0   0

Upvotes: 1

Vivek

Reputation: 2020

You can use the grep option -o, which prints only the matched parts of a matching line, with each match on a separate output line.

while IFS= read -r line; do
    wordcount=$(grep -o "$line" text | wc -l)
    linecount=$(grep -c "$line" text)
    echo "$line" "$wordcount" "$linecount"
done < words | column -t

You can put it all on one line to make it a one-liner.

If column gives the "column too long" error, you can use printf provided you know the maximum number of characters. Use the below instead of echo and remove the pipe to column:

printf "%-20s %-2s %-2s\n" "$line" "$wordcount" "$linecount"

Replace the 20 with your max word length and the other numbers as well if you need to.
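
One caveat: grep interprets each word as a regular expression by default. If the words should be matched literally, a variant with -F (fixed strings) could look roughly like the sketch below. The function name count_with_grep is hypothetical. It assumes a grep that supports -o (GNU and BSD grep do, but it is not in POSIX). The `|| true` guards only matter if the script runs under `set -e` or `pipefail`, because grep exits with status 1 when nothing matches.

```shell
# count_with_grep WORDS_FILE TEXT_FILE   (hypothetical helper name)
count_with_grep() {
    while IFS= read -r word; do
        # skip empty lines: an empty pattern would match every line
        [ -n "$word" ] || continue
        # -F: fixed string, -o: one output line per match, -e: safe for words starting with "-"
        hits=$( { grep -oF -e "$word" "$2" || true; } | wc -l | tr -d ' ')
        # -c: number of matching lines (prints 0 when there are none)
        lines=$(grep -cF -e "$word" "$2" || true)
        printf '%s\t%s\t%s\n' "$word" "$hits" "$lines"
    done < "$1"
}
```

Call it as `count_with_grep words text`. It still scans the text twice per word, so for a large word list the awk or Perl approaches are likely to be faster.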

Upvotes: 4
