Reputation: 2121

Finding the frequency of an expression in all files of a directory

I am trying to write a shell script that will search for a regular expression in each of the files in the current directory without using temp files.

Originally, I did this by storing the output of echo * | sed 's/ /\n/g' in a temp file, then looping through each line of that file, running cat on each named file, grepping for my expression, and counting the lines of output. The temp files themselves were getting searched, so I was wondering whether I could do everything with variables or some other method that avoids temp files (I don't really want to create a separate directory for the temp files either).

The problem I was having with variables was that after I had set the value of the variable to the output of echo * | sed 's/ /\n/g', I didn't know how to loop through each line so I could get the expression count from the files.

I just want the following to work (where I hardcode the expression):

% ls
file1 file2 file3
% ./countMost.sh
file2(28)
% ls
file1 file2 file3

signifying that file2 has the most instances of the expression (28 of them).

Upvotes: 1

Answers (3)

theon

Reputation: 14390

This should give you the top ten most common lowercase words (you can change the regex to whatever you need), with counts, across a bunch of files inside a dir called test.

grep -rhoE "[a-z]+" test | sort | uniq -c | sort -r | head
      3 test
      2 wow
      2 what
      2 oh
      2 foo
      2 bar
      1 ham

If you want the counts broken down by filename, remove the h flag from grep:

  grep -roE "[a-z]+" test | sort | uniq -c | sort -r | head
      3 test/2:test
      1 test/2:wow
      1 test/2:what
      1 test/2:oh
      1 test/2:foo
      1 test/2:bar
      1 test/1:wow
      1 test/1:what
      1 test/1:oh
      1 test/1:ham
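To get a single total per file rather than a count per word, one option is to keep only the filename part of each match before counting. The following is a minimal sketch, not a tested drop-in: the function name most_occurrences is made up for illustration, and it assumes filenames contain no colons or newlines. Note that it counts every occurrence of the regex, whereas grep -c counts matching lines.

```shell
# Hypothetical helper (name is illustrative): print "count path" for the file
# under directory $2 with the most occurrences of extended regex $1.
# Assumes no file name contains a colon or a newline.
most_occurrences() {
    regex=$1
    dir=$2
    # -o prints each match on its own line, -H prefixes it with "path:".
    # cut keeps only the path, so uniq -c counts matches per file.
    grep -rHoE -e "$regex" -- "$dir" |
        cut -d: -f1 |
        sort |
        uniq -c |
        sort -rn |
        head -n 1
}
```

Calling most_occurrences '[a-z]+' test would then print the highest count followed by the path of the file it belongs to.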

Upvotes: 0

Teudimundo

Reputation: 2670

A similar version of Jon Lin's solution uses sort arguments instead of sed:

grep -c -e "^d" file* | sort -n -k2 -t: -r | head -n 1

(here I look for lines starting with a 'd')
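If you want the file2(28) format from the question, the winning line can be reshaped with awk. This is only a sketch: count_most is a hypothetical name, and it assumes no filename contains a colon, since the colon is used as the field separator.

```shell
# Hypothetical wrapper (name is illustrative): print "name(count)" for the
# file with the most lines matching regex $1. Remaining arguments are files.
# Assumes no file name contains a colon.
count_most() {
    regex=$1
    shift
    # -H forces the "file:" prefix even when only one file is given.
    grep -cH -e "$regex" -- "$@" |
        sort -t: -k2,2nr |
        head -n 1 |
        awk -F: '{ printf "%s(%s)\n", $1, $2 }'
}
```

For example, count_most '^d' file* prints the name of the file with the most matching lines, followed by that count in parentheses.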

Upvotes: 1

Jon Lin

Reputation: 143906

You can try something like this:

grep -c regex files | sed -e 's/^\(.*\):\(.*\)$/\2 \1/' | sort -r -n | head -n 1

Here regex is your regular expression (you can use egrep, i.e. grep -E, as well) and files is your list of files.

Given 3 files:

file1:
qwe
qwe
qwe
asd
zxc

file2:
qwe
asd
zxc

file3:
asd
qwe
qwe
qwe
qwe

and I run:

grep -c 'qwe' file[1-3] | sed -e 's/^\(.*\):\(.*\)$/\2 \1/' | sort -r -n

I get the output:

4 file3
3 file1
1 file2

Adding | head -n 1 at the end leaves only the top line:

4 file3
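As for looping over the files without a temp file or a variable holding the output of echo *, the shell can iterate over the glob directly. Below is a rough sketch of that approach; count_most_loop is a made-up name, and it takes the regex as an argument where your script would hardcode it. It prints the result in the name(count) form from the question.

```shell
# Hypothetical function (name is illustrative): scan regular files in the
# current directory and print "name(count)" for the one with the most lines
# matching regex $1. No temp files are created.
count_most_loop() {
    regex=$1
    best_file=''
    best_count=-1
    for f in *; do
        # Skip directories and anything else that is not a regular file.
        [ -f "$f" ] || continue
        # grep -c prints the number of matching lines; it exits non-zero
        # when there are none, so fall back to 0 in that case.
        n=$(grep -c -e "$regex" -- "$f") || n=0
        if [ "$n" -gt "$best_count" ]; then
            best_count=$n
            best_file=$f
        fi
    done
    if [ -n "$best_file" ]; then
        printf '%s(%s)\n' "$best_file" "$best_count"
    fi
}
```

Because the glob is expanded by the shell, filenames containing spaces are handled correctly, which splitting the output of echo * on spaces would not do. On a tie, the first file in glob order wins.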

Upvotes: 2
