user3624000

Reputation: 311

Shell script to show frequency of each word in file and in a directory

I came across this question in an interview:

Shell script to show frequency of each word in file and in a directory

A
    - A1
        - File1.txt
        - File2.txt
    - A2
        - FileA21.txt
    - A3
        - FileA31.txt
        - FileA32.txt
B
    - B1
        - FileB11.txt
        - FileB12.txt
        - FileB13.txt
    - B2
        - FileB21.txt
I understood the question to mean that A and B are two separate directories, with A1, A2 and A3 being sub-directories of A, and B1 and B2 being sub-directories of B. So I answered like this:

find . '\(-name "A" -and -name "B"\)' -type f -exec cat '{}' \; | awk '{c[$1]++} END {for (i in c) print i, c[i]}'

But I still got feedback that the above script was not good enough. What's wrong with it?

Upvotes: 5

Views: 545

Answers (1)

Filipe Gonçalves

Reputation: 21213

The major limitation is that the script assumes there is exactly one word per line. c[$1]++ just increments the count for the first field of each line, so every other word on the line is ignored.

The question didn't mention anything about the number of words in a line, so I'd assume this wasn't the intention - you need to go through each word in a line. Also, what about empty lines? With an empty line, $1 will be the empty string, so your script will end up counting "empty" words (which it will happily show as part of the output).
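To see both problems concretely, here is a small sketch. The sample input and the function names sample and first_field_counts are made up for illustration; the brackets are only there to make the empty "word" visible, and the sort is there because awk's for-in order is unspecified.

```shell
# Hypothetical sample input: three words on the first line,
# an empty line, then a single word.
sample() { printf 'foo bar foo\n\nbar\n'; }

# The approach from the question: only the first field ($1) of each
# line is counted, and an empty line contributes an empty "word".
first_field_counts() {
    awk '{ c[$1]++ } END { for (i in c) print "[" i "]", c[i] }' |
        LC_ALL=C sort
}

sample | first_field_counts
# Output:
# [] 1
# [bar] 1
# [foo] 1
```

Note that foo actually appears twice and bar twice in the input, yet each is reported once, and the empty line shows up as its own entry.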

In awk, the number of fields in a line is stored in the built-in variable NF; thus it is easy to write code to loop through the words and increment the corresponding count (and it has the nice side effect of implicitly ignoring lines without words).

So, I would do something like this instead:

find . -type f -exec cat '{}' \; | awk '{ for (i = 1; i <= NF; i++) w[$i]++ } END { for (i in w) printf("%-10s %10d\n", i, w[i]) }'

I removed the directory-name constraints from the arguments to find(1) for the sake of conciseness, and to make the command more general.
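If the search really should be limited to the directories A and B, one option is to pass them to find as starting points instead of matching on names. This is only a sketch under that assumption; the function name word_freq_in_dirs is made up, and it uses -exec ... {} + so that cat is run on batches of files rather than once per file.

```shell
# Count words in all regular files under the given directories.
# Assumption: the directories of interest are passed as arguments,
# e.g.  word_freq_in_dirs A B
word_freq_in_dirs() {
    # find accepts several starting points; "$@" expands to all of them.
    find "$@" -type f -exec cat {} + |
        awk '{ for (i = 1; i <= NF; i++) w[$i]++ }
             END { for (i in w) printf("%-10s %10d\n", i, w[i]) }'
}
```

Run from the parent of A and B, word_freq_in_dirs A B would descend into A1, A2, A3, B1 and B2 and ignore any sibling directories. The output order is whatever awk's for-in loop yields, so pipe it through sort if a stable order matters.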

This is (probably) the main issue with your solution, but the question is (intentionally) vague and there are many things left to discuss:

  • Is it case-sensitive? This solution treats World and world as different words. Is this desired?
  • What about punctuation? Should hello and hello! be treated as the same word? What about commas? That is, do we need to parse and ignore punctuation?
  • Speaking of which - what about things like what's vs. what? Do we consider them different words? And it's vs. its? English is tricky!
  • Most important of all (and related to the points above), what exactly defines a word? We assumed a word is a sequence of non-blanks (the default in awk). Is this accurate?
  • If there are no words in the input, what do we do? This solution prints nothing - maybe we should print a warning message?
  • Is there a fixed number of words in a line? Or is it arbitrary? (E.g. if there's exactly one word per line, your solution would be enough)
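To make a few of these questions concrete, here is a sketch of one possible set of answers. The assumptions are mine, not part of the original question: words are compared case-insensitively, a word is a run of letters, digits or apostrophes (so hello and hello! match, while what's and what stay distinct), and empty input produces a warning on stderr. The function name normalized_word_freq is made up.

```shell
# Read text on stdin and print a frequency table of normalized words.
# Assumed definition of a word: a maximal run of alphanumerics or
# apostrophes, lowercased. Other definitions are equally defensible.
normalized_word_freq() {
    tr '[:upper:]' '[:lower:]' |          # fold case: World -> world
        tr -cs "[:alnum:]'" '\n' |        # everything else becomes a line break
        awk 'NF { w[$0]++; n++ }          # skip blank lines, count the rest
             END {
                 if (n == 0) {
                     print "no words found" > "/dev/stderr"
                     exit 1
                 }
                 for (i in w) printf("%-10s %10d\n", i, w[i])
             }'
}

# Example usage (output order is unspecified; pipe through sort if needed):
#   find . -type f -exec cat {} + | normalized_word_freq
```

With this definition, the input Hello, hello! World what's counts hello twice and world and what's once each. Whether that is the "right" behavior is exactly the kind of thing worth asking the interviewer.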

FWIW, always remember that your success in an interview is not a binary yes/no. It's not like: Oops, you can't do X, so I'm going to reject you. Or: Oops, wrong answer, you're out. More important than the answer is the process that gets you there, and whether or not you are aware of (a) the assumptions you made; and (b) your solution's limitations. The questions above show an ability to consider edge cases and to clarify assumptions and requirements, which is way more important than getting the "right" script (and there's probably no such thing as The Right Script).

Upvotes: 4
