Parth Parikh

Reputation: 260

Compare occurrences of a set of words

I have a text file with random words in it. I want to find out which pair of words has the maximum number of occurrences (for example 'hi,hello' or 'good,bye').

simple.txt

hi there. hello this a dummy file. hello world. you did good job. bye for now.

I have written this command to get the count for each word (hi, hello, good, bye):

cat simple.txt| tr -cs '[:alnum:]' '[\n*]' | sort | uniq -c|grep -E -i  "\<hi\>|\<hello\>|\<good\>|\<bye\>"

This gives me each word together with the number of times it occurs in the file. How can I refine this to get a direct output such as "hi/hello is the pair with maximum occurrence"?

Upvotes: 1

Views: 99

Answers (2)

John1024

Reputation: 113884

To make it more interesting, let's consider this test file:

$ cat >file.txt
You say hello.  I say good bye.  good bye. good bye.

To get a count of all pairs of words:

$ awk -v RS='[[:space:][:punct:]]+' 'NR>1{a[last","$0]++} {last=$0} END{for (pair in a) print a[pair], pair}' file.txt
3 good,bye
1 say,good
2 bye,good
1 I,say
1 You,say
1 hello,I
1 say,hello

To get the single pair with the highest count, we need to sort:

$ awk -v RS='[[:space:][:punct:]]+' 'NR>1{a[last","$0]++} {last=$0} END{for (pair in a) print a[pair], pair}' file.txt | sort -nr | head -1
3 good,bye

How it works

  • -v RS='[[:space:][:punct:]]+'

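    This tells awk to treat any run of whitespace or punctuation as the record separator, so each word becomes its own record. (A regular-expression RS is supported by gawk and mawk, but POSIX leaves the behavior of a multi-character RS unspecified.)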

  • NR>1{a[last","$0]++}

    For every word after the first, increment the count in associative array a for the combination of the previous and current word.

  • last=$0

    Save the current word in the variable last.

  • END{for (pair in a) print a[pair], pair}

    After we have finished reading the input, print out the results for each pair.

  • sort -nr

    Sort the output numerically in reverse (highest number first) order.

  • head -1

    Select the first line (giving us the pair with the highest count).

Multiline version

For those who prefer their code spread out over multiple lines:

awk -v RS='[[:space:][:punct:]]+' '
    NR>1 {
        a[last","$0]++
    }

    {
        last=$0
    }

    END {
        for (pair in a)
            print a[pair], pair
    }' file.txt | sort -nr | head -1
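
As a variant, the maximum can also be tracked inside awk, which avoids the sort and head steps. The following is only a sketch under some assumptions: the function name top_pair is made up for illustration, words are split with tr (as in the question) instead of a regular-expression RS so that it does not depend on a particular awk, and when several pairs tie for the top count, which one is printed is unspecified.

```shell
# top_pair FILE: print "COUNT WORD1,WORD2" for the most frequent adjacent pair.
# (top_pair is a hypothetical helper name, not part of the original answer.)
top_pair() {
    # One word per line: every run of non-alphanumerics becomes a newline.
    tr -cs '[:alnum:]' '[\n*]' < "$1" |
    awk '
        $0 == "" { next }                  # skip a possible leading blank line
        seen     { count[last "," $0]++ }  # count (previous word, current word)
                 { last = $0; seen = 1 }
        END {
            # Keep the pair with the largest count seen so far.
            for (pair in count)
                if (count[pair] > max) { max = count[pair]; best = pair }
            if (max) print max, best
        }'
}

# Usage with the test sentence from above:
tmp=$(mktemp)
printf 'You say hello.  I say good bye.  good bye. good bye.\n' > "$tmp"
top_pair "$tmp"    # prints: 3 good,bye
rm -f "$tmp"
```

For large inputs this keeps only the running maximum instead of sorting every pair, at the cost of not showing the runners-up.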

Upvotes: 3

glenn jackman

Reputation: 246992

Some terse Perl:

perl -MList::Util=max,sum0 -slne '
    for $word (m/(\w+)/g) {$count{$word}++}
 } END {
    $pair{$_} = sum0 @count{+split} for ($a, $b);
    $max = max values %pair;
    print "$max => ", {reverse %pair}->{$max};
' -- -a="hi hello" -b="good bye" simple.txt
3 => hi hello
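
The same idea (sum the per-word counts for each candidate pair, then report the pair with the largest sum) can be sketched with tr and awk. This is a rough equivalent, not a translation of the Perl: the function name best_pair is hypothetical, each pair is passed as one space-separated argument, the pair strings are assumed to contain no commas or backslashes, and words are lowercased first on the assumption that matching should be case-insensitive (as with grep -i in the question).

```shell
# best_pair FILE "w1 w2" ["w3 w4" ...]: print "SUM => PAIR" for the pair
# whose words occur most often in FILE. (best_pair is a hypothetical name.)
best_pair() {
    file=$1; shift
    # Join the remaining arguments with commas, e.g. "hi hello,good bye".
    pairs=$(IFS=,; printf '%s' "$*")
    tr -cs '[:alnum:]' '[\n*]' < "$file" | tr '[:upper:]' '[:lower:]' |
    awk -v pairs="$pairs" '
        { count[$0]++ }                    # per-word counts
        END {
            n = split(pairs, p, ",")
            for (i = 1; i <= n; i++) {
                total = 0
                m = split(p[i], w, " ")    # words of this candidate pair
                for (j = 1; j <= m; j++) total += count[w[j]]
                if (total > max) { max = total; best = p[i] }
            }
            if (n) print max + 0, "=>", best
        }'
}

# Usage with the sample text from the question:
tmp=$(mktemp)
printf 'hi there. hello this a dummy file. hello world. you did good job. bye for now.\n' > "$tmp"
best_pair "$tmp" "hi hello" "good bye"    # prints: 3 => hi hello
rm -f "$tmp"
```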

Upvotes: 1
