Isak

Reputation: 545

Count strings in a file, some single words, some full sentences

I want to count the occurrences of certain words and names in a file. The code below incorrectly counts 'fish and chips' as one case of 'fish' and one case of 'chips', instead of one count of 'fish and chips'.

ngh.txt contains: 'test file with words fish, steak fish chips fish and chips'

import re
from collections import Counter
wanted = '''
"fish and chips"
fish
chips
steak
'''
cnt = Counter()
words = re.findall(r'\w+', open('ngh.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print cnt

Output:

Counter({'fish': 3, 'chips': 2, 'and': 1, 'steak': 1})

What I want is:

Counter({'fish': 2, 'fish and chips': 1, 'chips': 1, 'steak': 1})

(And ideally, I can get the output like this:

fish: 2
fish and chips: 1
chips: 1
steak: 1

)
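A longest-match scan over the word list, trying the longest wanted phrase first at each position, gives these counts and can print them in that style. A rough sketch (the greedy longest-first policy is an assumption, one choice among several):

```python
import re

def count_phrases(text, wanted):
    # Greedy longest-match scan: at each word position try the longest
    # wanted phrase first, so 'fish and chips' wins over 'fish'.
    # Assumes wanted phrases are single-spaced, so join() rebuilds the key.
    words = re.findall(r'\w+', text.lower())
    phrases = sorted((w.split() for w in wanted), key=len, reverse=True)
    counts = dict.fromkeys(wanted, 0)
    i = 0
    while i < len(words):
        for phrase in phrases:
            if words[i:i + len(phrase)] == phrase:
                counts[' '.join(phrase)] += 1
                i += len(phrase)
                break
        else:
            i += 1  # no wanted phrase starts here
    return counts

text = 'test file with words fish, steak fish chips fish and chips'
counts = count_phrases(text, ['fish and chips', 'fish', 'chips', 'steak'])
for phrase, n in counts.items():
    print('%s: %d' % (phrase, n))
```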

Upvotes: 1

Views: 81

Answers (3)

Jitendra Kulkarni

Reputation: 833

I am suggesting two algorithms that will work on any patterns and any file. The first algorithm has a run time proportional to (number of characters in the file) × (number of patterns).

1. For every pattern, search all the other patterns and build a list of its super-patterns. This can be done by matching one pattern, such as 'cat', against all the patterns to be searched.

patterns = ['cat', 'cat and dogs', 'cat and fish']
superpattern['cat']  = ['cat and dogs', 'cat and fish']

2. Search for 'cat' in the file; say the result is cat_count.

3. Now search for every super-pattern of 'cat' in the file, get its count, and subtract it from cat_count:

for sp in superpattern['cat']:
    sp_count = count of sp in file
    cat_count = cat_count - sp_count
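Steps 2 and 3 above can be sketched in Python like this (a rough sketch; the function name and the word-boundary regex matching are my assumptions):

```python
import re

def count_with_subtraction(text, patterns):
    # Step 2: count every pattern independently, using word boundaries so
    # 'fish' does not match inside 'fishing'.
    counts = {p: len(re.findall(r'\b%s\b' % re.escape(p), text))
              for p in patterns}
    # Step 3: process longest patterns first, so each super-pattern's count
    # is final before it is subtracted from the patterns it contains.
    for p in sorted(patterns, key=len, reverse=True):
        for q in patterns:
            if q == p:
                continue
            # How many times does q occur inside the super-pattern p?
            inside = len(re.findall(r'\b%s\b' % re.escape(q), p))
            counts[q] -= inside * counts[p]
    return counts
```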

This is a general brute-force solution. It should be possible to come up with a linear-time solution by arranging the patterns in a trie: Root-->f-->i-->s-->h-->a and so on. Now when you are at the h of fish and you do not get an a, increment fish_count and go to the root. If you get an 'a', continue. Any time you get something unexpected, increment the count of the most recently found pattern and go to the root, or to some other node (the node for the longest prefix that is a suffix of the current match). This is the Aho-Corasick algorithm; you can look it up on Wikipedia or at http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf

This solution is linear in the number of characters in the file.
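The trie walk described above can be sketched like this (a simplified sketch with made-up names; it restarts at the next word after each match instead of following Aho-Corasick failure links, so it is not truly linear-time):

```python
def build_trie(patterns):
    # Word-level trie as nested dicts; '$' marks where a pattern ends.
    root = {}
    for pat in patterns:
        node = root
        for word in pat.split():
            node = node.setdefault(word, {})
        node['$'] = pat
    return root

def count_longest(words, patterns):
    # From each position, walk the trie as deep as possible, count the
    # longest pattern that ended along the way, then resume after it.
    root = build_trie(patterns)
    counts = dict.fromkeys(patterns, 0)
    i = 0
    while i < len(words):
        node, j, found = root, i, None
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if '$' in node:
                found = (node['$'], j)  # longest match so far, ends at j
        if found:
            counts[found[0]] += 1
            i = found[1]
        else:
            i += 1
    return counts
```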

Upvotes: 1

Dave F

Reputation: 1985

Definition:

Wanted item: A string that is being searched for within the text.

To count wanted items, without re-counting them within longer wanted items, first count the number of times each one occurs within the string. Next, go through the wanted items, from longest to shortest, and as you encounter smaller wanted items that occur in a longer wanted item, subtract the number of results for the longer item from the shorter item.

For example, assume your wanted items are "a", "a b", and "a b c", and your text is "a/a/a b/a b c". Searching for each of those individually produces: { "a": 4, "a b": 2, "a b c": 1 }. The desired result is:

{ "a b c": 1,
  "a b": #("a b") - #("a b c") = 2 - 1 = 1,
  "a":   #("a") - #("a b c") - #("a b") = 4 - 1 - 1 = 2 }

import re

def get_word_counts(text, wanted):
    counts = {} # The number of times each wanted item was found

    # Dictionary mapping word lengths onto wanted items
    #  (in the form of a dictionary where keys are wanted items)
    lengths = {}

    # Find the number of times each wanted item occurs
    for item in wanted:
        matches = re.findall(r'\b' + item + r'\b', text)

        counts[item] = len(matches)

        l = len(item) # Length of wanted item

        # No wanted item of the same length has been encountered
        if (l not in lengths):
            # Create new dictionary of items of the given length
            lengths[l] = {}

        # Add wanted item to dictionary of items with the given length
        lengths[l][item] = 1

    # Get and sort lengths of wanted items from largest to smallest
    keys = lengths.keys()
    keys.sort(reverse=True)

    # Remove overlapping wanted items from the counts working from
    #  largest strings to smallest strings
    for i in range(1,len(keys)):
        for j in range(0,i):
            for i_item in lengths[keys[i]]:
                for j_item in lengths[keys[j]]:
                    #print str(i)+','+str(j)+': '+i_item+' , '+j_item
                    matches = re.findall(r'\b' + i_item + r'\b', j_item)

                    counts[i_item] -= len(matches) * counts[j_item]

    return counts

The following code contains test cases:

tests = [
    {
        'text': 'test file with words fish, steak fish chips fish and '+
            'chips and fries',
        'wanted': ["fish and chips","fish","chips","steak"]
    },
    {
        'text': 'fish, fish and chips, fish and chips and burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'fish, fish and chips and burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'My fish and chips and burgers. My fish and chips and '+
            'burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'fish fish fish',
        'wanted': ["fish fish","fish"]
    },
    {
        'text': 'fish fish fish',
        'wanted': ["fish fish","fish","fish fish fish"]
    }
]

for i in range(0,len(tests)):
    test = tests[i]['text']
    print test
    print get_word_counts(test, tests[i]['wanted'])
    print ''

The output is as follows:

test file with words fish, steak fish chips fish and chips and fries
{'fish and chips': 1, 'steak': 1, 'chips': 1, 'fish': 2}

fish, fish and chips, fish and chips and burgers
{'fish and chips': 1, 'fish and chips and burgers': 1, 'fish': 1}

fish, fish and chips and burgers
{'fish and chips': 0, 'fish and chips and burgers': 1, 'fish': 1}

My fish and chips and burgers. My fish and chips and burgers
{'fish and chips': 0, 'fish and chips and burgers': 2, 'fish': 0}

fish fish fish
{'fish fish': 1, 'fish': 1}

fish fish fish
{'fish fish fish': 1, 'fish fish': 0, 'fish': 0}

Upvotes: 1

Stidgeon

Reputation: 2723

So this solution works with your test data (and with some added terms to the test data, just to be thorough), though it can probably be improved upon.

The crux of it is to find occurrences of 'and' in the words list, replace 'and' and its neighbours with a compound word (concatenating the neighbours with 'and'), and add this back to the list, along with a copy of 'and'.

I also converted the 'wanted' string to a list to handle the 'fish and chips' string as a distinct item.

import re
from collections import Counter

# changed 'wanted' string to a list
wanted = ['fish and chips','fish','chips','steak', 'and']

cnt = Counter()

words = re.findall(r'\w+', open('ngh.txt').read().lower())

for word in words:

    # look for 'and', replace it and neighbours with 'comp_word'
    # slice, concatenate, and append to make new words list

    if word == 'and':
        and_pos = words.index('and')
        comp_word = words[and_pos-1] + ' and ' + words[and_pos+1]
        words = words[:and_pos-1] + words[and_pos+2:]
        words.append(comp_word)
        words.append('and')

for word in words:
    if word in wanted:
        cnt[word] += 1

print cnt

The output from your text would be:

Counter({'fish':2, 'and':1, 'steak':1, 'chips':1, 'fish and chips':1})

As noted in the comment above, it's unclear why you would expect counts of 2 for fish, 2 for chips, and 1 for fish and chips in your ideal output. I'm assuming the 2 for chips is a typo, since the output above it has 'chips': 1.

Upvotes: 1
