For each bigram in list, print number of times it appears in other lists - python NLTK

Question

I am new to coding and could use help. Here is my task: I have a CSV of online marketing image titles. It is a single column, and each cell holds the title text for one ad, which is just a string of words. For instance, cell A1 reads "16 Maddening Tire Fails", and so on. To load the CSV I do:

import csv

# Assuming Python 3: open in text mode with newline='' (Python 2 used 'rb').
with open('usethis.csv', newline='') as f:
    mycsv = list(csv.reader(f))

I initialize a list:

mylist = []

I want to take the text in each cell and extract the bigrams. I do that as follows:

import nltk
from nltk.tokenize import word_tokenize
for c in mycsv:  # each row is a list holding the single cell's text
    mylist.append(list(nltk.bigrams(word_tokenize(' '.join(c)))))

mylist then looks like this, but with more data:

[[('16', 'Maddening'), ('Maddening', 'Tire'), ('Tire', 'Fails')], [('16', 'Maddening'), ('Maddening', 'Tire'), ('Tire', 'Fails'), ('Fails', 'That'), ('That', 'Show'), ('Show', 'What'), ('What', 'True'), ('True', 'Negligence'), ('Negligence', 'Looks'), ('Looks', 'Like')]]

mylist holds individual lists which are the bigrams created from each cell in my csv.

Now I want to loop through every bigram in all lists and, next to each bigram, print the number of times it appears in the other lists (cells). This is basically the same as a COUNTIFS in Excel. For instance, if the bigram ('16', 'Maddening') in the first list (cell A1) appears 3 other times in mylist, then print the number 3 next to it, and so on for each bigram. If it is easier to return this information in a new list, that's fine; I just want it printed out somewhere that makes sense.

I have done a lot of reading online. For instance, this question was along the same general lines: How to check if all elements of a list matches a condition?

This question about dictionaries was also similar, in that it returns a number next to each value, just as I want to return a count next to each bigram: What are Python dictionary view objects?

But I really am at a loss as to how to do this. Thank you so much in advance for your help! Let me know if I need to explain something better.

lenz · Accepted Answer

You can use collections.Counter for this task. Since you are already using NLTK, FreqDist and its derived classes might come in handy when you want to do more than just counting, but for now let's stick with the simpler Counter.

Counter is a subclass of dict, i.e. it can do everything a dictionary can, but it has additional functionality.
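To see what that means in practice, here is a minimal sketch. The name `counts` and the sample bigrams are only for illustration:

```python
from collections import Counter

counts = Counter()

# A Counter is a dict, so the usual dictionary operations work.
print(isinstance(counts, dict))        # True

# Unlike a plain dict, a missing key yields 0 instead of raising KeyError,
# which is why "+= 1" works without initialising the key first.
counts[('16', 'Maddening')] += 1
counts[('16', 'Maddening')] += 1
print(counts[('16', 'Maddening')])     # 2
print(counts[('Tire', 'Fails')])       # 0 (never seen)

# Looking up a missing key does not add it to the counter.
print(('Tire', 'Fails') in counts)     # False
```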

The following snippet extends the code you showed:

from collections import Counter

bigram_counts = Counter()
for cell in mylist:
    for bigram in cell:
        bigram_counts[bigram] += 1

After this, you can look up individual bigrams with a subscript, e.g. bigram_counts['16', 'Maddening'] will return 3 or whatever the actual count was. With bigram_counts.most_common(5) you get the 5 most frequent bigrams.
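For example, with a small made-up mylist (sample data standing in for your real CSV), the lookups could look like this:

```python
from collections import Counter

# Made-up sample data standing in for the bigram lists built from the CSV.
mylist = [
    [('16', 'Maddening'), ('Maddening', 'Tire'), ('Tire', 'Fails')],
    [('16', 'Maddening'), ('Maddening', 'Tire'), ('Tire', 'Fails'), ('Fails', 'That')],
    [('16', 'Maddening'), ('Maddening', 'Road')],
]

bigram_counts = Counter()
for cell in mylist:
    for bigram in cell:
        bigram_counts[bigram] += 1

# A tuple key can be written with or without the extra parentheses.
print(bigram_counts['16', 'Maddening'])      # 3
print(bigram_counts[('Tire', 'Fails')])      # 2

# The most frequent bigrams, as a list of (bigram, count) pairs.
print(bigram_counts.most_common(2))
```

Note that these are totals over all cells, so a bigram's own cell is included in the count.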

Update

... to actually answer the specific problem in your question.

In order to know the number of occurrences in all but one cell, you need to have separate counters for each cell. Replace the previous snippet with the following:

# Populate n+1 counters.
bigram_totals = Counter()
separate_counters = []
for cell in mylist:
    bigram_current = Counter()
    separate_counters.append(bigram_current)
    for bigram in cell:
        bigram_totals[bigram] += 1
        bigram_current[bigram] += 1

# Look up all bigram counts.
for cell, bigram_current in zip(mylist, separate_counters):
    for bigram in cell:
        count = bigram_totals[bigram] - bigram_current[bigram]
        print(bigram, count)  # or append to a list, write to a file, ...

So, in addition to the total counts, we have a separate counter for each cell. When doing a lookup, we subtract the local count from the global count to get the sum of occurrences everywhere else.
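Here is the same idea as a small worked example. The function name `count_elsewhere` and the sample data are made up for illustration; the logic is the subtraction described above:

```python
from collections import Counter

def count_elsewhere(cells):
    """For each cell, pair every bigram with its count in all other cells."""
    totals = Counter()
    per_cell = []
    for cell in cells:
        current = Counter()
        per_cell.append(current)
        for bigram in cell:
            totals[bigram] += 1
            current[bigram] += 1
    # Global count minus local count = occurrences everywhere else.
    return [
        [(bigram, totals[bigram] - current[bigram]) for bigram in cell]
        for cell, current in zip(cells, per_cell)
    ]

# Made-up sample data; the third cell repeats a bigram on purpose.
sample = [
    [('16', 'Maddening'), ('Maddening', 'Tire')],
    [('16', 'Maddening'), ('Tire', 'Fails')],
    [('16', 'Maddening'), ('16', 'Maddening')],
]

for row in count_elsewhere(sample):
    print(row)
# [(('16', 'Maddening'), 3), (('Maddening', 'Tire'), 0)]
# [(('16', 'Maddening'), 3), (('Tire', 'Fails'), 0)]
# [(('16', 'Maddening'), 2), (('16', 'Maddening'), 2)]
```

The third cell shows why the local counter is needed: the bigram occurs twice there, so both of those occurrences are subtracted from the total of 4.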

Btw, since you mentioned learning purposes, the first block (the one that populates the counters) can be written a bit shorter by taking advantage of special Counter features:

# Populate n+1 counters.
bigram_totals = Counter()
separate_counters = []
for cell in mylist:
    bigram_current = Counter(cell)
    separate_counters.append(bigram_current)
    bigram_totals.update(bigram_current)

I think this is a bit more elegant, but it might be harder to understand for a beginner. Decide for yourself which version you think is more readable.
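If you want to convince yourself that both versions build the same counters, a quick check on made-up sample data could look like this (the variable names with `_loop` and `_short` suffixes are just for this comparison):

```python
from collections import Counter

# Made-up sample data (not from the original CSV).
mylist = [
    [('16', 'Maddening'), ('Maddening', 'Tire')],
    [('16', 'Maddening'), ('16', 'Maddening')],
]

# Version 1: explicit loops.
totals_loop = Counter()
separate_loop = []
for cell in mylist:
    current = Counter()
    separate_loop.append(current)
    for bigram in cell:
        totals_loop[bigram] += 1
        current[bigram] += 1

# Version 2: Counter(iterable) counts the items, and Counter.update() adds
# counts together (unlike dict.update, which would overwrite them).
totals_short = Counter()
separate_short = []
for cell in mylist:
    current = Counter(cell)
    separate_short.append(current)
    totals_short.update(current)

print(totals_loop == totals_short)        # True
print(separate_loop == separate_short)    # True
```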
