rookie

Reputation: 63

Find word frequency with respect to word occurrence in documents

I need the number of documents in which a particular word occurs.

Example:

data = ["This is my pen","That is his pen","This is not my pen"]

desired output:

{'This':2,'is': 3,'my': 2,'pen':3}
{'That':1,'is': 3,'his': 1,'pen':3}
{'This':2,'is': 3,'not': 1,'my': 2,'pen':3}

from collections import Counter

for sent in data:
    for word in sent.split():
        if word in sent:
            windoc = dict(Counter(sent.split()))
            print(windoc)

Upvotes: 0

Views: 131

Answers (3)

olivecoder

Reputation: 2914

Assuming a word should be counted at most once per document:

import collections

data = ["This is my pen my pen my pen","That is his pen","This is not my pen"]
deduped = [set(d.split()) for d in data]
freq = collections.Counter(w for d in deduped for w in d)
result = [{w: freq[w] for w in d} for d in deduped]

You need to deduplicate the words within each document first (see deduped above). deduped has to be a list rather than a generator because it is iterated twice: once to build freq and once to build result. A generator would be exhausted after the first pass, leaving result empty. The cost is one intermediate set of words per document.

Alternatively, you could implement your own counter. That isn't a good idea in general, but it may be required if memory consumption is critical and you want to avoid keeping the intermediate sets in deduped.

Either way, time and memory are linear in the total number of words.

Output:

[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
 {'That': 1, 'his': 1, 'is': 3, 'pen': 3},
 {'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]
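
For the "implement your own counter" alternative mentioned above, here is a minimal sketch of one way it could look. It is an assumption on my part, not a tested recommendation: instead of building a set per document, it remembers the index of the last document that counted each word. The names document_frequency and last_doc are made up for this example. Note that it swaps the per-document sets for a second dictionary the size of the vocabulary, so whether it actually saves memory depends on your data.

```python
def document_frequency(docs):
    """Count, for each word, how many documents contain it."""
    freq = {}      # word -> number of documents containing it
    last_doc = {}  # word -> index of the last document that counted it
    for i, doc in enumerate(docs):
        for word in doc.split():
            # Count the word only the first time it shows up in document i.
            if last_doc.get(word) != i:
                last_doc[word] = i
                freq[word] = freq.get(word, 0) + 1
    return freq


data = ["This is my pen my pen my pen", "That is his pen", "This is not my pen"]
freq = document_frequency(data)

# Repeated words in a document simply overwrite the same key.
result = [{w: freq[w] for w in doc.split()} for doc in data]
```

With the data above, result[0] should again be {'This': 2, 'is': 3, 'my': 2, 'pen': 3}.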

Upvotes: 2

ncica

Reputation: 7206

from collections import Counter

data = ["This is my pen is is","That is his pen pen pen pen","This is not my pen"]

d = []
for s in data:
    for word in set(s.split()):
        d.append(word)

wordCount = Counter(d)
for item in data:
    result = {}
    for word in item.split():
        result[word] = wordCount[word]
    print(result)

Output:

{'This': 2, 'is': 3, 'my': 2, 'pen': 3}
{'That': 1, 'is': 3, 'his': 1, 'pen': 3}
{'This': 2, 'is': 3, 'not': 1, 'my': 2, 'pen': 3}
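
The set(...) call is what turns this into a per-document count. A quick comparison on a made-up input with a repeated word (docs here is hypothetical, not the data from the question) shows the difference:

```python
from collections import Counter

docs = ["pen pen pen", "my pen"]  # hypothetical input with a repeated word

# Counts every occurrence of a word across all documents.
total_occurrences = Counter(w for doc in docs for w in doc.split())

# Counts each word at most once per document.
document_counts = Counter(w for doc in docs for w in set(doc.split()))

print(total_occurrences["pen"])  # 4
print(document_counts["pen"])    # 2
```

Without set(...), "pen" would be reported as 4 even though only two documents contain it.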

Upvotes: 1

Mohamed Ali JAMAOUI

Reputation: 14689

You can first build a dictionary that maps each word to the number of sentences it appears in, then use it to construct the desired output. Here's a working example:

Given the input documents:

In [1]: documents 
Out[1]: ['This is my pen', 'That is his pen', 'This is not my pen']

Construct the word frequency dictionary:

In [2]: d = {}
    ...: for sent in documents:
    ...:     for word in set(sent.split()):    
    ...:         d[word] = d.get(word, 0) + 1
    ...: 

Then construct the desired output:

In [3]: result = []
    ...: for sent in documents:
    ...:     result.append({word: d[word] for word in sent.split()})
    ...:     

In [4]: result 
Out[4]: 
[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
 {'That': 1, 'his': 1, 'is': 3, 'pen': 3},
 {'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]

So, overall, the code looks like this:

documents = ['This is my pen', 'That is his pen', 'This is not my pen']
d = {}
# build the dictionary: word -> number of sentences containing it
for sent in documents:
    for word in set(sent.split()):    
        d[word] = d.get(word, 0) + 1

# build the output in the desired format
result = []
for sent in documents:
    result.append({word: d[word] for word in sent.split()})

Upvotes: 1
