Reputation: 63
I need the number of documents in which a particular word occurs.
Example:
data = ["This is my pen","That is his pen","This is not my pen"]
desired output:
{'This':2,'is': 3,'my': 2,'pen':3}
{'That':1,'is': 3,'his': 1,'pen':3}
{'This':2,'is': 3,'not': 1,'my': 2,'pen':3}
for sent in documents:
    for word in sent.split():
        if word in sent:
            windoc=dict(Counter(sent.split()))
            print(windoc)
Upvotes: 0
Views: 131
Reputation: 2914
Considering that a word shall not be counted more than once per document:
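import collections

data = ["This is my pen my pen my pen","That is his pen","This is not my pen"]
# Generator of per-document word sets; it can only be consumed once.
deduped = (set(d.split()) for d in data)
# Document frequency: each word counts at most once per document.
freq = collections.Counter(w for d in deduped for w in d)
# deduped is exhausted at this point, so split each document again.
result = [{ w: freq[w] for w in d.split() } for d in data ]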
You need to deduplicate the words first (see deduped above). I made deduped a generator to avoid building an intermediate list of sets, but iterating over it still produces an intermediate set of words for each document.
Alternatively, you could implement your own counter. That isn't a good idea in general, but it may be required if memory consumption is critical and you want to avoid the intermediate sets created when iterating over the deduped generator.
Either way, the time and memory complexity are linear.
Output:
[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
{'That': 1, 'his': 1, 'is': 3, 'pen': 3},
{'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]
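The "own counter" alternative mentioned above could be sketched roughly as follows. This is only one possible approach, not necessarily what the answer had in mind: instead of building a set per document, it remembers the index of the last document in which each word was counted. The names document_frequencies and last_seen are made up for this illustration.

```python
def document_frequencies(docs):
    """Count, for each word, how many documents contain it (no per-document sets)."""
    freq = {}
    last_seen = {}  # hypothetical helper: word -> index of the last document that counted it
    for i, doc in enumerate(docs):
        for word in doc.split():
            # Count a word only the first time it appears in document i.
            if last_seen.get(word) != i:
                last_seen[word] = i
                freq[word] = freq.get(word, 0) + 1
    return freq


data = ["This is my pen my pen my pen", "That is his pen", "This is not my pen"]
freq = document_frequencies(data)
result = [{w: freq[w] for w in d.split()} for d in data]
print(result)
```

Note that this trades the per-document sets for one extra dictionary the size of the vocabulary, so whether it actually saves memory depends on the data.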
Upvotes: 2
Reputation: 7206
from collections import Counter

data = ["This is my pen is is","That is his pen pen pen pen","This is not my pen"]

# Collect each word once per document, then count the documents per word.
d = []
for s in data:
    for word in set(s.split()):
        d.append(word)
wordCount = Counter(d)

# Build one {word: document count} dict per document.
for item in data:
    result = {}
    for word in item.split():
        result[word] = wordCount[word]
    print(result)
output:
{'This': 2, 'is': 3, 'my': 2, 'pen': 3}
{'That': 1, 'is': 3, 'his': 1, 'pen': 3}
{'This': 2, 'is': 3, 'not': 1, 'my': 2, 'pen': 3}
Upvotes: 1
Reputation: 14689
You can construct a dictionary that holds the word frequencies based on all the available sentences, then construct the desired output from it. Here's a working example:
Given the input documents:
In [1]: documents
Out[1]: ['This is my pen', 'That is his pen', 'This is not my pen']
Construct the word frequency dictionary:
In [2]: d = {}
   ...: for sent in documents:
   ...:     for word in set(sent.split()):
   ...:         d[word] = d.get(word, 0) + 1
   ...:
Then construct the desired output:
In [3]: result = []
   ...: for sent in documents:
   ...:     result.append({word: d[word] for word in sent.split()})
   ...:
In [4]: result
Out[4]:
[{'This': 2, 'is': 3, 'my': 2, 'pen': 3},
{'That': 1, 'his': 1, 'is': 3, 'pen': 3},
{'This': 2, 'is': 3, 'my': 2, 'not': 1, 'pen': 3}]
So, overall, the code looks like this:
documents = ['This is my pen', 'That is his pen', 'This is not my pen']
d = {}
# construct the word frequencies dictionary
for sent in documents:
    for word in set(sent.split()):
        d[word] = d.get(word, 0) + 1
# format the output in the desired format
result = []
for sent in documents:
    result.append({word: d[word] for word in sent.split()})
Upvotes: 1