Reputation: 4608
I have three lists as follows.
mylist = [[5274919, ["my cat", "little dog", "fish", "rat"]],
[5274920, ["my cat", "parrot", "little dog"]],
[5274991, ["little dog", "fish", "duck"]]]
myconcepts = ["my cat", "little dog"]
hatedconcepts = ["rat", "parrot"]
For each concept in myconcepts, I want to get the count of every other concept connected to it using mylist. Then I want to remove the hatedconcepts from the result. So, my output should look as follows.
{"my cat": [("my cat", 2), ("little dog", 2), ("fish", 1)],
"little dog": [("little dog", 3), ("my cat", 2), ("fish", 2), ("duck", 1)]}
I was using this code to do it.
import collections

myoutput = []
for concept in myconcepts:
    mykeywords = []
    for item in mylist:
        if concept in item[1]:
            for mykeyword in item[1]:
                if mykeyword in hatedconcepts:
                    pass
                else:
                    mykeywords.append(mykeyword)
    if len(mykeywords) > 0:
        sorted_keywords = collections.Counter(mykeywords).most_common()
        myoutput.append(tuple((concept, sorted_keywords)))
print(myoutput)
The output of the code is:
[('my cat', [('my cat', 2), ('little dog', 2), ('fish', 1)]), ('little dog', [('little dog', 3), ('my cat', 2), ('fish', 2), ('duck', 1)])]
However, now I have a huge mylist with a size of 3GB and nearly 9000 myconcepts. The hatedconcepts count is only 20. It looks like it would take about two weeks to run using my current code. The main reason for this could be that my current program is O(n^3), which is not very efficient. Therefore, I am looking for ways to make my current program more efficient. I am even fine with pythonic solutions that take 5-6 days to run. Please let me know your thoughts.
I have added a portion of mylist at https://drive.google.com/file/d/1M3EhIRwwKwD3Kv4zDsmXaH1D73tx0eF3/view?usp=sharing just to give some idea of how it looks.
I am happy to provide more details if needed.
Upvotes: 17
Views: 2472
Reputation: 4772
This is a nice question with a nice example dataset.
You should be able to get this down to a run time of a couple of hours, using only a very small amount of memory, by adopting a streaming accumulator architecture.
I note that the contents of the sublists are very regular: just lists with a repeated nesting structure, and strings that don't contain any square brackets themselves. This allows the following strategy:
Define a chunk_size that is ideally more than twice the number of characters in the text describing a sublist in the file; I used ~2 MB.
Read the next chunk_size chars from the file into a buffer.
Use re.finditer with a regular expression to extract successive sublists from the buffer.
eval() each sublist, then feed the result to an Accumulator class instance that increments the counts according to your inclusion/exclusion rules.
Repeat for all sublists in the buffer.
Save the characters at the end of the buffer that finditer did not consume.
Repeat for the next buffer from the file, prepending any characters left over from the end of the previous buffer.
Extract the running totals from the Accumulator instance when the file is exhausted.
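As a rough illustration, here is a minimal sketch of that loop. The Accumulator interface, the regular expression and the 2 MB default chunk size are assumptions of mine, not the answer's final code.
import re
from collections import Counter

class Accumulator:
    def __init__(self, concepts, hated):
        self.concepts = set(concepts)
        self.hated = set(hated)
        self.totals = {c: Counter() for c in concepts}

    def add(self, sublist):
        # sublist is e.g. [5274919, ["my cat", "little dog", ...]]
        _, words = sublist
        kept = [w for w in words if w not in self.hated]
        for concept in self.concepts.intersection(words):
            self.totals[concept].update(kept)

def stream_counts(path, concepts, hated, chunk_size=2 * 1024 * 1024):
    acc = Accumulator(concepts, hated)
    # assumes each sublist is written as [<id>, ["...", "...", ...]] and that
    # the strings never contain square brackets
    sublist_re = re.compile(r'\[\d+,\s*\[[^\]]*\]\]')
    leftover = ''
    with open(path, encoding='utf-8') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buffer = leftover + chunk
            last_end = 0
            for match in sublist_re.finditer(buffer):
                acc.add(eval(match.group()))  # each match is a literal Python list
                last_end = match.end()
            leftover = buffer[last_end:]  # unconsumed tail, completed by the next chunk
    return {c: counter.most_common() for c, counter in acc.totals.items()}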
I first found the most frequently used phrases in your 11 MB example, then used the 1st, 3rd and 5th for a test list of words to include and the 2nd, 4th and 6th for words to exclude.
I had two ways of solving the problem so I could compare results to make sure they tally.
The Simple solution takes two thirds of the time of the Streaming, so it is faster (when you can fit everything in memory).
The Streaming solution takes 1/150th of the memory of the Simple solution and should stay pretty constant in memory use for larger file sizes whereas the Simple solution will need ever more memory.
The Streaming solution took less than 5 seconds to run with the 11 MB file, so it could take a few hours for 11 GB files.
I intend to write a blog post on this which will show my final code.
Upvotes: 1
Reputation: 479
I realize this is coming a bit late now, but just to throw my answer out there: I hadn't noticed bphi's answer before I wrote mine. The idea is almost identical, but this answer comes out sorted.
from collections import Counter, defaultdict

s_myconcepts = set(myconcepts)
s_hatedconcepts = set(hatedconcepts)

myoutput = defaultdict(list)
for _, item in mylist:
    item = set(item)
    for concept in item.intersection(s_myconcepts):
        myoutput[concept].extend(item - s_hatedconcepts)

myoutput = {k: Counter(v).most_common() for k, v in myoutput.items()}
Upvotes: 1
Reputation: 15872
I have tried to make it fast by avoiding some repeated loops. Please check whether this speeds things up.
from itertools import chain
from collections import Counter, defaultdict

database = defaultdict(set)
output = {}

# build a map from each concept to the indices of the sublists that contain it,
# so we only search the sublists where a given concept actually appears
for index, (_, concepts) in enumerate(mylist):
    for concept in concepts:
        database[concept].add(index)

for concept in myconcepts:
    search_indices = database[concept]
    all_counts = Counter(chain.from_iterable(mylist[i][1] for i in search_indices))
    for hc in hatedconcepts:
        if hc in all_counts:
            all_counts.pop(hc)
    output[concept] = sorted(all_counts.items(), key=lambda x: x[1], reverse=True)
Upvotes: 2
Reputation: 3185
As other comments and answers have indicated, this operation is better handled by Spark or a database. That said, here's my take on it: I introduced some set operations and minimized repeated loops.
from collections import defaultdict

def get_counts(lst, concepts, hated_concepts):
    result = {concept: defaultdict(int) for concept in concepts}
    concepts_set = set(concepts)
    hated_concepts_set = set(hated_concepts)

    for _, inner_list in lst:
        # ignore hated concepts
        relevant = set(inner_list).difference(hated_concepts_set)
        # determine which concepts need to be updated
        to_update = relevant.intersection(concepts_set)
        for concept in to_update:
            for word in relevant:
                result[concept][word] += 1

    return result
Output is below. You mention the output "must be sorted", but it's unclear to me what the desired sorting is; a small post-processing snippet after the output shows one way to get the most-common ordering. Some timing tests indicate this is about 9x faster than the code you provided on your sample data.
{
'my cat': defaultdict(<class 'int'>, {'my cat': 2, 'fish': 1, 'little dog': 2}),
'little dog': defaultdict(<class 'int'>, {'my cat': 2, 'fish': 2, 'little dog': 3, 'duck': 1})
}
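If the most-common ordering from the question is needed, the defaultdicts can be converted afterwards. This post-processing step is my own addition for illustration, not part of the answer above.
# hypothetical post-processing (not part of the original answer):
# sort each concept's counts in descending order, like Counter.most_common()
counts = get_counts(mylist, myconcepts, hatedconcepts)
sorted_counts = {
    concept: sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
    for concept, word_counts in counts.items()
}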
emj_functn avg 0.9355s
get_counts avg 0.1141s
Performance testing script:
import random
import string
import time

words = list({
    ''.join(random.choice(string.ascii_lowercase) for _ in range(5))
    for _ in range(1000)
})

# use integer bounds so random.randint does not receive floats
test_list = [[random.randint(10**6, 10**7), [random.choice(words) for _ in range(100)]] for _ in range(1000)]
test_concepts = [random.choice(words) for _ in range(100)]
test_hated_concepts = [random.choice(words) for _ in range(50)]

def emj_functn(lst, concepts, hated_concepts):
    ...

def get_counts(lst, concepts, hated_concepts):
    ...

TEST_CASES = 10

start_time = time.time()
for _ in range(TEST_CASES):
    emj_functn(test_list, test_concepts, test_hated_concepts)
end_time = time.time()
avg = (end_time - start_time) / TEST_CASES
print(f'emj_functn avg {avg:.4}s')

start_time = time.time()
for _ in range(TEST_CASES):
    get_counts(test_list, test_concepts, test_hated_concepts)
end_time = time.time()
avg = (end_time - start_time) / TEST_CASES
print(f'get_counts avg {avg:.4}s')
Upvotes: 3
Reputation: 771
try this:
from collections import Counter

req = {}
for i in myconcepts:
    x = sum([j[1] for j in mylist if i in j[1]], [])
    x = [k for k in x if k not in hatedconcepts]
    req[i] = dict(Counter(x))
print(req)
output:
{'my cat': {'my cat': 2, 'little dog': 2, 'fish': 1}, 'little dog': {'my cat': 2, 'little dog': 3, 'fish': 2, 'duck': 1}}
Upvotes: 1
Reputation: 1086
I would suggest going with Apache Spark or Apache Hadoop, since you have a word-count-like example; these frameworks specialize in exactly that.
Both frameworks can be used with Python.
But if you want to stick with plain Python, I would suggest parallelization:
Split my_list into n sublists, my_sub_lists:
my_list = ["my cat", "little dog", "fish", "rat", "my cat","little dog" ]
# split my_list into n=2 sublists
my_sub_lists = [["my cat", "little dog", "fish"], ["rat", "my cat","little dog"]]
Compute item counts for the my_sub_lists in parallel:
Process 1: Counter(["my cat", "little dog", "fish"])
Process 2: Counter(["rat", "my cat", "little dog"])
You would get some intermediate aggregations, my_sub_counts:
my_sub_counts = [{"my cat":1, "little dog":1, "fish":1}, {"rat":1, "my cat":1,"little dog":1}]
Merge the intermediate results to get the final item count:
result = {"my cat":2, "little dog":2, "fish":1, "rat":1}
Combining the intermediate aggregations would be easy since they are much smaller than the input. A minimal multiprocessing sketch of this map/reduce pattern is below.
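Here is one rough sketch of that map/reduce pattern using the standard-library multiprocessing module; the chunking helper, the worker count and the function names are assumptions for illustration, not part of the answer above.
from collections import Counter
from multiprocessing import Pool

def count_chunk(chunk):
    # map step: count the items in one sublist
    return Counter(chunk)

def parallel_counts(items, n_workers=2):
    # split items into n_workers roughly equal chunks
    size = (len(items) + n_workers - 1) // n_workers
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with Pool(n_workers) as pool:
        partial_counts = pool.map(count_chunk, chunks)
    # reduce step: merge the small intermediate Counters
    total = Counter()
    for partial in partial_counts:
        total += partial
    return total

if __name__ == '__main__':
    my_list = ["my cat", "little dog", "fish", "rat", "my cat", "little dog"]
    print(parallel_counts(my_list))
    # Counter({'my cat': 2, 'little dog': 2, 'fish': 1, 'rat': 1})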
Upvotes: 1