EmJ

Reputation: 4608

How to efficiently get count for item in list of lists in python

I have three lists as follows.

mylist = [[5274919, ["my cat", "little dog", "fish", "rat"]], 
          [5274920, ["my cat", "parrot", "little dog"]], 
          [5274991, ["little dog", "fish", "duck"]]] 
myconcepts = ["my cat", "little dog"]
hatedconcepts = ["rat", "parrot"]

For each concept in myconcepts, I want to count every other concept connected to it using mylist, and then remove the hatedconcepts from the result. So, my output should look as follows.

{"my cat": [("my cat", 2), ("little dog", 2), ("fish", 1)],
"little dog": [("little dog", 3), ("my cat", 2), ("fish", 2), ("duck", 1)]}

I was using this code to do it.

import collections
myoutput = []
for concept in myconcepts:
    mykeywords = []

    for item in mylist:
        if concept in item[1]:
            for mykeyword in item[1]:
                if mykeyword in hatedconcepts:
                    pass
                else:
                    mykeywords.append(mykeyword)
    if len(mykeywords) > 0:            
        sorted_keywords = collections.Counter(mykeywords).most_common()
        myoutput.append(tuple((concept, sorted_keywords)))
print(myoutput)

The output of the code is:

[('my cat', [('my cat', 2), ('little dog', 2), ('fish', 1)]), ('little dog', [('little dog', 3), ('my cat', 2), ('fish', 2), ('duck', 1)])]

However, now I have a huge mylist with a size of 3 GB and nearly 9000 myconcepts. The hatedconcepts count is only 20. It looks like it will take about two weeks to run using my current code. The main reason for this could be that my current program is O(n^3), which is not very efficient. Therefore, I am looking for ways to make my current program more efficient. I am fine with pythonic solutions even if they take 5-6 days to run. Please let me know your thoughts.

I have added a portion of mylist at https://drive.google.com/file/d/1M3EhIRwwKwD3Kv4zDsmXaH1D73tx0eF3/view?usp=sharing just to give some idea of what it looks like.

I am happy to provide more details if needed.

Upvotes: 17

Views: 2472

Answers (6)

Paddy3118

Reputation: 4772

This is a nice question with a nice example dataset.

You should be able to get it down to a run time of a couple of hours, using only a very small amount of memory, by adopting a streaming accumulator architecture.

I note that the contents of the sublists are very regular: just lists with a repeated nesting structure, and strings that don't contain any square brackets themselves. This allows the following strategy:

  1. Define a chunk_size that is ideally more than twice the number of characters in the text describing a sublist in the file; I used ~2 MB.

  2. Read the next chunk_size characters from the file into a buffer.

  3. Use re.finditer to extract successive sub_lists from the buffer.

  4. eval() each sub_list, then run an Accumulator class instance on the result to increment the counts according to your inclusion/exclusion rules.

  5. Repeat for all sub_lists in the buffer.

  6. Save the characters at the end of the buffer not used by finditer.

  7. Repeat for the next buffer from the file, prepending any characters not used at the end of the previous buffer.

  8. Extract the running totals from the Accumulator instance when the file is exhausted.

A rough sketch of this is shown below.
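
Here is a minimal sketch of what such a streaming accumulator could look like (my own illustration of the steps above, not the final code mentioned below; the regex, the chunk size, and the use of ast.literal_eval in place of a bare eval() are assumptions):

import re
from ast import literal_eval
from collections import Counter, defaultdict

SUBLIST_RE = re.compile(r"\[\d+,\s*\[[^\[\]]*\]\]")  # one [id, [phrases, ...]] entry
CHUNK_SIZE = 2 * 1024 * 1024                         # ~2 MB per read

class Accumulator:
    """Keeps running co-occurrence counts, honouring the inclusion/exclusion rules."""
    def __init__(self, concepts, hated):
        self.concepts = set(concepts)
        self.hated = set(hated)
        self.counts = defaultdict(Counter)

    def add(self, sublist):
        phrases = set(sublist[1]) - self.hated
        for concept in phrases & self.concepts:
            self.counts[concept].update(phrases)

def stream_counts(path, concepts, hated):
    acc = Accumulator(concepts, hated)
    leftover = ""
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            buffer = leftover + chunk
            last_end = 0
            for match in SUBLIST_RE.finditer(buffer):
                acc.add(literal_eval(match.group()))
                last_end = match.end()
            leftover = buffer[last_end:]  # carry the unused tail into the next chunk
    return {c: counter.most_common() for c, counter in acc.counts.items()}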

I first found the phrases used the most in your 11 MB example, then used the 1st, 3rd, and 5th for a test list of words to include, and the 2nd, 4th, and 6th for words to exclude.

I had two ways of solving the problem so I could compare results to make sure they tally.

  1. Simple: Just read in the whole file, eval it, then apply the Accumulator to each sub-list.
  2. Streaming: Chunked as described above.

The Simple solution takes two thirds of the time of the Streaming, so it is faster (when you can fit everything in memory).

The Streaming solution takes 1/150th of the memory of the Simple solution and should stay pretty constant in memory use for larger file sizes whereas the Simple solution will need ever more memory.

The Streaming solution took less than 5 seconds to run on the 11 MB file, so it could take a few hours for 11 GB files.

I intend to write a blog post on this which will show my final code.

Upvotes: 1

Victor Uriarte

Reputation: 479

I realize this is coming a bit late now, but just to throw my answer out there.

I hadn't noticed bphi's answer before I wrote mine. The idea is almost identical, but this answer comes out sorted.

from collections import Counter, defaultdict

s_myconcepts = set(myconcepts)
s_hatedconcepts = set(hatedconcepts)

myoutput = defaultdict(list)
for _, item in mylist:
    item = set(item)
    # for every concept of interest present in this row, collect all
    # co-occurring phrases except the hated ones
    for concept in item.intersection(s_myconcepts):
        myoutput[concept].extend(item - s_hatedconcepts)

# count the collected phrases and sort by frequency
myoutput = {k: Counter(v).most_common() for k, v in myoutput.items()}
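
On the sample data from the question this produces counts matching the expected output (the ordering of ties within equal counts may vary, since it depends on set iteration order):

{'my cat': [('my cat', 2), ('little dog', 2), ('fish', 1)],
 'little dog': [('little dog', 3), ('my cat', 2), ('fish', 2), ('duck', 1)]}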

Upvotes: 1

Sayandip Dutta

Reputation: 15872

I have tried to make it fast by avoiding some repeated loops. Please check if this speeds things up.

from itertools import chain
from collections import Counter, defaultdict

database = defaultdict(set)
output = {}

# created a map for different concepts, so we only search the indices where a certain concept is
for index, (_, concepts) in enumerate(mylist):
    for concept in concepts:
        database[concept].add(index)

for concept in myconcepts:
    search_indices = database[concept]
    all_counts = Counter(chain.from_iterable(mylist[i][1] for i in search_indices))
    for hc in hatedconcepts:
        all_counts.pop(hc, None)
    output[concept] = sorted(all_counts.items(), key=lambda x: x[1], reverse=True)

Upvotes: 2

bphi

Reputation: 3185

As other comments and answers have indicated, this operation is better handled by Spark or a database. That said, here's my take on it; I introduced some set operations and minimized repeated loops.

from collections import defaultdict

def get_counts(lst, concepts, hated_concepts):
    result = {concept: defaultdict(int) for concept in concepts}

    concepts_set = set(concepts)
    hated_concepts_set = set(hated_concepts)

    for _, inner_list in lst:
        # ignore hated concepts
        relevant = set(inner_list).difference(hated_concepts_set)

        # determine which concepts need to be updated
        to_update = relevant.intersection(concepts_set)

        for concept in to_update:
            for word in relevant:
                result[concept][word] += 1

    return result

Output is below. You mention the output "must be sorted", but it's unclear to me what the desired sorting is. Some timing tests indicate this is 9x faster than the code you provided on your sample data.

{
    'my cat': defaultdict(<class 'int'>, {'my cat': 2, 'fish': 1, 'little dog': 2}), 
    'little dog': defaultdict(<class 'int'>, {'my cat': 2, 'fish': 2, 'little dog': 3, 'duck': 1})
}

Performance Improvement

emj_functn avg 0.9355s
get_counts avg 0.1141s

Performance testing script:

import random
import string
import time

words = list({
    ''.join(random.choice(string.ascii_lowercase) for _ in range(5))
    for _ in range(1000)
})
test_list = [[random.randint(10**6, 10**7), [random.choice(words) for _ in range(100)]] for _ in range(1000)]
test_concepts = [random.choice(words) for _ in range(100)]
test_hated_concepts = [random.choice(words) for _ in range(50)]


def emj_functn(lst, concepts, hated_concepts):
    ...


def get_counts(lst, concepts, hated_concepts):
    ...


TEST_CASES = 10

start_time = time.time()
for _ in range(TEST_CASES):
    emj_functn(test_list, test_concepts, test_hated_concepts)
end_time = time.time()
avg = (end_time - start_time) / TEST_CASES
print(f'emj_functn avg {avg:.4}s')

start_time = time.time()
for _ in range(TEST_CASES):
    get_counts(test_list, test_concepts, test_hated_concepts)
end_time = time.time()
avg = (end_time - start_time) / TEST_CASES
print(f'get_counts avg {avg:.4}s')

Upvotes: 3

TBhavnani

Reputation: 771

try this:

from collections import Counter
req={}
for i in myconcepts:
    x=sum([j[1] for j in mylist if i in j[1]],[])
    x=[w for w in x if w not in hatedconcepts]
    req[i]=dict(Counter(x))
print(req)

output:

{'my cat': {'my cat': 2, 'little dog': 2, 'fish': 1}, 'little dog': {'my cat': 2, 'little dog': 3, 'fish': 2, 'duck': 1}}

Upvotes: 1

Anurag Wagh

Reputation: 1086

I would suggest going with Apache Spark or Apache Hadoop, since you have a word-count-like problem; in fact, these frameworks specialize in that.

Both frameworks can be used with Python.

But if you want to stick with only Python, I would suggest parallelization:

Split my_list into n sublists, my_sub_lists:

my_list = ["my cat", "little dog", "fish", "rat", "my cat","little dog" ]
# split my_list into n=2 sublists
my_sub_lists = [["my cat", "little dog", "fish"], ["rat", "my cat","little dog"]]

Compute item counts for my_sub_lists in parallel

Process 1: Counter(["my cat", "little dog", "fish"])
Process 2: Counter(["rat", "my cat", "little dog"])

You would get an intermediate aggregation, my_sub_counts:

my_sub_counts = [{"my cat":1, "little dog":1, "fish":1}, {"rat":1, "my cat":1,"little dog":1}]

Merge the intermediate results to get the final item count:

result = {"my cat":2, "little dog":2, "fish":1, "rat":1}

Combining the intermediate aggregations is easier since they are smaller. A sketch of this split/count/merge approach is shown below.
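
Here is a minimal sketch of that split/count/merge idea using the standard multiprocessing module (my own illustration of the outline above; the chunking helper and the two-worker pool are assumptions):

from collections import Counter
from multiprocessing import Pool

def count_chunk(chunk):
    # each worker counts one sublist independently
    return Counter(chunk)

def parallel_count(items, n_workers=2):
    # split items into roughly equal chunks, one per worker
    chunk_size = (len(items) + n_workers - 1) // n_workers
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with Pool(n_workers) as pool:
        partial_counts = pool.map(count_chunk, chunks)
    # merge the smaller intermediate aggregations into the final count
    total = Counter()
    for partial in partial_counts:
        total += partial
    return total

if __name__ == "__main__":
    my_list = ["my cat", "little dog", "fish", "rat", "my cat", "little dog"]
    print(parallel_count(my_list))
    # expected counts: {'my cat': 2, 'little dog': 2, 'fish': 1, 'rat': 1}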

Upvotes: 1
