Reputation: 105
I have a semi-complex for loop which has to be applied row by row (I guess). I've read the information in e.g. 1, but I cannot wrap my head around how I would build a dictionary using those options. Running the current loop on the dataset (120k rows, with a large list inside each row) ends in an out-of-memory error.
Can someone give me a pointer or hint on how to make it run without an out-of-memory error killing Python, even though the machine has 100 GB of RAM?
Current simplified code example:
import modin.pandas as pd
import numpy as np
import ray
ray.init()
df = pd.read_csv("file.csv")
dict_res = {}
for index, row in df.iterrows():
    list_items = row['listed_items']
    length = len(list_items)
    for i in range(0, length):
        for j in range(i+1, length):
            key_str = "{},{}".format(list_items[i], list_items[j])
            if key_str in dict_res:
                dict_res[key_str] += (1/(length-1))
            else:
                dict_res[key_str] = (1/(length-1))
Example for df["listed_items"] row entries and dict_res as a result:
row1 = [100000, 200000, 421563]
row2 = [500, 453100, 442211, ...]
dict_res = {
    "100000,200000" : 0.5,
    "100000,421563" : 0.5,
    "200000,421563" : 0.5,
    ... }
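To make the weighting concrete, here is a minimal standalone sketch of the per-row step. The helper name `pair_weights` is made up for illustration and is not part of the code above; it uses `itertools.combinations` in place of the two index loops, which should visit the same pairs in the same order.

```python
from collections import Counter
from itertools import combinations


def pair_weights(items):
    """Weights for all ordered-by-position pairs of one row.

    Hypothetical helper: each pair gets 1 / (len(items) - 1),
    matching the increment used in the loop above.
    """
    n = len(items)
    if n < 2:
        return {}  # a single item has no pairs (and would divide by zero)
    weight = 1 / (n - 1)
    weights = Counter()
    for a, b in combinations(items, 2):
        # += keeps the original behaviour if a row contains duplicates
        weights["{},{}".format(a, b)] += weight
    return dict(weights)


row1 = [100000, 200000, 421563]
print(pair_weights(row1))
# {'100000,200000': 0.5, '100000,421563': 0.5, '200000,421563': 0.5}
```

For `row1` the list has three items, so every pair gets 1/(3-1) = 0.5, which is where the values in `dict_res` above come from.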
ADDITION: For simpler testing I provide a file testfile.csv:
prop,items
XY108,"[9929, 102010, 301352, 521008]"
XY109,"[382, 396, 456, 639, 883, 1291, 1333, 1969, 9929, 102010, 11457, 12425, 15770]"
We get to the df used in the example by running the following (note that the list column is called items in this test file, not listed_items):
from collections import Counter
import modin.pandas as pd
import ray
ray.init()
df = pd.read_csv("testfile.csv")
def str_to_list(list_str):
    return [int(x) for x in list_str.strip('[]').split(',')]
df['items'] = df['items'].apply(str_to_list)
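As a quick sanity check of the parsing step, the sketch below runs on the two test rows with only the standard library, so it does not need modin or ray. It also shows `ast.literal_eval` as a possible alternative parser; that swap is my assumption, not something the loop above requires. One difference I am sure of: `str_to_list` raises a `ValueError` on an empty list string `"[]"`, while `ast.literal_eval` returns `[]`.

```python
import ast
import csv
import io

# Same two rows as testfile.csv, inlined so the sketch runs without the file.
CSV_TEXT = '''prop,items
XY108,"[9929, 102010, 301352, 521008]"
XY109,"[382, 396, 456, 639, 883, 1291, 1333, 1969, 9929, 102010, 11457, 12425, 15770]"
'''


def str_to_list(list_str):
    # Parser from the snippet above: strip the brackets, split on commas.
    return [int(x) for x in list_str.strip('[]').split(',')]


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Alternative parser: evaluate the cell as a Python literal.
parsed = [ast.literal_eval(r["items"]) for r in rows]

# Both parsers agree on the test data.
assert parsed == [str_to_list(r["items"]) for r in rows]
print(parsed[0])  # [9929, 102010, 301352, 521008]
```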
Upvotes: 0
Views: 448
Reputation: 176
You don't have to iterate over the rows for this calculation. You can use apply to transform each list of items into a Series of pair counts, each divided by len(items) - 1, and then sum those Series across all rows.
from collections import Counter
import pandas as pd
df = pd.DataFrame(
    {'listed_items': [
        ['a', 'b', 'c'],
        ['d', 'e'],
        ['a', 'b']
    ]
    }
)
def items_to_weighted_sums(items: list) -> pd.Series:
    counts = Counter()
    for i in range(0, len(items)):
        for j in range(i+1, len(items)):
            counts[f"{items[i]}_{items[j]}"] += 1
    return pd.Series(counts) / (len(items) - 1)
# prints
# {'a_b': 1.5, 'a_c': 0.5, 'b_c': 0.5, 'd_e': 1.0}
print(df['listed_items'].apply(items_to_weighted_sums).sum().to_dict())
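One caveat on memory: when the applied function returns a Series, apply builds an intermediate frame with one column per distinct pair, mostly filled with NaN. With 120k rows and long lists that frame may itself be very large. A possible alternative, sketched below with only the standard library, keeps a single running Counter instead. The name `accumulate_pair_weights` is hypothetical, and I have not run this on data of the size described in the question, so whether it fits in memory will depend on how many distinct pairs there are.

```python
from collections import Counter
from itertools import combinations


def accumulate_pair_weights(rows):
    """Sum pair weights over an iterable of item lists.

    Hypothetical helper: `rows` can be any iterable of lists,
    for example a DataFrame column such as df['listed_items'].
    """
    totals = Counter()
    for items in rows:
        n = len(items)
        if n < 2:
            continue  # no pairs in this row
        weight = 1 / (n - 1)
        for a, b in combinations(items, 2):
            totals[f"{a}_{b}"] += weight
    return dict(totals)


listed_items = [['a', 'b', 'c'], ['d', 'e'], ['a', 'b']]
print(accumulate_pair_weights(listed_items))
# {'a_b': 1.5, 'a_c': 0.5, 'b_c': 0.5, 'd_e': 1.0}
```

Because the function only needs an iterable, the rows could also be streamed from the CSV one at a time rather than loaded into a DataFrame first.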
Upvotes: 1