Ranger

Reputation: 105

How to make this for loop ready for Pandas/Modin/Ray

I have a semi-complex for loop which has to be applied row by row (I guess). I've read the information in e.g. 1. However, I cannot wrap my head around how I would build a dictionary using these options. Running the current loop on the dataset (120k rows, with large lists inside each row) runs out of memory.

Can someone give me a pointer or hint on how to make it run without an out-of-memory error killing Python, even though the machine has 100 GB of RAM?

Current simplified code example:

import modin.pandas as pd
import numpy as np
import ray

ray.init()

df = pd.read_csv("file.csv")

dict_res = {}
for index, row in df.iterrows():
    list_items = row['listed_items']
    length = len(list_items)
    # Visit every pair of positions (i, j) with i < j within this row
    for i in range(0, length):
        for j in range(i+1, length):
            key_str = "{},{}".format(list_items[i], list_items[j])
            # Each pair adds 1/(length-1) to its running total
            if key_str in dict_res:
                dict_res[key_str] += (1/(length-1))
            else:
                dict_res[key_str] = (1/(length-1))

Example row entries of df["listed_items"] and the resulting dict_res:

row1 = [100000, 200000, 421563]

row2 = [500, 453100, 442211, ...]


dict_res = {
"100000,200000" : 0.5,
"100000,421563" : 0.5,
"200000,421563" : 0.5,
... }
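To make the weighting concrete, here is a minimal sketch of the per-row rule from the loop above, applied to row1 only. The helper name pair_weights is hypothetical, and itertools.combinations is swapped in for the nested index loops; it yields the same position-ordered pairs.

```python
from itertools import combinations

def pair_weights(items):
    """Return {"a,b": weight} for all pairs in one row (hypothetical helper)."""
    n = len(items)
    weights = {}
    # combinations(items, 2) yields (items[i], items[j]) for every i < j,
    # matching the nested loops in the question; it yields nothing if n < 2.
    for a, b in combinations(items, 2):
        key = f"{a},{b}"
        weights[key] = weights.get(key, 0.0) + 1 / (n - 1)
    return weights

row1 = [100000, 200000, 421563]
print(pair_weights(row1))
# {'100000,200000': 0.5, '100000,421563': 0.5, '200000,421563': 0.5}
```

With three items, each pair gets 1/(3-1) = 0.5, which matches the dict_res entries shown above.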

ADDITION: For simpler testing I provide a file testfile.csv:

prop,items
XY108,"[9929, 102010, 301352, 521008]"
XY109,"[382, 396, 456, 639, 883, 1291, 1333, 1969, 9929, 102010, 11457, 12425, 15770]"

The df used in the example can be built by running the following (note that the column is named items here, not listed_items):

from collections import Counter
import modin.pandas as pd
import ray

ray.init()

df = pd.read_csv("testfile.csv")

def str_to_list(list_str):
    return [int(x) for x in list_str.strip('[]').split(',')]

df['items'] = df['items'].apply(str_to_list)

Upvotes: 0

Views: 448

Answers (1)

Mahesh Vashishtha

Reputation: 176

You don't have to iterate over the rows for this calculation. You can use apply to transform each list of items into a Series of pair counts, each divided by len(items) - 1, and then sum those Series across rows.

from collections import Counter
import pandas as pd

df = pd.DataFrame(
    {'listed_items': [
        ['a', 'b', 'c'],
        ['d', 'e'],
        ['a', 'b']
    ]
    }
)

def items_to_weighted_sums(items: list) -> pd.Series:
    # Count every pair (i, j) with i < j in this row
    counts = Counter()
    for i in range(0, len(items)):
        for j in range(i+1, len(items)):
            counts[f"{items[i]}_{items[j]}"] += 1
    # Weight each pair count by 1/(len(items) - 1)
    return pd.Series(counts) / (len(items) - 1)

# prints 
# {'a_b': 1.5, 'a_c': 0.5, 'b_c': 0.5, 'd_e': 1.0}
print(df['listed_items'].apply(items_to_weighted_sums).sum().to_dict())
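If memory is the bottleneck, one alternative worth trying is to keep a single running Counter across all rows, so that no per-row Series or wide intermediate frame has to be built. The following is only a sketch under assumptions: the function name accumulate_pair_weights is hypothetical, pairs are stored as tuple keys rather than formatted strings, and the number of distinct pairs is assumed to fit in memory. It accepts any iterable of lists, so a column such as df['listed_items'] could be passed in directly.

```python
from collections import Counter
from itertools import combinations

def accumulate_pair_weights(rows):
    """Sum 1/(len(items) - 1) per pair over all rows (hypothetical helper)."""
    totals = Counter()
    for items in rows:
        n = len(items)
        if n < 2:
            continue  # a row with fewer than two items has no pairs
        weight = 1 / (n - 1)
        # Tuple keys avoid building a formatted string for every pair
        for pair in combinations(items, 2):
            totals[pair] += weight
    return totals

# Same sample data as in the answer above
rows = [['a', 'b', 'c'], ['d', 'e'], ['a', 'b']]
totals = accumulate_pair_weights(rows)
print({f"{a}_{b}": w for (a, b), w in totals.items()})
# {'a_b': 1.5, 'a_c': 0.5, 'b_c': 0.5, 'd_e': 1.0}
```

This is single-threaded plain Python, so it trades the parallelism of apply for a smaller memory footprint; whether that trade pays off on 120k rows would need to be measured.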

Upvotes: 1
