How to make this for loop ready for Pandas/Modin/Ray

Question

I have a semi-complex for loop which (I guess) has to be applied row by row. I've read the information in e.g. 1. However, I cannot wrap my head around how I would create a dictionary using these options. Running the current loop on the dataset (120k rows, with large lists inside each row) kills Python with an out-of-memory error, even though the machine has 100 GB of RAM.

Can someone give me a pointer or hint on how to make it run without running out of memory?

Current simplified code example:

import modin.pandas as pd
import numpy as np
import ray

ray.init()

df = pd.read_csv("file.csv")

dict_res = {}
for index, row in df.iterrows():
    list_items = row['listed_items']
    length = len(list_items)
    # Visit every pair of positions (i < j) within the row.
    for i in range(length):
        for j in range(i + 1, length):
            key_str = "{},{}".format(list_items[i], list_items[j])
            # Each pair adds 1/(length-1) for every row it appears in.
            dict_res[key_str] = dict_res.get(key_str, 0) + 1 / (length - 1)

Example for df["listed_items"] row entries and dict_res as a result:

row1 = [100000, 200000, 421563]

row2 = [500, 453100, 442211, ...]


dict_res = {
    "100000,200000": 0.5,
    "100000,421563": 0.5,
    "200000,421563": 0.5,
    ... }
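
To make the weighting explicit: row1 has 3 items, so each of its pairs contributes 1/(3-1) = 0.5. A minimal plain-Python sketch that reproduces just the row1 part of the result is shown here; the helper name pair_weights is hypothetical, and itertools.combinations stands in for the nested index loops (it yields the same i < j pairs in the same order).

```python
from collections import defaultdict
from itertools import combinations


def pair_weights(rows):
    """Sum 1/(len(row)-1) over every pair of items in every row."""
    weights = defaultdict(float)
    for items in rows:
        n = len(items)
        if n < 2:
            continue  # a row with fewer than two items has no pairs
        w = 1 / (n - 1)
        # combinations(items, 2) yields (items[i], items[j]) for i < j
        for a, b in combinations(items, 2):
            weights["{},{}".format(a, b)] += w
    return dict(weights)


row1 = [100000, 200000, 421563]
result = pair_weights([row1])
print(result)
```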

ADDITION: For simpler testing I provide a file testfile.csv:

prop,items
XY108,"[9929, 102010, 301352, 521008]"
XY109,"[382, 396, 456, 639, 883, 1291, 1333, 1969, 9929, 102010, 11457, 12425, 15770]"

We get to the df used in the example (here the list column is named items rather than listed_items) by running:

from collections import Counter
import modin.pandas as pd
import ray

ray.init()

df = pd.read_csv("testfile.csv")

def str_to_list(list_str):
    return [int(x) for x in list_str.strip('[]').split(',')]

df['items'] = df['items'].apply(str_to_list)
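
For checking results on the test file, a reference computation without Modin or Ray can be handy. The sketch below is only that: it assumes the file layout shown above, embeds the two test rows in a string so it is self-contained, and uses tuple keys in a Counter instead of the "a,b" strings from the loop. The function name count_pairs is hypothetical, and this is not claimed to solve the memory problem on the full dataset.

```python
import ast
import csv
import io
from collections import Counter
from itertools import combinations

# Contents of testfile.csv as given above.
TESTFILE = '''prop,items
XY108,"[9929, 102010, 301352, 521008]"
XY109,"[382, 396, 456, 639, 883, 1291, 1333, 1969, 9929, 102010, 11457, 12425, 15770]"
'''


def count_pairs(handle):
    """Read rows one at a time and accumulate the pair weights."""
    counts = Counter()
    for record in csv.DictReader(handle):
        # The items column holds a list literal such as "[1, 2, 3]".
        items = ast.literal_eval(record["items"])
        n = len(items)
        if n < 2:
            continue  # no pairs in this row
        weight = 1 / (n - 1)
        for pair in combinations(items, 2):
            counts[pair] += weight  # assumption: tuple key, not "a,b" string
    return counts


counts = count_pairs(io.StringIO(TESTFILE))
print(len(counts), counts[(9929, 102010)])
```

To run it against the real file, pass an open file handle instead, e.g. open("testfile.csv", newline=""). The pair (9929, 102010) occurs in both rows, so its weight is the sum of the two per-row contributions, 1/3 + 1/12.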
