Make piece of code efficient for big data

Question

I have the following code:

new_df = pd.DataFrame(columns=df.columns)
for i in list:
    temp = df[df["customer id"]==i]
    new_df = new_df.append(temp)

where list is a list of customer IDs for the customers that meet a criterion chosen earlier. I use the temp dataframe because there are multiple rows for the same customer.

I know how to code, but I have never learned how to write code that is efficient on big data. In this case, df has around 3 million rows and list contains around 100,000 items. The code ran for more than 24 hours and still had not finished, so I have to ask: am I doing something terribly wrong? Is there a way to make this code more efficient?

Pramote Kuacharoen · Accepted Answer

list is a built-in type in Python, so you should avoid using it (or the name of any other built-in type or function) as a variable name. I simulated the problem with 3 million rows and a list of 100,000 customer IDs. Using isin, the filter took only a few seconds.

new_df = df[df['customer id'].isin(customer_list)]
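
For anyone who wants to try this, here is a minimal sketch of that kind of simulation, scaled down so it runs quickly. The synthetic data, the "amount" column, and the sizes are assumptions made up for illustration; only the "customer id" column and the isin call come from the question and the line above. The original loop is slow because each iteration scans the whole dataframe for one id and then copies the growing result, whereas isin builds a single boolean mask in one pass. Note also that DataFrame.append was removed in pandas 2.0, so the loop would need pd.concat there.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data standing in for the asker's df (scaled down).
rng = np.random.default_rng(0)
n_rows, n_customers = 300_000, 50_000
df = pd.DataFrame({
    "customer id": rng.integers(0, n_customers, size=n_rows),
    "amount": rng.random(n_rows),  # made-up extra column
})

# Hypothetical selection: 10,000 distinct ids that "meet the criterion".
customer_list = rng.choice(n_customers, size=10_000, replace=False).tolist()

# One vectorized pass: boolean mask of rows whose id is in customer_list.
mask = df["customer id"].isin(customer_list)
new_df = df[mask]

# Sanity check on a small sample: the per-id loop selects the same rows.
sample_ids = customer_list[:20]
looped = pd.concat([df[df["customer id"] == i] for i in sample_ids])
vectorized = df[df["customer id"].isin(sample_ids)]
```

Two differences from the loop are worth knowing. The isin result keeps the rows in their original df order rather than grouped in the order of the ids in the list, and an id that appears twice in the list does not duplicate its rows.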
