user12195705

Reputation: 147

Make a piece of code efficient for big data

I have the following code:

new_df = pd.DataFrame(columns=df.columns)
for i in list:
    temp = df[df["customer id"]==i]
    new_df = new_df.append(temp)

where list is a list of customer IDs for the customers that meet criteria chosen earlier. I use the temp dataframe because there are multiple rows for the same customer.

I consider myself able to code, but I have never learned how to write code efficiently for big data. In this case, the df has around 3 million rows and list contains around 100,000 items. This code ran for more than 24 hours and was still not done, so I have to ask: am I doing something terribly wrong? Is there a way to make this code more efficient?

Upvotes: 0

Views: 56

Answers (2)

Pramote Kuacharoen

Reputation: 1541

list is a built-in type in Python, so you should avoid naming your variables after built-in types or functions. I simulated the problem with 3 million rows and a list of 100,000 customer IDs; using isin, it took only a few seconds.

new_df = df[ df['customer id'].isin(customer_list) ]
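
For reference, a minimal sketch of such a simulation, assuming synthetic data built with NumPy (the 500,000-ID pool and the extra value column are illustrative assumptions, not part of the original problem):

import numpy as np
import pandas as pd

# Synthetic data: ~3 million rows, customer IDs drawn from a pool of 500,000
df = pd.DataFrame({
    "customer id": np.random.randint(0, 500_000, size=3_000_000),
    "value": np.random.rand(3_000_000),
})

# ~100,000 customer IDs that meet some prior criterion
customer_list = np.random.choice(500_000, size=100_000, replace=False)

# Vectorized membership test instead of a Python-level loop with append
new_df = df[df["customer id"].isin(customer_list)]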

Upvotes: 1

rhug123

Reputation: 8768

You can try the code below, which should make things much faster.

new_df = df.loc[df['customer id'].isin(list)]
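
Note that df.loc[boolean_mask] and df[boolean_mask] return the same rows, so this is the same vectorized isin lookup as in the other answer; renaming the list variable to something like customer_list (as suggested there) also avoids shadowing the built-in list type.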

Upvotes: 1
