Reputation: 147
I have the following code:
new_df = pd.DataFrame(columns=df.columns)
for i in list:
    temp = df[df["customer id"]==i]
    new_df = new_df.append(temp)
where list is a list of customer IDs for the customers that meet a criterion chosen earlier. I use the temp dataframe because there can be multiple rows for the same customer.
I know how to code, but I have never learned how to write code that is efficient on large data. In this case, df has around 3 million rows and list contains around 100,000 items. This code ran for more than 24 hours without finishing, so I need to ask: am I doing something terribly wrong? Is there a way to make this code more efficient?
Upvotes: 0
Views: 56
Reputation: 1541
list is a built-in type in Python. You should avoid naming your variables after built-in types or functions. I simulated the problem with 3 million rows and a list of 100,000 customer IDs. It took only a few seconds using isin.
new_df = df[ df['customer id'].isin(customer_list) ]
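For anyone who wants to reproduce the simulation, here is a minimal sketch. The synthetic data is an assumption on my part: the "amount" column, the 500,000 distinct customers, and the random seed are made up for illustration and are not from the question. The original loop is slow because every pass scans all 3 million rows to build one mask and then copies the growing result, while isin does a single membership test over the column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (sizes and the "amount" column are assumptions).
rng = np.random.default_rng(0)
n_rows, n_customers = 3_000_000, 500_000
df = pd.DataFrame({
    "customer id": rng.integers(0, n_customers, size=n_rows),
    "amount": rng.random(n_rows),
})

# Hypothetical selection of 100,000 distinct customer ids.
customer_list = rng.choice(n_customers, size=100_000, replace=False).tolist()

# One vectorized membership test over the whole column,
# followed by one boolean-mask selection.
mask = df["customer id"].isin(customer_list)
new_df = df[mask]
```

Note that new_df keeps the rows in their original df order, whereas the loop groups them in the order of the ids in the list.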
Upvotes: 1
Reputation: 8768
You can try the code below, which replaces the loop with a single vectorized filter and should be much faster.
new_df = df.loc[df['customer id'].isin(list)]
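If you want to convince yourself that this selects the same rows as the loop, a small check along these lines may help. The toy data and the name ids are hypothetical. The loop here collects the pieces and calls pd.concat once, because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({
    "customer id": [3, 1, 2, 1, 3, 4],
    "amount": [10, 20, 30, 40, 50, 60],
})
ids = [1, 3]  # hypothetical list of selected customers

# Loop version: one filtered piece per id, concatenated once at the end.
pieces = [df[df["customer id"] == i] for i in ids]
loop_df = pd.concat(pieces)

# Vectorized version: a single boolean mask applied with .loc.
isin_df = df.loc[df["customer id"].isin(ids)]

# Both contain the same rows; only the row order differs,
# so compare after sorting by the original index.
same = loop_df.sort_index().equals(isin_df.sort_index())
print(same)
```

One difference to keep in mind: if the id list contains duplicates, the loop returns those customers' rows more than once, while isin returns each matching row exactly once.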
Upvotes: 1