ov08

Reputation: 125

Select rows of pandas dataframe based on column values with duplicates

I have a dataframe like this:

   customer_id  some_data
0      1            A
1      2            B
2      3            C  
3      1            D
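
For reference, the frame above can be rebuilt like this (assuming pandas is imported as pd):

import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 2, 3, 1],
    'some_data': ['A', 'B', 'C', 'D'],
})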

and a list of customer_id values that may contain duplicates, for example [1, 2, 2]. Based on this list I want to select the rows whose customer_id matches a value in the list, and if a value appears more than once in the list, its rows should be repeated in the output. For [1, 2, 2] the output should be:

   customer_id  some_data
 0     1            A
 3     1            D
 1     2            B
 1     2            B    

I tried something like this

import pandas as pd

ids = [1, 2, 2]  # customer_id values to select, may contain duplicates

df_new = df[df.customer_id == ids[0]]
for i in range(1, len(ids)):
    temp = df[df.customer_id == ids[i]]
    df_new = pd.concat([df_new, temp])  # append the matches for each id

This code works, but my df is large, so it takes a long time to run. Can I optimize it somehow?

Upvotes: 1

Views: 75

Answers (1)

adhg

Reputation: 10853

Create another dummy dataframe with the ids you wish to have:

df2 = pd.DataFrame({'customer_id':[1,2,2]})

    customer_id
0   1
1   2
2   2

and merge it with the given dataframe:

df.merge(df2)

desired result:

 customer_id    some_data
0   1            A
1   1            D
2   2            B
3   2            B
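
One detail worth noting: merge builds a fresh 0..n-1 index, while the expected output in the question keeps the original row labels (0, 3, 1, 1). If those labels matter, a small variant (a sketch, not required by the approach above) carries them through:

result = (
    df.reset_index()       # old index becomes a regular 'index' column
      .merge(df2)          # still joins on the shared customer_id column
      .set_index('index')  # restore the original row labels
      .rename_axis(None)   # drop the leftover 'index' axis name
)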

Most importantly: your code will work, but it's very slow for large data. The reason for your long processing time is the for loop. To optimize, you should always aim to vectorize.
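
For a rough sense of the gap, here is a hedged comparison on synthetic data (the sizes and the number of ids are arbitrary assumptions, purely for illustration):

import time
import numpy as np
import pandas as pd

# synthetic data -- sizes are arbitrary
df = pd.DataFrame({
    'customer_id': np.random.randint(0, 10_000, size=1_000_000),
    'some_data': np.random.choice(list('ABCD'), size=1_000_000),
})
ids = list(np.random.randint(0, 10_000, size=500))

# loop with repeated concat, as in the question
start = time.perf_counter()
looped = df[df.customer_id == ids[0]]
for i in ids[1:]:
    looped = pd.concat([looped, df[df.customer_id == i]])
print('loop : %.2fs' % (time.perf_counter() - start))

# single vectorized merge
start = time.perf_counter()
merged = df.merge(pd.DataFrame({'customer_id': ids}))
print('merge: %.2fs' % (time.perf_counter() - start))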

Upvotes: 1
