Reputation: 312
I am trying to replace multiple rows of a pandas dataframe with values from another dataframe.
Suppose I have 10,000 rows of customer_id in my dataframe df1 and I want to replace these customer_id values with 3,000 values from df2.
For the sake of illustration, let's generate the dataframes (below).
Say these 10 rows in df1 represent the 10,000 rows, and the 3 rows from df2 represent the 3,000 values.
import numpy as np
import pandas as pd
np.random.seed(42)
# Create df1 with unique values
arr1 = np.arange(100,200,10)
np.random.shuffle(arr1)
df1 = pd.DataFrame(data=arr1,
                   columns=['customer_id'])
# Create df2 for new unique_values
df2 = pd.DataFrame(data=[1800, 1100, 1500],
                   index=[180, 110, 150],  # this is customer_id column on df1
                   columns=['customer_id_new'])
I want to replace 180 with 1800, 110 with 1100, and 150 with 1500.
I know we can do the following:
# Replace multiple values
replace_values = {180 : 1800, 110 : 1100, 150 : 1500 }
df1_replaced = df1.replace({'customer_id': replace_values})
And it works fine if I only have a few lines...
But if I have thousands of lines that I need to replace, how could I do this without typing out what values I want to change one at a time?
EDIT: To clarify, I don't need to use replace. Anything that could replace those values in df1 with values from df2 in the fastest, most efficient way is OK.
Upvotes: 1
Views: 1328
Reputation: 41
In my opinion, apart from trying out the other useful answers, you may try parallelising the work on your dataframe in case you have a multi-core processor.
For example:
import pandas as pd, numpy as np, seaborn as sns
from multiprocessing import Pool
num_partitions = 10  # number of partitions to split the dataframe into
num_cores = 4  # number of cores on your machine
iris = pd.DataFrame(sns.load_dataset('iris'))
def parallelize_dataframe(df, func):
    # Split the dataframe into chunks, run func on each chunk in a worker process, then reassemble
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
In place of the 'func' parameter, you may pass a function that applies your replace method to each chunk. Please let me know if it helps. In case of any error, do comment.
Thanks!
Upvotes: 1
Reputation: 2188
df1['customer_id'] = df1['customer_id'].replace(df2['customer_id_new'])
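Alternatively, you can do it in place. Note that this is a chained in-place call on a selected column: recent pandas versions warn about it, and with Copy-on-Write enabled it leaves df1 unchanged, so the assignment above is the safer form.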
df1['customer_id'].replace(df2['customer_id_new'], inplace=True)
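As a quick check of the assignment form, here is a minimal sketch on a small hand-written sample (the five customer_id values are an illustrative subset, not the shuffled array from the question). It relies on pandas treating the Series passed to replace as a mapping from its index (old id) to its values (new id).

```python
import pandas as pd

# Illustrative sample: three ids that have a replacement, two that do not.
df1 = pd.DataFrame({'customer_id': [180, 110, 150, 100, 170]})
df2 = pd.DataFrame({'customer_id_new': [1800, 1100, 1500]},
                   index=[180, 110, 150])

# The Series acts like a dict: index -> old value, data -> new value.
# Ids that are not in df2's index are left as they are.
df1['customer_id'] = df1['customer_id'].replace(df2['customer_id_new'])

print(df1['customer_id'].tolist())
```

Because the result is assigned back to the column, this form does not depend on whether the in-place call propagates to df1.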
Upvotes: 3
Reputation: 153480
You can try this, using map with a pd.Series:
df1['customer_id'] = df1['customer_id'].map(df2.squeeze()).fillna(df1['customer_id'])
or
df1['customer_id'] = df1['customer_id'].map(df2['customer_id_new']).fillna(df1['customer_id'])
Output:
   customer_id
0       1800.0
1       1100.0
2       1500.0
3        100.0
4        170.0
5        120.0
6        190.0
7        140.0
8        130.0
9        160.0
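The .0 in the output comes from map, which returns NaN for ids that are missing from df2 and so upcasts the column to float. If the integer dtype matters, one option is to cast back after fillna. A minimal sketch, assuming the ids are plain integers with no missing values of their own (the five-row sample is illustrative):

```python
import pandas as pd

# Illustrative sample: three ids that have a replacement, two that do not.
df1 = pd.DataFrame({'customer_id': [180, 110, 150, 100, 170]})
df2 = pd.DataFrame({'customer_id_new': [1800, 1100, 1500]},
                   index=[180, 110, 150])

original_dtype = df1['customer_id'].dtype

# map looks each id up in df2's index; unmatched ids become NaN (float column).
mapped = df1['customer_id'].map(df2['customer_id_new'])

# Fill the gaps with the original ids, then restore the integer dtype.
df1['customer_id'] = mapped.fillna(df1['customer_id']).astype(original_dtype)

print(df1['customer_id'].tolist())
```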
Upvotes: 2
Reputation: 51395
Going with your original method using replace, you can simplify it with to_dict to create your mapping dictionary without having to do it manually:
df1["customer_id"] = df1["customer_id"].replace(df2["customer_id_new"].to_dict())
>>> df1
   customer_id
0         1800
1         1100
2         1500
3          100
4          170
5          120
6          190
7          140
8          130
9          160
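To see that to_dict really builds the same mapping as the hand-typed dictionary in the question, here is a small sketch. The five-row df1 is an illustrative sample, and the name mapping is just a local variable chosen for this example.

```python
import pandas as pd

df2 = pd.DataFrame({'customer_id_new': [1800, 1100, 1500]},
                   index=[180, 110, 150])

# Index values become the keys (old ids), column values the new ids.
mapping = df2['customer_id_new'].to_dict()
print(mapping)

# Same pairs as the dictionary typed out by hand in the question.
hand_typed = {180: 1800, 110: 1100, 150: 1500}
print(mapping == hand_typed)

# Use the generated dictionary exactly like the manual one.
df1 = pd.DataFrame({'customer_id': [180, 110, 150, 100, 170]})
df1['customer_id'] = df1['customer_id'].replace(mapping)
```

The dictionary is built once from df2, so the same line works unchanged whether df2 has 3 rows or 3,000.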
Upvotes: 1