Reputation: 11
I have two dataframes and some code to extract data from one of them and add it to the other:
import pandas as pd

sales = pd.read_excel("data.xlsx", sheet_name="sales", header=0)
born = pd.read_excel("data.xlsx", sheet_name="born", header=0)

bornuni = born.number.unique()
for babies in bornuni:
    dataframe = born[born["number"] == babies]
    for i, r in sales.iterrows():
        if r["number"] == babies:
            sales.loc[i, "ini_weight"] = dataframe["weight"].iloc[0]
            sales.loc[i, "ini_date"] = dataframe["date of birth"].iloc[0]
        else:
            pass
This is pretty inefficient with bigger data sets, so I want to parallelize this code but I don't have a clue how to do it. Any help would be great. Here is a link to a mock dataset.
Upvotes: 1
Views: 74
Reputation: 9597
So before worrying about parallelizing, I can't help but notice that you're using lots of for loops to deal with the dataframes. Dataframes are pretty fast when you use their vectorized capabilities, and I see a lot of inefficient pandas use here, so maybe we first fix that and then worry about throwing more CPU cores at it.
It seems to me you want to accomplish the following:
For each unique baby id number in the born dataframe, you want to update the ini_weight and ini_date fields of the corresponding entry in the sales dataframe.
There's a good chance that you can use some dataframe merging / joining to help you with that, as well as using the pivot table functionality:
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
I strongly suggest you take a look at those, try using the ideas from these articles, and then reframe your question in terms of these operations, because as you correctly notice, looping over all the rows repeatedly to find the row with some matching index is very inefficient.
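To make the merge idea concrete, here is a minimal sketch of how it could replace both loops in one vectorized pass. The tiny inline dataframes and their column names ("number", "weight", "date of birth") are assumptions based on the code in the question, since the linked mock dataset isn't available here:

```python
import pandas as pd

# Stand-in data with the column names from the question.
sales = pd.DataFrame({"number": [1, 2, 3], "item": ["a", "b", "c"]})
born = pd.DataFrame({
    "number": [1, 2],
    "weight": [3.2, 2.9],
    "date of birth": ["2020-01-01", "2020-02-01"],
})

# Keep one row per baby id (mirrors taking .iloc[0] in the loop),
# and rename the columns to the names the loop wrote into sales.
first_born = (
    born.drop_duplicates(subset="number")[["number", "weight", "date of birth"]]
    .rename(columns={"weight": "ini_weight", "date of birth": "ini_date"})
)

# One left merge attaches ini_weight / ini_date to every matching
# sales row at once; unmatched rows get NaN, like the loop's `pass`.
sales = sales.merge(first_born, on="number", how="left")
```

This does the same work as the nested loops but lets pandas match the ids internally, which is typically orders of magnitude faster than row-by-row `iterrows` updates.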
Upvotes: 2