Reputation: 91
I would like to create a new column (Col_Val) on my DataFrame (Global_Dataset) based on another DataFrame (List_Data). I need faster code because Global_Dataset has 2 million rows and List_Data contains 50,000 rows.
Col_Val must contain the value from the Value column whose Key matches Col_Key.
List_Data:
id  Key  Value
 1    5      0
 2    7      1
 3    9      2
Global_Dataset:
id  Col_Key  Col_Val
 1        9        2
 2        5        0
 3        9        2
 4        7        1
 5        7        1
 6        5        0
 7        9        2
 8        7        1
 9        9        2
10        5        0
I have tried the code below, but it takes a long time to execute. Is there a faster way to achieve my goal?
Col_Val = []
for i in range(len(List_Data)):
    for j in range(len(Global_Data)):
        if List_Data.get_value(i, 'Key') == Global_Data.get_value(j, 'Col_Key'):
            Col_Val.append(List_Data.get_value(i, 'Value'))

Global_Data['Col_Val'] = Col_Val
PS: I have tried loc and iloc instead of get_value, but they are just as slow.
Upvotes: 1
Views: 42
Reputation: 5955
There is no reason to loop through anything, either manually or with iterrows. If I understand your problem, this should be a simple merge operation.
df

    Key  Value
id
1     5      0
2     7      1
3     9      2
global_df

    Col_Key
id
1         9
2         5
3         9
4         7
5         7
6         5
7         9
8         7
9         9
10        5
global_df.reset_index()\
    .merge(df, left_on='Col_Key', right_on='Key')\
    .drop('Key', axis=1)\
    .set_index('id')\
    .sort_index()
    Col_Key  Value
id
1         9      2
2         5      0
3         9      2
4         7      1
5         7      1
6         5      0
7         9      2
8         7      1
9         9      2
10        5      0
Note that the essence of this is the global_df.merge(...) call; the extra operations only keep the original indexing and remove unwanted extra columns. I encourage you to try each step individually to see the results.
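For completeness, here is a self-contained sketch of the same steps that you can paste into a fresh session. The DataFrame construction is my own reconstruction of the example data above, and it assumes pandas is imported as pd:

import pandas as pd

# Rebuild the example frames from the question (column names and 'id' index assumed).
df = pd.DataFrame({'Key': [5, 7, 9], 'Value': [0, 1, 2]},
                  index=pd.Index([1, 2, 3], name='id'))
global_df = pd.DataFrame({'Col_Key': [9, 5, 9, 7, 7, 5, 9, 7, 9, 5]},
                         index=pd.Index(range(1, 11), name='id'))

# Merge on the key columns, then restore the original 'id' index and row order.
result = (global_df.reset_index()
                   .merge(df, left_on='Col_Key', right_on='Key')
                   .drop('Key', axis=1)
                   .set_index('id')
                   .sort_index())
print(result)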
Upvotes: 2
Reputation: 7361
Try this:
data_dict = {key : value for key, value in zip(List_Data['Key'], List_Data['Value'])}
Global_Data['Col_Val'] = pd.Series([data_dict[key] for key in Global_Data['Col_Key']])
I don't know how long it will take on your machine with the amount of data you need to handle, but it should be faster than what you are using now.
You could also generate the dictionary with data_dict = {row['Key']: row['Value'] for _, row in List_Data.iterrows()}, but on my machine that is slower than what I proposed above.
This works under the assumption that all the keys in Global_Data['Col_Key'] are present in List_Data['Key']; otherwise you will get a KeyError.
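If that assumption might not hold, one variation I would suggest (not part of the answer above) is pandas' built-in Series.map, which performs the same dictionary lookup in a vectorized way and fills missing keys with NaN instead of raising:

# Same data_dict as above; unknown keys become NaN rather than raising a KeyError.
Global_Data['Col_Val'] = Global_Data['Col_Key'].map(data_dict)

This also avoids building the intermediate Python list before constructing the new column.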
Upvotes: 2