Nirmine

Reputation: 91

How can I quickly copy values from one dataframe to another?

I would like to add a new column (Col_Val) to my dataframe (Global_Dataset), based on another dataframe (List_Data). I need faster code because Global_Dataset contains 2 million samples and List_Data contains 50,000 samples.

Col_Val must contain the value from the Value column whose Key matches Col_Key.

List_Data:

id     Key     Value
1       5        0
2       7        1
3       9        2

Global_Dataset:

id    Col_Key     Col_Val
1        9           2
2        5           0
3        9           2 
4        7           1
5        7           1
6        5           0
7        9           2
8        7           1
9        9           2
10       5           0

I have tried this code, but it takes a long time to execute. Is there a faster way to achieve my goal?

Col_Val = []
# O(n*m) scan: for every row of Global_Data, search List_Data for a matching key
for j in range(len(Global_Data)):
    for i in range(len(List_Data)):
        if List_Data.at[i, 'Key'] == Global_Data.at[j, 'Col_Key']:
            Col_Val.append(List_Data.at[i, 'Value'])
            break

Global_Data['Col_Val'] = Col_Val

PS: I have also tried loc and iloc for the element lookups, but they are just as slow.

Upvotes: 1

Views: 42

Answers (2)

G. Anderson

Reputation: 5955

There is no reason to loop through anything, either manually or with iterrows. If I understand your problem, this should be a simple merge operation.

df
    Key Value
id      
1   5   0
2   7   1
3   9   2

global_df
    Col_Key
   id   
1   9
2   5
3   9
4   7
5   7
6   5
7   9
8   7
9   9
10  5

global_df.reset_index()\
.merge(df, left_on='Col_Key', right_on='Key')\
.drop('Key', axis=1)\
.set_index('id')\
.sort_index()

    Col_Key Value
id      
1   9   2
2   5   0
3   9   2
4   7   1
5   7   1
6   5   0
7   9   2
8   7   1
9   9   2
10  5   0

Note that the essence of this is the global_df.merge(...), but the extra operations are to keep the original indexing and remove unwanted extra columns. I encourage you to try each step individually to see the results.
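Putting the chain above into a self-contained sketch (with small frames made up to mirror the example data; `how='left'` is an addition that keeps rows whose Col_Key has no match instead of dropping them):

```python
import pandas as pd

# Hypothetical small frames mirroring the example data
df = pd.DataFrame({'Key': [5, 7, 9], 'Value': [0, 1, 2]},
                  index=pd.Index([1, 2, 3], name='id'))
global_df = pd.DataFrame({'Col_Key': [9, 5, 9, 7, 7, 5, 9, 7, 9, 5]},
                         index=pd.Index(range(1, 11), name='id'))

# reset_index/set_index preserve the original row labels across the merge;
# how='left' keeps unmatched rows (their Value becomes NaN)
result = (global_df.reset_index()
          .merge(df, left_on='Col_Key', right_on='Key', how='left')
          .drop('Key', axis=1)
          .set_index('id')
          .sort_index())

print(result['Value'].tolist())  # → [2, 0, 2, 1, 1, 0, 2, 1, 2, 0]
```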

Upvotes: 2

Valentino

Reputation: 7361

Try this:

data_dict = {key : value for key, value in zip(List_Data['Key'], List_Data['Value'])}
Global_Data['Col_Val'] = pd.Series([data_dict[key] for key in Global_Data['Col_Key']])

I don't know how long it will take on your machine with the amount of data you need to handle, but it should be faster than what you are using now.
You could also generate the dictionary with data_dict = {row['Key'] : row['Value'] for _, row in List_Data.iterrows()}, but on my machine that is slower than what I proposed above.

It works under the assumption that all the keys in Global_Data['Col_Key'] are present in List_Data['Key']; otherwise you will get a KeyError.
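If missing keys are possible, the same dictionary lookup can be done with `Series.map`, which yields NaN for absent keys instead of raising and aligns on the dataframe's own index (a sketch with made-up frames mirroring the example):

```python
import pandas as pd

# Hypothetical frames mirroring the example data
List_Data = pd.DataFrame({'Key': [5, 7, 9], 'Value': [0, 1, 2]})
Global_Data = pd.DataFrame({'Col_Key': [9, 5, 9, 7, 7, 5, 9, 7, 9, 5]})

data_dict = dict(zip(List_Data['Key'], List_Data['Value']))
# map performs the per-element dict lookup; unmatched keys become NaN
Global_Data['Col_Val'] = Global_Data['Col_Key'].map(data_dict)

print(Global_Data['Col_Val'].tolist())  # → [2, 0, 2, 1, 1, 0, 2, 1, 2, 0]
```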

Upvotes: 2
