Reputation: 383
I have a DataFrame scaled_data
with a date column plus 333 feature columns (features_0 through features_332), indexed by integers:
date features_0 ... features_332
0 2016-03-02 0.859325 0.939342
1 2016-03-03 0.930617 0.940042
2 2016-03-04 0.929344 0.936969
3 2016-03-07 0.936346 0.927318
4 2016-03-08 0.935073 0.930111
... ... ... ...
1032 2020-12-10 0.006365 0.037186
1033 2020-12-14 0.005092 0.044053
1034 2020-12-15 0.005092 0.047346
1035 2020-12-16 0.002546 0.051075
1036 2020-12-17 0.000000 0.051563
[1037 rows x 334 columns]
I also have a DataFrame returns[['AN8068571086']]
, a single-column slice of another DataFrame returns
, this one indexed by a DatetimeIndex
:
AN8068571086
2016-03-02 -0.010948
2016-03-03 -0.008819
2016-03-04 -0.008740
2016-03-07 0.003754
2016-03-08 -0.021088
... ...
2020-12-24 -0.011732
2020-12-28 -0.011401
2020-12-29 0.002753
2020-12-30 0.012435
2020-12-31 -0.007396
[1219 rows x 1 columns]
Note that some of the rows of scaled_data
are missing/cleaned, but all of its rows should be contained in the rows of returns[['AN8068571086']]
. Now what I would like is to merge returns[['AN8068571086']]
into scaled_data
such that any rows of scaled_data
that aren't in returns[['AN8068571086']]
are removed after the merge, i.e. an inner join. Thanks in advance!
What I currently get looks like this, with NaN rows for the dates that are missing from scaled_data:
date features_0 features_1 features_2 features_3 features_4 \
0.0 2016-03-02 0.859325 0.000000 0.529731 0.146283 0.025910
1.0 2016-03-03 0.930617 0.333333 0.540146 0.132307 0.030436
2.0 2016-03-04 0.929344 0.000000 0.537209 0.133358 0.030373
3.0 2016-03-07 0.936346 0.000000 0.536281 0.134357 0.031109
4.0 2016-03-08 0.935073 0.000000 0.535727 0.135017 0.030369
.. ... ... ... ... ... ...
NaN 2020-12-24 NaN NaN NaN NaN NaN
NaN 2020-12-28 NaN NaN NaN NaN NaN
NaN 2020-12-29 NaN NaN NaN NaN NaN
NaN 2020-12-30 NaN NaN NaN NaN NaN
NaN 2020-12-31 NaN NaN NaN NaN NaN
features_5 features_6 features_7 features_8 ... features_324 \
0.0 0.000239 0.392884 0.168576 0.202672 ... 0.648791
1.0 0.000449 0.485534 0.080250 0.198381 ... 0.716234
2.0 0.000454 0.485841 0.080611 0.201828 ... 0.790423
3.0 0.000525 0.388892 0.153548 0.205674 ... 0.838849
4.0 0.000441 0.473135 0.093709 0.207139 ... 0.844005
.. ... ... ... ... ... ...
NaN NaN NaN NaN NaN ... NaN
NaN NaN NaN NaN NaN ... NaN
NaN NaN NaN NaN NaN ... NaN
NaN NaN NaN NaN NaN ... NaN
NaN NaN NaN NaN NaN ... NaN
features_325 features_326 features_327 features_328 features_329 \
0.0 0.646830 0.395249 0.575381 0.465825 0.572961
1.0 0.590791 0.409932 0.579916 0.485360 0.610695
2.0 0.591864 0.374055 0.584935 0.520497 0.590541
3.0 0.578809 0.169337 0.582494 0.482946 0.593930
4.0 0.616921 0.476683 0.618526 0.528613 0.632306
.. ... ... ... ... ...
NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN
features_330 features_331 features_332 AN8068571086
0.0 0.848330 0.646147 0.939342 -0.010948
1.0 0.820516 0.650472 0.940042 -0.008819
2.0 0.792425 0.622573 0.936969 -0.008740
3.0 0.749944 0.621197 0.927318 0.003754
4.0 0.712940 0.653823 0.930111 -0.021088
.. ... ... ... ...
NaN NaN NaN NaN -0.011732
NaN NaN NaN NaN -0.011401
NaN NaN NaN NaN 0.002753
NaN NaN NaN NaN 0.012435
NaN NaN NaN NaN -0.007396
[1219 rows x 335 columns]
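For reference, the behaviour described above can be sketched on toy stand-ins for the two frames (the frame shapes and values here are made up for illustration): an inner join keeps only the dates present in both sides, so the all-NaN rows disappear.

```python
import pandas as pd

# Toy stand-in for scaled_data: integer index, 'date' column, one feature.
scaled_data = pd.DataFrame({
    "date": pd.to_datetime(["2016-03-02", "2016-03-03", "2016-03-04"]),
    "features_0": [0.859325, 0.930617, 0.929344],
})

# Toy stand-in for returns[['AN8068571086']]: DatetimeIndex with one
# extra date (2016-03-07) that scaled_data does not have.
returns = pd.DataFrame(
    {"AN8068571086": [-0.010948, -0.008819, -0.008740, 0.003754]},
    index=pd.to_datetime(["2016-03-02", "2016-03-03",
                          "2016-03-04", "2016-03-07"]),
)

# An inner join keeps only dates present in both frames, so the
# unmatched 2016-03-07 row from `returns` is dropped and no NaNs appear.
out = pd.merge(scaled_data, returns[["AN8068571086"]],
               left_on="date", right_index=True, how="inner")
```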
Upvotes: 0
Views: 108
Reputation: 120559
Use pd.merge
with the how='left'
parameter:
# ensure the 'date' column is datetime, so it matches returns' DatetimeIndex
scaled_data['date'] = pd.to_datetime(scaled_data['date'])
# alternatively, if the index of returns doesn't line up, realign it by date:
# returns = returns.reindex(returns.index.date)

out = pd.merge(scaled_data, returns[['AN8068571086']],
               left_on='date', right_index=True, how='left')
Example:
>>> scaled_data
date features_0
0 2016-03-02 0.859325
0 2016-03-02 0.859325
1 2016-03-03 0.930617
2 2016-03-04 0.929344
>>> returns
AN8068571086
2016-03-02 -0.010948
2016-03-03 -0.008819
2016-03-04 -0.008740
2016-03-07 0.003754
2016-03-08 -0.021088
>>> out
date features_0 AN8068571086
0 2016-03-02 0.859325 -0.010948
0 2016-03-02 0.859325 -0.010948
1 2016-03-03 0.930617 -0.008819
2 2016-03-04 0.929344 -0.008740
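Since every date in scaled_data is expected to appear in returns, how='left' and how='inner' should produce the same rows here; a quick sketch verifying that equivalence on toy frames (values made up, names mirror the example above):

```python
import pandas as pd

scaled_data = pd.DataFrame({
    "date": pd.to_datetime(["2016-03-02", "2016-03-03", "2016-03-04"]),
    "features_0": [0.859325, 0.930617, 0.929344],
})
returns = pd.DataFrame(
    {"AN8068571086": [-0.010948, -0.008819, -0.008740, 0.003754, -0.021088]},
    index=pd.to_datetime(["2016-03-02", "2016-03-03", "2016-03-04",
                          "2016-03-07", "2016-03-08"]),
)

left = pd.merge(scaled_data, returns[["AN8068571086"]],
                left_on="date", right_index=True, how="left")
inner = pd.merge(scaled_data, returns[["AN8068571086"]],
                 left_on="date", right_index=True, how="inner")

# Both keep exactly scaled_data's rows, because every date has a match
# in returns; a left join only differs when some left date is unmatched.
assert left.reset_index(drop=True).equals(inner.reset_index(drop=True))
```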
Upvotes: 1