Brian
Brian

Reputation: 13573

How to merge two DataFrames with DatetimeIndex preserved in pandas?

I have 2 DataFrames, df1, and df2.

df1 has the following contents:

           Adj Close             Close              High               Low  \
                GBTC         QQQ  GBTC         QQQ  GBTC         QQQ  GBTC   
Date                                                                         
2019-01-29      4.02  159.342209  4.02  161.570007  4.07  163.240005  3.93   
2019-01-30      4.06  163.395538  4.06  165.679993  4.09  166.279999  4.01   
2019-01-31      3.99  165.841370  3.99  168.160004  4.06  168.990005  3.93   
2019-02-01      4.02  165.141129  4.02  167.449997  4.07  168.600006  3.93   
2019-02-04      3.96  167.192474  3.96  169.529999  4.00  169.529999  3.93   
...              ...         ...   ...         ...   ...         ...   ...   
2019-02-25      4.65  171.127441  4.65  173.520004  4.78  174.660004  4.50   
2019-02-26      4.36  171.304947  4.36  173.699997  4.74  174.250000  4.36   
2019-02-27      4.30  171.196487  4.30  173.589996  4.50  173.800003  4.30   
2019-02-28      4.46  170.802002  4.46  173.190002  4.65  173.809998  4.40   
2019-03-01      4.58  171.985443  4.58  174.389999  4.64  174.649994  4.45   

                         Open               Volume            
                   QQQ   GBTC         QQQ     GBTC       QQQ  
Date                                                          
2019-01-29  160.990005  3.970  163.199997   975200  30784200  
2019-01-30  162.889999  4.035  163.399994   770700  41346500  
2019-01-31  166.470001  4.040  166.699997  1108700  37258400  
2019-02-01  166.990005  4.000  167.330002   889100  32143700  
2019-02-04  167.330002  3.990  167.479996   871800  26718800  
...                ...    ...         ...      ...       ...  
2019-02-25  173.399994  4.625  174.210007  2891200  32608800  
2019-02-26  172.809998  4.625  173.100006  2000100  21939700  
2019-02-27  171.759995  4.400  172.899994  1537000  25162000  
2019-02-28  172.699997  4.420  173.050003  1192600  25085500  
2019-03-01  173.179993  4.470  174.440002   948500  31431200  

[23 rows x 12 columns]

And here's the contents of df2:

           Adj Close             Close              High               Low  \
                GBTC         QQQ  GBTC         QQQ  GBTC         QQQ  GBTC   
Date                                                                         
2019-02-25      4.65  171.127441  4.65  173.520004  4.78  174.660004  4.50   
2019-02-26      4.36  171.304947  4.36  173.699997  4.74  174.250000  4.36   
2019-02-27      4.30  171.196487  4.30  173.589996  4.50  173.800003  4.30   
2019-02-28      4.46  170.802002  4.46  173.190002  4.65  173.809998  4.40   
2019-03-01      4.58  171.985443  4.58  174.389999  4.64  174.649994  4.45   
...              ...         ...   ...         ...   ...         ...   ...   
2019-03-28      4.54  176.171432  4.54  178.309998  4.68  178.979996  4.51   
2019-03-29      4.78  177.505249  4.78  179.660004  4.83  179.830002  4.55   
2019-04-01      4.97  179.856705  4.97  182.039993  5.03  182.259995  4.85   
2019-04-02      5.74  180.538437  5.74  182.729996  5.83  182.910004  5.52   
2019-04-03      6.19  181.575836  6.19  183.779999  6.59  184.919998  5.93   

                         Open                Volume            
                   QQQ   GBTC         QQQ      GBTC       QQQ  
Date                                                           
2019-02-25  173.399994  4.625  174.210007   2891200  32608800  
2019-02-26  172.809998  4.625  173.100006   2000100  21939700  
2019-02-27  171.759995  4.400  172.899994   1537000  25162000  
2019-02-28  172.699997  4.420  173.050003   1192600  25085500  
2019-03-01  173.179993  4.470  174.440002    948500  31431200  
...                ...    ...         ...       ...       ...  
2019-03-28  177.240005  4.650  178.360001   2104400  30368200  
2019-03-29  178.589996  4.710  179.690002   2937400  35205500  
2019-04-01  180.770004  4.850  181.509995   2733600  30969500  
2019-04-02  181.779999  5.660  182.240005   6062000  22645200  
2019-04-03  183.210007  5.930  183.759995  10002400  31633500  

[28 rows x 12 columns]

As you can see from the above, df1 and df2 have overlapping Dates.

How can I create a merged DataFrame df that contains dates from 2019-01-29 to 2019-04-03 with no overlapping Date?

I've tried running df = df1.merge(df2, how='outer'). However, this command returns a DataFrame with Date removed, which is not something desirable.

> df
  Adj Close             Close              High               Low  \
        GBTC         QQQ  GBTC         QQQ  GBTC         QQQ  GBTC   
0       4.02  159.342209  4.02  161.570007  4.07  163.240005  3.93   
1       4.06  163.395538  4.06  165.679993  4.09  166.279999  4.01   
2       3.99  165.841370  3.99  168.160004  4.06  168.990005  3.93   
3       4.02  165.141129  4.02  167.449997  4.07  168.600006  3.93   
4       3.96  167.192474  3.96  169.529999  4.00  169.529999  3.93   
..       ...         ...   ...         ...   ...         ...   ...   
41      4.54  176.171432  4.54  178.309998  4.68  178.979996  4.51   
42      4.78  177.505249  4.78  179.660004  4.83  179.830002  4.55   
43      4.97  179.856705  4.97  182.039993  5.03  182.259995  4.85   
44      5.74  180.538437  5.74  182.729996  5.83  182.910004  5.52   
45      6.19  181.575836  6.19  183.779999  6.59  184.919998  5.93   

                 Open                Volume            
           QQQ   GBTC         QQQ      GBTC       QQQ  
0   160.990005  3.970  163.199997    975200  30784200  
1   162.889999  4.035  163.399994    770700  41346500  
2   166.470001  4.040  166.699997   1108700  37258400  
3   166.990005  4.000  167.330002    889100  32143700  
4   167.330002  3.990  167.479996    871800  26718800  
..         ...    ...         ...       ...       ...  
41  177.240005  4.650  178.360001   2104400  30368200  
42  178.589996  4.710  179.690002   2937400  35205500  
43  180.770004  4.850  181.509995   2733600  30969500  
44  181.779999  5.660  182.240005   6062000  22645200  
45  183.210007  5.930  183.759995  10002400  31633500  

[46 rows x 12 columns]

It seems that I should find a way to merge df1.index and df2.index. Then add the merged DatetimeIndex to df.

For the convenience of debugging, you can run the following code to get the same data as mine.

import yfinance as yf

symbols = ['QQQ', 'GBTC']
df1 = yf.download(symbols, start="2019-01-29", end="2019-03-01")
df2 = yf.download(symbols, start="2019-02-25", end="2019-04-03")

Upvotes: 1

Views: 280

Answers (1)

mullinscr
mullinscr

Reputation: 1738

Taken from the docs:

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

So I believe that if you specify the index in the merge with on=Date, then you should be ok.

df1.merge(df2, how='outer', on='Date')

However, for the problem that you are trying to solve merge is note the correct tool. What you need to do is append the dataframes together and then remove the duplicated days:

df1.append(df2).drop_duplicates()

Upvotes: 1

Related Questions