Reputation: 13573
I have 2 DataFrames, df1
, and df2
.
df1 has the following contents:
Adj Close Close High Low \
GBTC QQQ GBTC QQQ GBTC QQQ GBTC
Date
2019-01-29 4.02 159.342209 4.02 161.570007 4.07 163.240005 3.93
2019-01-30 4.06 163.395538 4.06 165.679993 4.09 166.279999 4.01
2019-01-31 3.99 165.841370 3.99 168.160004 4.06 168.990005 3.93
2019-02-01 4.02 165.141129 4.02 167.449997 4.07 168.600006 3.93
2019-02-04 3.96 167.192474 3.96 169.529999 4.00 169.529999 3.93
... ... ... ... ... ... ... ...
2019-02-25 4.65 171.127441 4.65 173.520004 4.78 174.660004 4.50
2019-02-26 4.36 171.304947 4.36 173.699997 4.74 174.250000 4.36
2019-02-27 4.30 171.196487 4.30 173.589996 4.50 173.800003 4.30
2019-02-28 4.46 170.802002 4.46 173.190002 4.65 173.809998 4.40
2019-03-01 4.58 171.985443 4.58 174.389999 4.64 174.649994 4.45
Open Volume
QQQ GBTC QQQ GBTC QQQ
Date
2019-01-29 160.990005 3.970 163.199997 975200 30784200
2019-01-30 162.889999 4.035 163.399994 770700 41346500
2019-01-31 166.470001 4.040 166.699997 1108700 37258400
2019-02-01 166.990005 4.000 167.330002 889100 32143700
2019-02-04 167.330002 3.990 167.479996 871800 26718800
... ... ... ... ... ...
2019-02-25 173.399994 4.625 174.210007 2891200 32608800
2019-02-26 172.809998 4.625 173.100006 2000100 21939700
2019-02-27 171.759995 4.400 172.899994 1537000 25162000
2019-02-28 172.699997 4.420 173.050003 1192600 25085500
2019-03-01 173.179993 4.470 174.440002 948500 31431200
[23 rows x 12 columns]
And here's the contents of df2
:
Adj Close Close High Low \
GBTC QQQ GBTC QQQ GBTC QQQ GBTC
Date
2019-02-25 4.65 171.127441 4.65 173.520004 4.78 174.660004 4.50
2019-02-26 4.36 171.304947 4.36 173.699997 4.74 174.250000 4.36
2019-02-27 4.30 171.196487 4.30 173.589996 4.50 173.800003 4.30
2019-02-28 4.46 170.802002 4.46 173.190002 4.65 173.809998 4.40
2019-03-01 4.58 171.985443 4.58 174.389999 4.64 174.649994 4.45
... ... ... ... ... ... ... ...
2019-03-28 4.54 176.171432 4.54 178.309998 4.68 178.979996 4.51
2019-03-29 4.78 177.505249 4.78 179.660004 4.83 179.830002 4.55
2019-04-01 4.97 179.856705 4.97 182.039993 5.03 182.259995 4.85
2019-04-02 5.74 180.538437 5.74 182.729996 5.83 182.910004 5.52
2019-04-03 6.19 181.575836 6.19 183.779999 6.59 184.919998 5.93
Open Volume
QQQ GBTC QQQ GBTC QQQ
Date
2019-02-25 173.399994 4.625 174.210007 2891200 32608800
2019-02-26 172.809998 4.625 173.100006 2000100 21939700
2019-02-27 171.759995 4.400 172.899994 1537000 25162000
2019-02-28 172.699997 4.420 173.050003 1192600 25085500
2019-03-01 173.179993 4.470 174.440002 948500 31431200
... ... ... ... ... ...
2019-03-28 177.240005 4.650 178.360001 2104400 30368200
2019-03-29 178.589996 4.710 179.690002 2937400 35205500
2019-04-01 180.770004 4.850 181.509995 2733600 30969500
2019-04-02 181.779999 5.660 182.240005 6062000 22645200
2019-04-03 183.210007 5.930 183.759995 10002400 31633500
[28 rows x 12 columns]
As you can see from the above, df1
and df2
have overlapping Date
s.
How can I create a merged DataFrame df
that contains dates from 2019-01-29
to 2019-04-03
with no overlapping Date?
I've tried running df = df1.merge(df2, how='outer')
. However, this command returns a DataFrame with Date
removed, which is not something desirable.
> df
Adj Close Close High Low \
GBTC QQQ GBTC QQQ GBTC QQQ GBTC
0 4.02 159.342209 4.02 161.570007 4.07 163.240005 3.93
1 4.06 163.395538 4.06 165.679993 4.09 166.279999 4.01
2 3.99 165.841370 3.99 168.160004 4.06 168.990005 3.93
3 4.02 165.141129 4.02 167.449997 4.07 168.600006 3.93
4 3.96 167.192474 3.96 169.529999 4.00 169.529999 3.93
.. ... ... ... ... ... ... ...
41 4.54 176.171432 4.54 178.309998 4.68 178.979996 4.51
42 4.78 177.505249 4.78 179.660004 4.83 179.830002 4.55
43 4.97 179.856705 4.97 182.039993 5.03 182.259995 4.85
44 5.74 180.538437 5.74 182.729996 5.83 182.910004 5.52
45 6.19 181.575836 6.19 183.779999 6.59 184.919998 5.93
Open Volume
QQQ GBTC QQQ GBTC QQQ
0 160.990005 3.970 163.199997 975200 30784200
1 162.889999 4.035 163.399994 770700 41346500
2 166.470001 4.040 166.699997 1108700 37258400
3 166.990005 4.000 167.330002 889100 32143700
4 167.330002 3.990 167.479996 871800 26718800
.. ... ... ... ... ...
41 177.240005 4.650 178.360001 2104400 30368200
42 178.589996 4.710 179.690002 2937400 35205500
43 180.770004 4.850 181.509995 2733600 30969500
44 181.779999 5.660 182.240005 6062000 22645200
45 183.210007 5.930 183.759995 10002400 31633500
[46 rows x 12 columns]
It seems that I should find a way to merge df1.index
and df2.index
. Then add the merged DatetimeIndex to df
.
For the convenience of debugging, you can run the following code to get the same data as mine.
import yfinance as yf
symbols = ['QQQ', 'GBTC']
df1 = yf.download(symbols, start="2019-01-29", end="2019-03-01")
df2 = yf.download(symbols, start="2019-02-25", end="2019-04-03")
Upvotes: 1
Views: 280
Reputation: 1738
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
So I believe that if you specify the index in the merge with on=Date
, then you should be ok.
df1.merge(df2, how='outer', on='Date')
However, for the problem that you are trying to solve merge
is note the correct tool. What you need to do is append
the dataframes together and then remove the duplicated days:
df1.append(df2).drop_duplicates()
Upvotes: 1