Reputation: 43
i have edited this post with the specific case:
i have a list of dataframes like this (note that df1 and df2 have a row in common)
df1
index | Date | A |
---|---|---|
0 | 2010-06-19 | 4 |
1 | 2010-06-20 | 3 |
2 | 2010-06-21 | 2 |
3 | 2010-06-22 | 1 |
4 | 2012-07-19 | 5 |
df2
index | Date | B |
---|---|---|
0 | 2012-07-19 | 5 |
1 | 2012-07-20 | 6 |
df3
index | Date | C |
---|---|---|
0 | 2020-06-19 | 5 |
1 | 2020-06-20 | 2 |
2 | 2020-06-21 | 9 |
df_list = [df1, df2, df3]
I would like to merge all dataframes in a single dataframe, without losing rows and placing nan where there are no things to merge. The criteria would be merging them by the column 'Date' (the column should have all the dates of all the merged dataframes, ordered by date).
The resulting dataframe should look like this:
Resulting Dataframe:
index | Date | A | B | C |
---|---|---|---|---|
0 | 2010-06-19 | 4 | nan | nan |
1 | 2010-06-20 | 3 | nan | nan |
2 | 2010-06-21 | 2 | nan | nan |
3 | 2010-06-22 | 1 | nan | nan |
4 | 2012-07-19 | 5 | 5 | nan |
5 | 2012-07-20 | nan | 6 | nan |
6 | 2020-06-19 | nan | nan | 5 |
7 | 2020-06-20 | nan | nan | 2 |
8 | 2020-06-21 | nan | nan | 9 |
I tried something like this:
from functools import reduce
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['Date'], how='outer'), df_list)
BUT the resulting dataframe is not as expected (i miss some columns and is not ordered by date). I think i am missing something.
Thank you very much
Upvotes: 1
Views: 6743
Reputation: 43
For overlapping data, I had to add the option: Sort = TRUE
in the lambda function. Seemed I was missing the order for big dataframes and I was only seeing the nan at the end and start of frames. Thank you all ;-)
from functools import reduce
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['Date'],
how='outer', sort=True), df_list)
Upvotes: 1
Reputation:
Use pandas.concat()
. It takes a list of dataframes, and appends common columns together, filling new columns with NaN as necessary:
new_df = pd.concat([df1, df2, df3])
Output:
>>> new_df
index Date A B C
0 0 2010-06-19 4.0 NaN NaN
1 1 2010-06-20 3.0 NaN NaN
2 2 2010-06-21 2.0 NaN NaN
3 3 2010-06-22 1.0 NaN NaN
0 0 2012-07-19 NaN 5.0 NaN
1 1 2012-07-20 NaN 6.0 NaN
0 0 2020-06-19 NaN NaN 5.0
1 1 2020-06-20 NaN NaN 2.0
2 2 2020-06-21 NaN NaN 9.0
Upvotes: 0