Reputation: 13570
I'm having a hard time merging and updating Pandas dataframes right now.
I have a bunch of CSV files that I'm parsing with pandas (which is not a problem). In a few cases I have multiple files that contain some of the same columns.
So, for example, let's say I have:
import pandas as pd
a = pd.DataFrame({"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}, index=[0,1,2,3])
b = pd.DataFrame({"A": [11, 12, 13, 14]}, index=[41,51,61,71])
c = pd.DataFrame({"A": [110, 111, 113]}, index=[0,1,3])
What I want is this dataframe:
A B
0 110 4
1 111 5
2 2 6
3 113 7
41 11 NaN
51 12 NaN
61 13 NaN
71 14 NaN
Pandas has this nice guide: Merge, join, concatenate and compare. But I fail to find a solution for what I want to achieve.
For example, a.join(b, how="outer") raises ValueError: columns overlap but no suffix specified: Index(['A'], dtype='object'). Passing rsuffix="R" is not an option, because the end result is:
A B AR
0 0.0 4.0 NaN
1 1.0 5.0 NaN
2 2.0 6.0 NaN
3 3.0 7.0 NaN
41 NaN NaN 11.0
51 NaN NaN 12.0
61 NaN NaN 13.0
71 NaN NaN 14.0
Not quite what I want.
pd.merge(a, b, how="outer") looks promising, but it is not quite right either, because the indices are ignored:
A B
0 0 4.0
1 1 5.0
2 2 6.0
3 3 7.0
4 11 NaN
5 12 NaN
6 13 NaN
7 14 NaN
Passing left_index=True and right_index=True yields a dataframe similar to .join(..., rsuffix="_x", lsuffix="_y"), so not what I want.
Using update is almost what I want: a.update(c) modifies a in place to
A B
0 110.0 4
1 111.0 5
2 2.0 6
3 113.0 7
but a.update(b) does nothing (I assume because the indices of a and b are disjoint).
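To illustrate why: update aligns the other frame on the caller's index and only overwrites values at matching labels, so with fully disjoint indices there is nothing to overwrite. A minimal check:

```python
import pandas as pd

a = pd.DataFrame({"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}, index=[0, 1, 2, 3])
b = pd.DataFrame({"A": [11, 12, 13, 14]}, index=[41, 51, 61, 71])

before = a.copy()
a.update(b)  # no index labels in common -> no cells are overwritten
print(a.equals(before))  # True
```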
So, is what I want even possible with a single line of code?
EDIT
I came up with this one:
pd.concat([a, b, c]).sort_index().drop_duplicates().groupby(level=0).last()
A B
0 110 4.0
1 111 5.0
2 2 6.0
3 113 7.0
41 11 NaN
51 12 NaN
61 13 NaN
71 14 NaN
This is what I want, the question is: is this correct or just a coincidence that this yields the same result as I wanted?
How are you determining which 'A' column has priority?
In the order I'm reading the files. The files are generated by a device (which is kind of a "black box" to me) and have a date in their names. So I do:
tasks = [parse_csv_file(fn) for fn in sorted(glob.glob("*.csv"))]
results = await asyncio.gather(*tasks)
And I would like to do (no error checking, as this is an example):
results = iter(results)
merged_df = next(results)
for df in results:
    merged_df = the_magic_function_Im_looking_for(merged_df, df)
Upvotes: 1
Views: 1763
Reputation: 18306
Reducing with combine_first:
from functools import reduce
to_merge = [c, b, a]
result = reduce(pd.DataFrame.combine_first, to_merge)
which successively applies combine_first to the entries of the list, ending up with one fully combined, i.e., reduced, dataframe. (We can pass reversed(to_merge) to reduce if to_merge comes in the reverse order.) We get:
>>> result
A B
0 110.0 4.0
1 111.0 5.0
2 2.0 6.0
3 113.0 7.0
41 11.0 NaN
51 12.0 NaN
61 13.0 NaN
71 14.0 NaN
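For context on why the order of to_merge matters: combine_first keeps the calling frame's non-null values and only fills its gaps from the argument, so the first frame in the list has the highest priority. A minimal two-frame check, using the question's a and c:

```python
import pandas as pd

a = pd.DataFrame({"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}, index=[0, 1, 2, 3])
c = pd.DataFrame({"A": [110, 111, 113]}, index=[0, 1, 3])

# c's values win wherever c has data; a fills the rest (index 2 and column B)
out = c.combine_first(a)
print(out)
```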
Upvotes: 4
Reputation: 35626
Try concat + groupby last on axis=1 to merge the dataframes, then take the "last" valid value per column group:
df = pd.concat([a, b, c], axis=1).groupby(level=0, axis=1).last()
df:
A B
0 110.0 4.0
1 111.0 5.0
2 2.0 6.0
3 113.0 7.0
41 11.0 NaN
51 12.0 NaN
61 13.0 NaN
71 14.0 NaN
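Note that the axis=1 form of groupby is deprecated in recent pandas (2.1+). A transpose-based equivalent, as a sketch, is:

```python
import pandas as pd

a = pd.DataFrame({"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}, index=[0, 1, 2, 3])
b = pd.DataFrame({"A": [11, 12, 13, 14]}, index=[41, 51, 61, 71])
c = pd.DataFrame({"A": [110, 111, 113]}, index=[0, 1, 3])

# transpose, group the (now row) labels, take the last non-null value, transpose back
df = pd.concat([a, b, c], axis=1).T.groupby(level=0).last().T
```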
Or concat-ing long and getting the last valid row per index label, thanks to @anky:
df = pd.concat([a, b, c]).groupby(level=0).last()
df:
A B
0 110 4.0
1 111 5.0
2 2 6.0
3 113 7.0
41 11 NaN
51 12 NaN
61 13 NaN
71 14 NaN
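To plug this into the question's file-reading loop, the long-concat pattern can be wrapped as the sought-after merge function (merge_latest is a hypothetical name; later frames win because groupby's last skips NaN and takes the last non-null value per label):

```python
import pandas as pd

def merge_latest(merged_df, df):
    # stack the two frames and keep the last non-null value per index label,
    # so values from df (the later file) override those in merged_df
    return pd.concat([merged_df, df]).groupby(level=0).last()

a = pd.DataFrame({"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}, index=[0, 1, 2, 3])
c = pd.DataFrame({"A": [110, 111, 113]}, index=[0, 1, 3])

merged = merge_latest(a, c)  # c's values override a's where both exist
```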
Upvotes: 4