Pablo

Reputation: 13570

Merging and updating multiple pandas dataframes with overlapping columns

I'm having a hard time merging and updating pandas dataframes right now.

I have a bunch of CSV files that I'm parsing with pandas (which is not a problem). In a few cases I have multiple files that contain some of the same columns.

So, for example, let's say I have:

import pandas as pd

a = pd.DataFrame({"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}, index=[0,1,2,3])
b = pd.DataFrame({"A": [11, 12, 13, 14]}, index=[41,51,61,71])
c = pd.DataFrame({"A": [110, 111, 113]}, index=[0,1,3])

What I want is this dataframe:

      A    B
0   110    4
1   111    5
2     2    6
3   113    7
41   11  NaN
51   12  NaN
61   13  NaN
71   14  NaN

Pandas has this nice guide: Merge, join, concatenate and compare. But I fail to find a solution there for what I want to achieve.

For example a.join(b, how="outer") raises ValueError: columns overlap but no suffix specified: Index(['A'], dtype='object'). Passing rsuffix="R" is not an option, because the end result is:

      A    B    AR
0   0.0  4.0   NaN
1   1.0  5.0   NaN
2   2.0  6.0   NaN
3   3.0  7.0   NaN
41  NaN  NaN  11.0
51  NaN  NaN  12.0
61  NaN  NaN  13.0
71  NaN  NaN  14.0

Not quite what I want.

pd.merge(a, b, how="outer") looks promising, but it is not quite right either, because the indices are ignored:

    A    B
0   0  4.0
1   1  5.0
2   2  6.0
3   3  7.0
4  11  NaN
5  12  NaN
6  13  NaN
7  14  NaN

Passing left_index=True and right_index=True yields a dataframe with suffixed columns, just like .join(..., lsuffix="_x", rsuffix="_y"), so again not what I want.

Using update is almost what I want: a.update(c) modifies a in place to

       A  B
0  110.0  4
1  111.0  5
2    2.0  6
3  113.0  7

but a.update(b) does nothing (I assume because the indices of a and b are disjoint, and update never adds new rows).
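
A quick check of that assumption, reusing a and b from above (a2 is just a throwaway copy):

a2 = a.copy()
# update aligns on index labels; none of b's labels (41, 51, 61, 71)
# exist in a2, so nothing is overwritten and no rows are added
a2.update(b)
print(a2.equals(a))  # True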

So, is what I want even possible with a single line of code?

EDIT

I came up with this one:

lll = pd.concat([a, b, c]).sort_index()
lll.drop_duplicates().groupby(level=0).last()

      A    B
0   110  4.0
1   111  5.0
2     2  6.0
3   113  7.0
41   11  NaN
51   12  NaN
61   13  NaN
71   14  NaN

This is what I want; the question is: is this correct, or is it just a coincidence that it yields the result I wanted?


How are you determining which 'A' column has priority?

In the order I'm reading the files. They are generated by a device (which is kind of a "black box" to me) that puts a date in each filename. So I do:


import asyncio, glob

# the filenames contain a date, so sorted() yields them in date order
tasks = [parse_csv_file(fn) for fn in sorted(glob.glob("*.csv"))]
results = await asyncio.gather(*tasks)

And I would like to do (no error checking as this is an example):

results = iter(results)
merged_df = next(results)

for df in results:
    merged_df = the_magic_function_Im_looking_for(merged_df, df)
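
(With combine_first from the answers below, and assuming later files should take priority, that loop body would be:)

for df in results:
    # the caller's values win, so each later file overrides the merge so far
    merged_df = df.combine_first(merged_df)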

Upvotes: 1

Views: 1763

Answers (2)

Mustafa Aydın

Reputation: 18306

reduce-ing with combine_first:

from functools import reduce

to_merge = [c, b, a]

result = reduce(pd.DataFrame.combine_first, to_merge)

which successively applies combine_first to the entries of the list, ending up with the all-combined, i.e., reduced, dataframe.

(We can pass reversed(to_merge) to reduce if to_merge comes in the opposite order.)

to get

>>> result

        A    B
0   110.0  4.0
1   111.0  5.0
2     2.0  6.0
3   113.0  7.0
41   11.0  NaN
51   12.0  NaN
61   13.0  NaN
71   14.0  NaN
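
Spelled out, the reduce call above is equivalent to chaining combine_first by hand:

# combine_first keeps the caller's values and fills its missing
# entries from the argument, so c has the highest priority here
result = c.combine_first(b).combine_first(a)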

Upvotes: 4

Henry Ecker

Reputation: 35626

Try concat + groupby last on axis=1 to merge the DataFrames, then take the "last" valid value per column group:

df = pd.concat([a, b, c], axis=1).groupby(level=0, axis=1).last()

df:

        A    B
0   110.0  4.0
1   111.0  5.0
2     2.0  6.0
3   113.0  7.0
41   11.0  NaN
51   12.0  NaN
61   13.0  NaN
71   14.0  NaN

Or concatenating long and taking the last valid value per row index, thanks to @anky:

df = pd.concat([a, b, c]).groupby(level=0).last()

df:

      A    B
0   110  4.0
1   111  5.0
2     2  6.0
3   113  7.0
41   11  NaN
51   12  NaN
61   13  NaN
71   14  NaN
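
This works because GroupBy.last takes the last non-null value per column within each group, not simply the final row; a tiny sketch with a toy frame (not from the question) shows it:

import numpy as np

# two rows share label 0: A comes from the second row,
# B from the first, because last() skips NaN
demo = pd.DataFrame({"A": [0, 110], "B": [4, np.nan]}, index=[0, 0])
print(demo.groupby(level=0).last())
#      A    B
# 0  110  4.0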

Upvotes: 4
