dylanjm
dylanjm

Reputation: 2101

Coalesce duplicate columns created by pandas concat?

I am working with sets of data that essentially provide me a list of MultiIndex DataFrames each with one column. This data is indexed by State, Year, and Hour. I am looking for a way to create one large DataFrame that holds all of the data with the index intact.

My problem: when I go to concatenate this list of DataFrames, I am ending up with duplicate columns that should really be "smushed" together.

Here is a reproducible example:

#!/usr/bin/env python3
import pandas as pd
import numpy as np


oh_vars = [f'OH-var{i}' for i in range(1, 4)]
il_vars = [f'IL-var{i}' for i in range(1, 4)]
all_vars = oh_vars + il_vars

dats = []
for var in all_vars:
    state = (sp:=var.split('-'))[0]
    col = sp[1]
    idx = pd.MultiIndex.from_product(
        [[state], np.arange(2015, 2030, 5), range(1, 4)],
        names=['state', 'year', 'hour']
    )
    vals = np.random.rand(9)
    df = pd.DataFrame({col: vals}, index=idx)
    dats.append(df)

pd.concat(dats, axis=1)
#                      var1      var2      var3      var1      var2      var3
# state year hour
# IL    2015 1          NaN       NaN       NaN  0.431365  0.127444  0.081975
#            2          NaN       NaN       NaN  0.430269  0.822969  0.014894
#            3          NaN       NaN       NaN  0.019409  0.418380  0.306845
#       2020 1          NaN       NaN       NaN  0.860500  0.887411  0.614720
#            2          NaN       NaN       NaN  0.094476  0.062544  0.236143
#            3          NaN       NaN       NaN  0.956516  0.616292  0.550528
#       2025 1          NaN       NaN       NaN  0.594198  0.252347  0.534134
#            2          NaN       NaN       NaN  0.209072  0.044660  0.763966
#            3          NaN       NaN       NaN  0.441373  0.923336  0.017912
# OH    2015 1     0.720927  0.623426  0.779579       NaN       NaN       NaN
#            2     0.290496  0.595051  0.815390       NaN       NaN       NaN
#            3     0.308653  0.630301  0.894379       NaN       NaN       NaN
#       2020 1     0.642145  0.479133  0.620119       NaN       NaN       NaN
#            2     0.711534  0.294373  0.276065       NaN       NaN       NaN
#            3     0.814914  0.486108  0.272633       NaN       NaN       NaN
#       2025 1     0.197927  0.374185  0.544227       NaN       NaN       NaN
#            2     0.498159  0.709283  0.527194       NaN       NaN       NaN
#            3     0.642871  0.540829  0.492660       NaN       NaN       NaN

Here is the desired outcome (ignore the differing values, I am after the structure):

idx = pd.MultiIndex.from_product(
    [["OH", "IL"], np.arange(2015, 2030, 5), range(1, 4)],
    names=['state', 'year', 'hour']
)
df = pd.DataFrame({f'var{i}': np.random.rand(18) for i in range(1, 4)}, index=idx)
#                      var1      var2      var3
# state year hour                              
# OH    2015 1     0.689286  0.684138  0.013763
#            2     0.073285  0.231948  0.418913
#            3     0.021302  0.902437  0.358228
#       2020 1     0.349490  0.815960  0.621096
#            2     0.374491  0.939609  0.334709
#            3     0.308876  0.888048  0.483043
#       2025 1     0.482573  0.127392  0.581547
#            2     0.297424  0.861262  0.652554
#            3     0.162204  0.937405  0.015562
# IL    2015 1     0.586172  0.229906  0.181549
#            2     0.568151  0.984352  0.407318
#            3     0.249635  0.615331  0.059464
#       2020 1     0.128628  0.376220  0.906765
#            2     0.907705  0.257347  0.875211
#            3     0.280862  0.118245  0.658864
#       2025 1     0.277835  0.899616  0.037291
#            2     0.183380  0.814610  0.159156
#            3     0.444279  0.286332  0.292053

Upvotes: 1

Views: 518

Answers (1)

Scott Boston
Scott Boston

Reputation: 153460

Here's one way using reduce from functools and pd.DataFrame.combine_first:

from functools import reduce

reduce(lambda x, y: x.combine_first(y), dats)

Output:

                     var1      var2      var3
state year hour                              
IL    2015 1     0.879976  0.182038  0.890930
           2     0.048066  0.085025  0.225444
           3     0.857504  0.374792  0.257174
      2020 1     0.896210  0.765692  0.268114
           2     0.292000  0.977814  0.071763
           3     0.039871  0.439834  0.524035
      2025 1     0.996841  0.542542  0.891044
           2     0.971756  0.295079  0.832378
           3     0.742600  0.006673  0.988244
OH    2015 1     0.312606  0.727440  0.293027
           2     0.809334  0.088261  0.379842
           3     0.648492  0.884049  0.079402
      2020 1     0.605916  0.329473  0.313885
           2     0.749340  0.888296  0.390865
           3     0.165285  0.723402  0.074149
      2025 1     0.205537  0.270733  0.306020
           2     0.935698  0.532955  0.258660
           3     0.295059  0.507919  0.284783

Upvotes: 2

Related Questions