Reputation: 2101
I am working with sets of data that essentially provide me a list of MultiIndex DataFrames, each with one column. The data is indexed by State, Year, and Hour. I am looking for a way to create one large DataFrame that holds all of the data with the index intact.
My problem: when I go to concatenate this list of DataFrames, I am ending up with duplicate columns that should really be "smushed" together.
Here is a reproducible example:
#!/usr/bin/env python3
import pandas as pd
import numpy as np
oh_vars = [f'OH-var{i}' for i in range(1, 4)]
il_vars = [f'IL-var{i}' for i in range(1, 4)]
all_vars = oh_vars + il_vars
dats = []
for var in all_vars:
    # split e.g. 'OH-var1' into the state 'OH' and the column name 'var1'
    state = (sp := var.split('-'))[0]
    col = sp[1]
    # one state x 3 years x 3 hours = 9 rows
    idx = pd.MultiIndex.from_product(
        [[state], np.arange(2015, 2030, 5), range(1, 4)],
        names=['state', 'year', 'hour']
    )
    vals = np.random.rand(9)
    df = pd.DataFrame({col: vals}, index=idx)
    dats.append(df)
pd.concat(dats, axis=1)
# var1 var2 var3 var1 var2 var3
# state year hour
# IL 2015 1 NaN NaN NaN 0.431365 0.127444 0.081975
# 2 NaN NaN NaN 0.430269 0.822969 0.014894
# 3 NaN NaN NaN 0.019409 0.418380 0.306845
# 2020 1 NaN NaN NaN 0.860500 0.887411 0.614720
# 2 NaN NaN NaN 0.094476 0.062544 0.236143
# 3 NaN NaN NaN 0.956516 0.616292 0.550528
# 2025 1 NaN NaN NaN 0.594198 0.252347 0.534134
# 2 NaN NaN NaN 0.209072 0.044660 0.763966
# 3 NaN NaN NaN 0.441373 0.923336 0.017912
# OH 2015 1 0.720927 0.623426 0.779579 NaN NaN NaN
# 2 0.290496 0.595051 0.815390 NaN NaN NaN
# 3 0.308653 0.630301 0.894379 NaN NaN NaN
# 2020 1 0.642145 0.479133 0.620119 NaN NaN NaN
# 2 0.711534 0.294373 0.276065 NaN NaN NaN
# 3 0.814914 0.486108 0.272633 NaN NaN NaN
# 2025 1 0.197927 0.374185 0.544227 NaN NaN NaN
# 2 0.498159 0.709283 0.527194 NaN NaN NaN
# 3 0.642871 0.540829 0.492660 NaN NaN NaN
Here is the desired outcome (ignore the differing values, I am after the structure):
idx = pd.MultiIndex.from_product(
[["OH", "IL"], np.arange(2015, 2030, 5), range(1, 4)],
names=['state', 'year', 'hour']
)
df = pd.DataFrame({f'var{i}': np.random.rand(18) for i in range(1, 4)}, index=idx)
# var1 var2 var3
# state year hour
# OH 2015 1 0.689286 0.684138 0.013763
# 2 0.073285 0.231948 0.418913
# 3 0.021302 0.902437 0.358228
# 2020 1 0.349490 0.815960 0.621096
# 2 0.374491 0.939609 0.334709
# 3 0.308876 0.888048 0.483043
# 2025 1 0.482573 0.127392 0.581547
# 2 0.297424 0.861262 0.652554
# 3 0.162204 0.937405 0.015562
# IL 2015 1 0.586172 0.229906 0.181549
# 2 0.568151 0.984352 0.407318
# 3 0.249635 0.615331 0.059464
# 2020 1 0.128628 0.376220 0.906765
# 2 0.907705 0.257347 0.875211
# 3 0.280862 0.118245 0.658864
# 2025 1 0.277835 0.899616 0.037291
# 2 0.183380 0.814610 0.159156
# 3 0.444279 0.286332 0.292053
Upvotes: 1
Views: 518
Reputation: 153460
Here's one way, using reduce from functools and pd.DataFrame.combine_first:
from functools import reduce

# successively fill the NaN gaps in one frame with the values from the next
reduce(lambda x, y: x.combine_first(y), dats)
Output:
var1 var2 var3
state year hour
IL 2015 1 0.879976 0.182038 0.890930
2 0.048066 0.085025 0.225444
3 0.857504 0.374792 0.257174
2020 1 0.896210 0.765692 0.268114
2 0.292000 0.977814 0.071763
3 0.039871 0.439834 0.524035
2025 1 0.996841 0.542542 0.891044
2 0.971756 0.295079 0.832378
3 0.742600 0.006673 0.988244
OH 2015 1 0.312606 0.727440 0.293027
2 0.809334 0.088261 0.379842
3 0.648492 0.884049 0.079402
2020 1 0.605916 0.329473 0.313885
2 0.749340 0.888296 0.390865
3 0.165285 0.723402 0.074149
2025 1 0.205537 0.270733 0.306020
2 0.935698 0.532955 0.258660
3 0.295059 0.507919 0.284783
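One thing to note: combine_first aligns on the union of the two indexes, so the combined frame comes back sorted by the state level (IL before OH), whereas the desired output in the question lists OH first. If that ordering matters, reindexing the state level afterwards should restore it. A minimal sketch (the combined name and the explicit state list are my assumptions, not part of the answer above):

from functools import reduce

combined = reduce(lambda x, y: x.combine_first(y), dats)

# put the 'state' level back in the question's OH-before-IL order
# (assumes both states are present in the combined frame)
combined = combined.reindex(['OH', 'IL'], level='state')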
Upvotes: 2