Sam Beard
Sam Beard

Reputation: 100

Setting multiple columns at once with MultiIndex Dataframe

I want to be able to set multiple columns at once using a top level column indexer in a DataFrame with a MultiIndex column structure.

Data Definition

I am performing data processing using a DataFrame which has a number of columns structured in a two level multi index. I do this so that I can group related variables and access them easily. The section below gives a general overview of what this DataFrame looks like:

import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10), columns = pd.MultiIndex.from_arrays([["Input"]*3+["Output"]*3+["Meta"],[*"XYZ"]*2+["ID"]]))
df["Input"] = np.random.rand(10,3)
df["Output"] = np.random.rand(10,3)
df["Meta"] = ["a"]*4 + ["b"]*6
print(df)

and the output:

      Input                        Output                     Meta
          X         Y         Z         X         Y         Z   ID
0  0.360439  0.604642  0.255683  0.662775  0.627135  0.920299    a
1  0.462851  0.695910  0.705533  0.969508  0.568267  0.878743    a
2  0.926175  0.792516  0.191144  0.194397  0.152643  0.837344    a
3  0.928863  0.385174  0.452903  0.267481  0.924337  0.264785    a
4  0.244422  0.961650  0.557120  0.441052  0.596733  0.676364    b
5  0.462019  0.226703  0.725212  0.526478  0.660846  0.938667    b
6  0.069525  0.658742  0.288887  0.892227  0.907018  0.784265    b
7  0.338952  0.859036  0.854475  0.562198  0.857035  0.984735    b
8  0.560532  0.117619  0.227673  0.048954  0.987209  0.066062    b
9  0.265070  0.960674  0.984968  0.348832  0.757715  0.175176    b

Problem Statement

As you can see from the previous code sample I use df["Input"] to set multiple columns at once. A piece of syntax which is very convenient as it provides me with a higher level abstraction. This might be convenient if later for instance I add another dimension to Input or Output, for instance Magnitude. I won't need to update my logic there.

This works fine if the (top-level) columns are already defined. However if I want to add more columns using the top-level column indexer, I get a ValueError.

df["Diff"] = df["Input"] - df["Input"].shift(1)
      6 df["Meta"] = ["a"]*4 + ["b"]*6
      7 
----> 8 df["Diff"] = df["Input"] - df["Input"].shift(1)
      9 print(df)

ValueError: Cannot set a DataFrame with multiple columns to the single column Diff

I understand the expression returns a DataFrame, and I understand this syntax is usually used to set only one column. However this syntax works when the columns are already defined. Furthermore, the resulting dataframe from the expression will have the X, Y and Z columns associated with them. So setting the sub level column indexes should not be a problem. Any idea why this problem exists? Or am I just missing something? The syntax seems to be inconsistent, intuitively this should also work.

Alternative

I can come up with a workaround by indexing the df with a list comprehension syntax like so:

df[[("Diff", c) for c in df["Input"].columns]] = df["Input"] - df["Input"].shift(1)

But in my opinion this syntaxes is unnecessarily verbose and are prone to update mistakes, especially when compared with the intuitive syntax I expected to be able to use. Perhaps this is just the messy nature of data science packages like pandas.

Extension: Setting a slice of MultiIndex

The obfuscated manner of indexing a MultiIndex column is only extended when performing slicing on the index. Using the same data as in the previous example, one could imagine wanting to perform something similar to the following:

# create the Diff columns and set them to NaN so that they can be accessed via .loc
df[[("Diff", c) for c in df["Input"].columns]] = np.NaN
for g in df.groupby(("Meta","ID")):
    df.loc[g[1].index, "Diff"] = g[1]["Input"] - g[1]["Input"].shift(1)
print(df)

which results in:

      Input                        Output                     Meta Diff     
          X         Y         Z         X         Y         Z   ID    X   Y  Z    
0  0.698764  0.114242  0.620051  0.537281  0.247196  0.606321    a  NaN NaN NaN    
1  0.285718  0.316356  0.942264  0.160217  0.799764  0.193138    a  NaN NaN NaN    
2  0.018924  0.597641  0.677433  0.496869  0.233845  0.987813    a  NaN NaN NaN    
3  0.899247  0.950712  0.107865  0.904222  0.358268  0.643560    a  NaN NaN NaN    
4  0.702215  0.196937  0.244960  0.728929  0.459252  0.846589    b  NaN NaN NaN    
5  0.987892  0.840522  0.558784  0.942046  0.001634  0.576351    b  NaN NaN NaN    
6  0.375301  0.435768  0.653034  0.738830  0.439018  0.920409    b  NaN NaN NaN    
7  0.031806  0.607084  0.376460  0.050816  0.312656  0.678615    b  NaN NaN NaN    
8  0.286005  0.892434  0.063766  0.263582  0.922249  0.890989    b  NaN NaN NaN    
9  0.050954  0.131088  0.150905  0.569136  0.015628  0.789097    b  NaN NaN NaN    

Which for some reason does not work. However if I do the same operation on a single column the operation does succeed, like so:

df[[("Diff", c) for c in df["Input"].columns]] = np.NaN
for g in df.groupby(("Meta","ID")):
    tmp = g[1]["Input"] - g[1]["Input"].shift(1)
    for c in tmp.columns:
        df.loc[g[1].index, ("Diff",c)] = tmp[c]

And the corresponding output:

      Input                        Output                     Meta      Diff
          X         Y         Z         X         Y         Z   ID         X         Y         Z    
0  0.518577  0.355721  0.142290  0.058178  0.411459  0.546878    a       NaN       NaN       NaN    
1  0.302887  0.388480  0.517935  0.350809  0.738605  0.690886    a -0.215691  0.032759  0.375646    
2  0.262682  0.109533  0.516536  0.624335  0.608022  0.505686    a -0.040204 -0.278947 -0.001400    
3  0.710302  0.238593  0.715624  0.976103  0.354727  0.958263    a  0.447620  0.129061  0.199088    
4  0.222968  0.612778  0.073976  0.793171  0.007743  0.891585    b       NaN       NaN       NaN    
5  0.475678  0.351055  0.453808  0.667844  0.435718  0.409245    b  0.252710 -0.261722  0.379831    
6  0.140727  0.738720  0.598426  0.807662  0.416625  0.505570    b -0.334951  0.387664  0.144618    
7  0.271660  0.886211  0.674875  0.660602  0.293730  0.711890    b  0.130933  0.147491  0.076449    
8  0.149588  0.376687  0.233415  0.845299  0.747255  0.583146    b -0.122073 -0.509524 -0.441460    
9  0.667494  0.460090  0.426279  0.839084  0.202936  0.275418    b  0.517906  0.083403  0.192864  

Perhaps this is a particular quirk with how the loc function works, but to me this is preposterous. I would love some insight into why this works like it works and what the common approach would be as this does not seem so outlandishly strange operation to perform on a data set as such.

Upvotes: 2

Views: 311

Answers (1)

Josh Friedlander
Josh Friedlander

Reputation: 11657

How about this? Not quite as seamless but a bit closer to what you want.

diff = df["Input"] - df["Input"].shift(1)
diff.columns = pd.MultiIndex.from_product([["Diff"], diff.columns])
df = pd.concat([df, diff], axis=1)

Regarding the second part of your question (really a separate question), the problem is that loc can takes a scalar, an array or list, or a DataFrame with matching indices. Pandas sees the X, Y, Z of Diff as different from the X, Y, Z of Input, and therefore sees no match. You can make it work by converting the dataframe to a numpy array:

for g in df.groupby(("Meta","ID")):
    df.loc[g[1].index, "Diff"] = (g[1]["Input"] - g[1]["Input"].shift(1)).to_numpy()

(You can also use values instead of to_numpy(), but it is not recommended, see here as to why.)

Upvotes: 4

Related Questions