Reputation: 100
I want to be able to set multiple columns at once using a top level column indexer in a DataFrame with a MultiIndex column structure.
I am performing data processing using a DataFrame which has a number of columns structured in a two level multi index. I do this so that I can group related variables and access them easily. The section below gives a general overview of what this DataFrame looks like:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10), columns = pd.MultiIndex.from_arrays([["Input"]*3+["Output"]*3+["Meta"],[*"XYZ"]*2+["ID"]]))
df["Input"] = np.random.rand(10,3)
df["Output"] = np.random.rand(10,3)
df["Meta"] = ["a"]*4 + ["b"]*6
print(df)
and the output:
Input Output Meta
X Y Z X Y Z ID
0 0.360439 0.604642 0.255683 0.662775 0.627135 0.920299 a
1 0.462851 0.695910 0.705533 0.969508 0.568267 0.878743 a
2 0.926175 0.792516 0.191144 0.194397 0.152643 0.837344 a
3 0.928863 0.385174 0.452903 0.267481 0.924337 0.264785 a
4 0.244422 0.961650 0.557120 0.441052 0.596733 0.676364 b
5 0.462019 0.226703 0.725212 0.526478 0.660846 0.938667 b
6 0.069525 0.658742 0.288887 0.892227 0.907018 0.784265 b
7 0.338952 0.859036 0.854475 0.562198 0.857035 0.984735 b
8 0.560532 0.117619 0.227673 0.048954 0.987209 0.066062 b
9 0.265070 0.960674 0.984968 0.348832 0.757715 0.175176 b
As you can see from the previous code sample I use df["Input"]
to set multiple columns at once. A piece of syntax which is very convenient as it provides me with a higher level abstraction. This might be convenient if later for instance I add another dimension to Input
or Output
, for instance Magnitude
. I won't need to update my logic there.
This works fine if the (top-level) columns are already defined. However if I want to add more columns using the top-level column indexer, I get a ValueError.
df["Diff"] = df["Input"] - df["Input"].shift(1)
6 df["Meta"] = ["a"]*4 + ["b"]*6
7
----> 8 df["Diff"] = df["Input"] - df["Input"].shift(1)
9 print(df)
ValueError: Cannot set a DataFrame with multiple columns to the single column Diff
I understand the expression returns a DataFrame, and I understand this syntax is usually used to set only one column. However this syntax works when the columns are already defined. Furthermore, the resulting dataframe from the expression will have the X
, Y
and Z
columns associated with them. So setting the sub level column indexes should not be a problem. Any idea why this problem exists? Or am I just missing something? The syntax seems to be inconsistent, intuitively this should also work.
I can come up with a workaround by indexing the df with a list comprehension syntax like so:
df[[("Diff", c) for c in df["Input"].columns]] = df["Input"] - df["Input"].shift(1)
But in my opinion this syntaxes is unnecessarily verbose and are prone to update mistakes, especially when compared with the intuitive syntax I expected to be able to use. Perhaps this is just the messy nature of data science packages like pandas.
The obfuscated manner of indexing a MultiIndex column is only extended when performing slicing on the index. Using the same data as in the previous example, one could imagine wanting to perform something similar to the following:
# create the Diff columns and set them to NaN so that they can be accessed via .loc
df[[("Diff", c) for c in df["Input"].columns]] = np.NaN
for g in df.groupby(("Meta","ID")):
df.loc[g[1].index, "Diff"] = g[1]["Input"] - g[1]["Input"].shift(1)
print(df)
which results in:
Input Output Meta Diff
X Y Z X Y Z ID X Y Z
0 0.698764 0.114242 0.620051 0.537281 0.247196 0.606321 a NaN NaN NaN
1 0.285718 0.316356 0.942264 0.160217 0.799764 0.193138 a NaN NaN NaN
2 0.018924 0.597641 0.677433 0.496869 0.233845 0.987813 a NaN NaN NaN
3 0.899247 0.950712 0.107865 0.904222 0.358268 0.643560 a NaN NaN NaN
4 0.702215 0.196937 0.244960 0.728929 0.459252 0.846589 b NaN NaN NaN
5 0.987892 0.840522 0.558784 0.942046 0.001634 0.576351 b NaN NaN NaN
6 0.375301 0.435768 0.653034 0.738830 0.439018 0.920409 b NaN NaN NaN
7 0.031806 0.607084 0.376460 0.050816 0.312656 0.678615 b NaN NaN NaN
8 0.286005 0.892434 0.063766 0.263582 0.922249 0.890989 b NaN NaN NaN
9 0.050954 0.131088 0.150905 0.569136 0.015628 0.789097 b NaN NaN NaN
Which for some reason does not work. However if I do the same operation on a single column the operation does succeed, like so:
df[[("Diff", c) for c in df["Input"].columns]] = np.NaN
for g in df.groupby(("Meta","ID")):
tmp = g[1]["Input"] - g[1]["Input"].shift(1)
for c in tmp.columns:
df.loc[g[1].index, ("Diff",c)] = tmp[c]
And the corresponding output:
Input Output Meta Diff
X Y Z X Y Z ID X Y Z
0 0.518577 0.355721 0.142290 0.058178 0.411459 0.546878 a NaN NaN NaN
1 0.302887 0.388480 0.517935 0.350809 0.738605 0.690886 a -0.215691 0.032759 0.375646
2 0.262682 0.109533 0.516536 0.624335 0.608022 0.505686 a -0.040204 -0.278947 -0.001400
3 0.710302 0.238593 0.715624 0.976103 0.354727 0.958263 a 0.447620 0.129061 0.199088
4 0.222968 0.612778 0.073976 0.793171 0.007743 0.891585 b NaN NaN NaN
5 0.475678 0.351055 0.453808 0.667844 0.435718 0.409245 b 0.252710 -0.261722 0.379831
6 0.140727 0.738720 0.598426 0.807662 0.416625 0.505570 b -0.334951 0.387664 0.144618
7 0.271660 0.886211 0.674875 0.660602 0.293730 0.711890 b 0.130933 0.147491 0.076449
8 0.149588 0.376687 0.233415 0.845299 0.747255 0.583146 b -0.122073 -0.509524 -0.441460
9 0.667494 0.460090 0.426279 0.839084 0.202936 0.275418 b 0.517906 0.083403 0.192864
Perhaps this is a particular quirk with how the loc
function works, but to me this is preposterous. I would love some insight into why this works like it works and what the common approach would be as this does not seem so outlandishly strange operation to perform on a data set as such.
Upvotes: 2
Views: 311
Reputation: 11657
How about this? Not quite as seamless but a bit closer to what you want.
diff = df["Input"] - df["Input"].shift(1)
diff.columns = pd.MultiIndex.from_product([["Diff"], diff.columns])
df = pd.concat([df, diff], axis=1)
Regarding the second part of your question (really a separate question), the problem is that loc
can takes a scalar, an array or list, or a DataFrame with matching indices. Pandas sees the X, Y, Z
of Diff
as different from the X, Y, Z
of Input
, and therefore sees no match. You can make it work by converting the dataframe to a numpy array:
for g in df.groupby(("Meta","ID")):
df.loc[g[1].index, "Diff"] = (g[1]["Input"] - g[1]["Input"].shift(1)).to_numpy()
(You can also use values
instead of to_numpy()
, but it is not recommended, see here as to why.)
Upvotes: 4