Shortest way of splitting a pandas DataFrame column based on another column

Question

Inspiration

In R, this is very easy:

data("iris")
bartlett.test(Sepal.Length ~ Species,data = iris)

What matters about the data set is that the Sepal.Length column is numerical and the Species column is categorical.

Problem

In Python, scipy.stats.bartlett needs a separate array for each species (see the docs).

What would be the easiest way to achieve this?

An easy way to get the dataset in python:

import numpy as np
import pandas as pd
import scipy.stats as ss
from sklearn import datasets

iris = datasets.load_iris()
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=["sepal.length", "sepal.width", "petal.length", "petal.width"] + ['species'])

I really wanted this to work:

iris.groupby("species")["sepal.length"].apply(ss.bartlett)

but it didn't, because apply calls the function once per group with a single Series, whereas bartlett needs all the sample vectors in one call.

Sven Harris · Accepted Answer

Following the groupby pattern you can do a bit of manipulation and do this:

gb = iris.groupby('species')["sepal.length"]
ss.bartlett(*[gb.get_group(x).values for x in gb.groups])

The * unpacks the list into separate positional arguments; the rest just gets the groups into the form the function expects. As mentioned in the comments, the .values isn't needed here, so this can be written as:

gb = iris.groupby('species')["sepal.length"]
ss.bartlett(*[gb.get_group(x) for x in gb.groups])
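
To make the unpacking concrete, here is a minimal self-contained sketch. It uses a synthetic three-group frame instead of iris, so the group labels "a", "b", "c" and the generated values are assumptions made purely for illustration. It checks that the starred call matches passing each group by hand:

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

# Synthetic stand-in for iris: three groups with different spreads (made-up data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "species": np.repeat(["a", "b", "c"], 50),
    "sepal.length": np.concatenate([
        rng.normal(5.0, 0.3, 50),
        rng.normal(5.9, 0.5, 50),
        rng.normal(6.6, 0.6, 50),
    ]),
})

gb = df.groupby("species")["sepal.length"]

# One Series per group label, in the order the labels appear in gb.groups.
samples = [gb.get_group(name) for name in gb.groups]

# The star spreads the list into separate positional arguments ...
unpacked = ss.bartlett(*samples)
# ... which is the same as writing each sample out explicitly.
explicit = ss.bartlett(samples[0], samples[1], samples[2])

print(unpacked.statistic, unpacked.pvalue)
```

The advantage of the starred form is that it keeps working when the number of groups changes, whereas the explicit call has to be edited by hand.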

And just for completeness, if you really want to do it in one line:

ss.bartlett(*[x[1] for x in iris.groupby('species')["sepal.length"]])

But I personally find that less readable.
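
If the one-liner is needed in several places, one option is to wrap it in a small helper that reads a little like the R formula call. This is only a sketch: bartlett_by_group is a hypothetical name, not part of pandas or scipy, and the toy frame below is made-up data. Iterating over a grouped column yields (label, Series) pairs, so tuple unpacking in the comprehension can replace the x[1] indexing:

```python
import numpy as np
import pandas as pd
import scipy.stats as ss


def bartlett_by_group(df, value_col, group_col):
    """Run Bartlett's test on value_col, one sample per level of group_col.

    Hypothetical convenience wrapper, not a pandas or scipy function.
    """
    # Each iteration yields (group label, Series of values); keep only the values.
    samples = [values for _, values in df.groupby(group_col)[value_col]]
    return ss.bartlett(*samples)


# Made-up data with numeric species codes, as in the DataFrame built in the question.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "species": np.repeat([0.0, 1.0, 2.0], 40),
    "sepal.length": rng.normal(5.8, 0.8, 120),
})

result = bartlett_by_group(toy, "sepal.length", "species")
print(result.statistic, result.pvalue)
```

With the iris frame from the question, the call would be bartlett_by_group(iris, "sepal.length", "species").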
