bison72

Reputation: 324

Self defined function works on whole data frame but not on grouped data frame (applied with 'groupby' function) in Python

I have a simple data frame. I want to group it by column 'A' and generate a new column calculated by a function I defined (it loops internally) that takes values from columns 'B' and 'C'. My problem is that I can apply the function to the whole data frame but not to the grouped data frame (Exception: Column(s) B already selected). I don't know why it throws an error on the grouped data frame but not on the whole data frame. My implementation is below:

>>> import pandas as pd
>>>
>>> df = pd.read_csv("foo.txt", sep="\t")
>>> df
   A  B   C
0  1  4   3
1  1  5   4
2  1  2  10
3  2  7   2
4  2  4   4
5  2  6   6
>>>
>>> def calc(data):
...         length = len(data['B'])
...         mx = data['B'][0]
...         nx = data['C'][0]
...         for i in range(1,length):
...                 my = data['B'][i]
...                 ny = data['C'][i]
...                 nx = nx + ny
...                 mx=(mx*nx+my*ny)/(nx+ny)
...         return(mx)
...
>>> df_grouped = df.groupby(['A'])
>>> calc(df)
4.217694879423274
>>> calc(df_grouped)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in calc
  File "/mnt/projects/kokep/kokep/devel/miniconda3/lib/python3.6/site-packages/pandas/core/base.py", line 250, in __getitem__
    .format(selection=self._selection))
Exception: Column(s) B already selected
>>>

How can I get it to work? Thanks in advance.

Upvotes: 0

Views: 59

Answers (2)

bison72

Reputation: 324

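I figured out the problem. I think reset_index needs to be applied to each group, since calc looks up rows by label starting at 0 and each group keeps the row labels it had in the original data frame: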

>>> import pandas as pd
>>>
>>> df = pd.read_csv("foo.txt", sep="\t")
>>> df
   A  B   C
0  1  4   3
1  1  5   4
2  1  2  10
3  2  7   2
4  2  4   4
5  2  6   6
>>>
>>> def calc(data):
...         length = len(data['B'])
...         mx = data['B'][0]
...         nx = data['C'][0]
...         for i in range(1,length):
...                 my = data['B'][i]
...                 ny = data['C'][i]
...                 nx = nx + ny
...                 mx=(mx*nx+my*ny)/(nx+ny)
...         return(mx)
...
>>> result = []
>>> for name, group in df.groupby('A'):
...         group = group.reset_index(drop=True)  # relabel rows 0..n-1 so data['B'][0] works
...         out = calc(group)
...         result.append(out)
...
>>> result
[3.488215488215488, 5.866666666666666]
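
As a possible alternative, here is a minimal sketch that avoids reset_index by reading the columns by position (via to_numpy()) and letting groupby(...).apply call the function once per group. The inline data frame stands in for foo.txt, and calc_positional is a hypothetical name used only for this illustration; it is not part of the code above.

```python
import pandas as pd

# Inline copy of the sample data shown above (stands in for foo.txt).
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2],
    "B": [4, 5, 2, 7, 4, 6],
    "C": [3, 4, 10, 2, 4, 6],
})

def calc_positional(group):
    """Same recurrence as calc, but indexed by position instead of by row label."""
    b = group["B"].to_numpy()
    c = group["C"].to_numpy()
    mx, nx = float(b[0]), float(c[0])
    for my, ny in zip(b[1:], c[1:]):
        nx = nx + ny
        mx = (mx * nx + my * ny) / (nx + ny)
    return mx

# Select only the columns the function needs, then apply it to each group.
result = df.groupby("A")[["B", "C"]].apply(calc_positional)
print(result)
```

The result is a Series indexed by the values of 'A', so result.loc[1] and result.loc[2] should correspond to the two numbers in the list above.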

Upvotes: 1

Vishwas

Reputation: 351

I think your groupby is producing a pandas.Series, and your function is not being applied to that series. I tried different groupby methods, but for some reason none of them worked. Once I find a solution, I will post it here.
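
One way to check this guess is to inspect what groupby actually returns. The sketch below assumes the sample data from the question, rebuilt inline instead of read from foo.txt; the variable names are only illustrative.

```python
import pandas as pd

# Inline copy of the sample data from the question (stands in for foo.txt).
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2],
    "B": [4, 5, 2, 7, 4, 6],
    "C": [3, 4, 10, 2, 4, 6],
})

grouped = df.groupby("A")
grouped_type = type(grouped).__name__      # DataFrameGroupBy
column_type = type(grouped["B"]).__name__  # SeriesGroupBy

# Iterating yields (key, sub-DataFrame) pairs; each sub-frame keeps the
# row labels it had in the original data frame.
index_labels = {key: list(group.index) for key, group in grouped}
print(grouped_type, column_type, index_labels)
```

Neither object is a plain Series or DataFrame, so indexing such as data['B'][0] does not select a row on them. The second group carries row labels 3, 4, 5, which is why a label lookup of 0 fails on it unless the index is reset.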

Upvotes: 0
