Python pandas: type error in groupby

Question

I read in a CSV file as follows:

import pandas as pd

out = "M:/transitions.csv"
transitions = pd.read_csv(out)

transitions = transitions.groupby('unique_pid')

Here is what my dataframe looks like:

In [11]: transitions.head()
Out[11]:

Int64Index: 5 entries, 0 to 4
Data columns (total 24 columns):
Unnamed: 0                5  non-null values
Unnamed: 0.1              5  non-null values
Unnamed: 0.1              5  non-null values
unique_pid                5  non-null values
Unnamed: 1                5  non-null values
unique_pid.1              5  non-null values
age                       5  non-null values
age2                      5  non-null values
year                      5  non-null values
Single-family house       5  non-null values
Duplex/ 2-family house    5  non-null values
Multifamily               5  non-null values
Mobile Home/ trailer      5  non-null values
Condo                     5  non-null values
Townhouse                 5  non-null values
Other                     5  non-null values
Don't know                5  non-null values
Refused                   5  non-null values
numrooms                  5  non-null values
famsize                   5  non-null values
moved                     5  non-null values
whymoved                  5  non-null values 
seniorh                   5  non-null values
inst                      5  non-null values
dtypes: float64(10), int64(14)

And I get the following error:

TypeError: 'DataFrame' object is not callable

I checked that the key 'unique_pid' is in my dataframe with the following code:

In [8]: print 'unique_pid' in transitions
True

So it is clearly a valid key. I've used groupby many times this way before with no problems, so I'm not sure what's going wrong.

BKay · Accepted Answer

This appears to be a more subtle version of the problem in this SO question.

Essentially, while a DataFrame will happily import data with multiple identically named columns, groupby seems to choke on dataframes of that sort. Renaming the duplicated columns usually sorts things out. What's odd about your case is that most of the near-duplicate columns are not identical, merely very close (unique_pid and unique_pid.1, for example), and I don't know why that would be a problem. However, Unnamed: 0.1 does appear twice, and that may be what is breaking the groupby.

When I have to deal with data that has this kind of duplication, I recommend starting by renaming all the column headers to a sensible, unique list of names. You can do that by assigning a list of new header strings to transitions.columns, a la:

 # One unique name per column, in the original column order (24 in total).
 # After this rename, group by 'unique_pid_A' rather than 'unique_pid'.
 transitions.columns = ['Unnamed_A', 'Unnamed_B', 'Unnamed_C', 'unique_pid_A',
                        'Unnamed_D', 'unique_pid_B', 'age', 'age2', 'year',
                        'Single-family house', 'Duplex_2_family hou', 'Multifamily',
                        'Mobile Home/ trailer', 'Condo', 'Townhouse', 'Other',
                        "Don't know", 'Refused', 'numrooms', 'famsize', 'moved',
                        'whymoved', 'seniorh', 'inst']
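
If typing out every header by hand is impractical, the same idea can be sketched programmatically. The snippet below is only an illustration under stated assumptions: make_unique is a hypothetical helper (not part of pandas), and the small frame stands in for the real transitions data. It lists the repeated column names with Index.duplicated and then appends a numeric suffix to each repeat before grouping.

```python
import pandas as pd


def make_unique(names):
    """Hypothetical helper: append _1, _2, ... to names that repeat."""
    used = set()
    result = []
    for name in names:
        candidate = name
        suffix = 1
        # Keep bumping the suffix until the candidate has not been used yet.
        while candidate in used:
            candidate = f"{name}_{suffix}"
            suffix += 1
        used.add(candidate)
        result.append(candidate)
    return result


# Made-up stand-in data with one repeated header, as in the question.
df = pd.DataFrame(
    [[1, 2, 10, 30], [3, 4, 20, 40]],
    columns=["Unnamed: 0.1", "Unnamed: 0.1", "unique_pid", "age"],
)

# Report which column names occur more than once.
duplicated_names = df.columns[df.columns.duplicated()].tolist()
print(duplicated_names)

# Replace the headers with a unique list, then group as usual.
df.columns = make_unique(df.columns)
grouped = df.groupby("unique_pid")
```

Checking df.columns.is_unique right after read_csv is a quick way to tell whether this cleanup is needed at all.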
