GroupBy Method Changing DataType

Question

Using Python 3 and Anaconda, I have pandas and os imported in IPython. I have an extremely large CSV file. After calling read_csv on the file, I try to use .groupby() on two columns, but the result is a DataFrameGroupBy instead of a DataFrame, and I can no longer run DataFrame methods on it.

I can't think of anything to try. I have very little experience with pandas, gained through codecademy. My code appears to work there.

import os
import pandas as pd

totals = pd.read_csv('filename')

band_gaps = totals.groupby(['column1','column2'])

band_gaps.info()
AttributeError: Cannot access callable attribute 'info' of 'DataFrameGroupBy' objects, try using the 'apply' method

type(band_gaps)
pandas.core.groupby.generic.DataFrameGroupBy

I expected band_gaps.info() to give me the info for the data frame. Instead, it raises an error. When I check the type of band_gaps, it is no longer a DataFrame but a DataFrameGroupBy.

Ben · Accepted Answer

If you look at the Pandas groupby documentation you'll see that it returns a DataFrameGroupBy or SeriesGroupBy object, depending on whether you called .groupby on a DataFrame or a Series. So the behavior you've observed shouldn't be surprising.

More importantly, why does Pandas do that? In your case you're grouping a set of rows together. Pandas can hold on to a representation of the grouped DataFrame, but it can't hand it back to you as another DataFrame until you apply an aggregation function such as .sum or .count. An aggregation function takes each group of rows and defines a way of turning those rows into a single row. Try applying one of those aggregation functions to band_gaps and see what happens.
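As a rough sketch of that idea, the snippet below builds a small made-up DataFrame in place of the CSV from the question (the `value` column and all of the data are assumptions for illustration only), groups it by two columns, and then aggregates. The grouped object is a DataFrameGroupBy, while the aggregated result is an ordinary DataFrame on which .info() works again.

```python
import pandas as pd

# Hypothetical stand-in for the large CSV in the question
totals = pd.DataFrame({
    "column1": ["a", "a", "b", "b"],
    "column2": ["x", "x", "y", "z"],
    "value": [1.0, 2.0, 3.0, 4.0],  # assumed numeric column
})

# Grouping alone only records how the rows are split up
band_gaps = totals.groupby(["column1", "column2"])
print(type(band_gaps).__name__)

# Aggregating collapses each group to one row and returns a DataFrame
summed = band_gaps.sum()
print(type(summed).__name__)
summed.info()

# Optional: turn the group keys back into regular columns
flat = summed.reset_index()
```

After aggregation the group keys become the index of the result; reset_index() is one way to get them back as ordinary columns if that is more convenient.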

For example:

df.groupby('column1').mean()

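will return a DataFrame containing the mean of each remaining column, with one row per distinct value of column1. (In recent pandas versions, non-numeric columns may need to be dropped first or excluded with numeric_only=True.)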

df.groupby('column1')['column2'].sum()

will return a Series with the sum of the values in column2 after grouping by column1. Note that

df.groupby('column1').sum()['column2']

also works, but in that case you're selecting the column you're interested in after you've aggregated over all columns, which is slower than selecting it before aggregating.
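To make the comparison concrete, here is a minimal sketch on a small made-up DataFrame (the data and the extra `column3` are assumptions for illustration). Both orderings produce the same Series; the difference is only in how much work is done, since the second form sums every column before discarding all but one.

```python
import pandas as pd

# Hypothetical data; column3 exists only to show the wasted work
df = pd.DataFrame({
    "column1": ["a", "a", "b"],
    "column2": [1, 2, 3],
    "column3": [10.0, 20.0, 30.0],
})

# Select the column first, then aggregate only that column
select_first = df.groupby("column1")["column2"].sum()

# Aggregate every column, then pick one out of the result
aggregate_first = df.groupby("column1").sum()["column2"]

# Same values either way
print(select_first.equals(aggregate_first))
```

On a very large file with many columns, selecting first avoids summing columns you never look at.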
