Reputation: 39
Using Python3 and Anaconda, I have pandas and os imported on ipython. I have an extremely large csv file. After using read_csv on the file, I try to use .groupby() on two columns, but it changes the data type from DataFrame to DataFrameGroupBy, and I can no longer run data frame methods on it.
I can't think of anything to try. I have very little experience with pandas, gained through codecademy. My code appears to work there.
import os
import pandas as pd
totals = pd.read_csv('filename')
band_gaps = totals.groupby(['column1','column2'])
band_gaps.info()
AttributeError: Cannot access callable attribute 'info' of
'DataFrameGroupBy' objects, try using the 'apply' method
type(band_gaps)
pandas.core.groupby.generic.DataFrameGroupBy
I expect that when I run band_gaps.info(), it provides me with the info for the data frame. Instead, it gives me an error. When I check band_gaps' type, it is no longer a dataframe, and is instead a DataFrameGroupBy.
Upvotes: 0
Views: 3144
Reputation: 73
If you look at the Pandas groupby documentation, you'll see that it returns a DataFrameGroupBy or SeriesGroupBy object, depending on whether you called .groupby on a DataFrame or a Series. So the behavior you've observed shouldn't be surprising.
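A quick way to confirm this is to check the type of each result. The sketch below uses a small made-up DataFrame in place of your CSV; the "value" column and the data are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for the large CSV file.
totals = pd.DataFrame({
    "column1": ["a", "a", "b", "b"],
    "column2": ["x", "y", "x", "x"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Grouping a DataFrame gives a DataFrameGroupBy object.
frame_groups = totals.groupby(["column1", "column2"])

# Grouping a Series gives a SeriesGroupBy object.
series_groups = totals["value"].groupby(totals["column1"])

print(type(frame_groups).__name__)
print(type(series_groups).__name__)
```

Neither object is a DataFrame yet, which is why DataFrame methods such as .info() are not available on them.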
More importantly, why does Pandas do that? In your case you're grouping a bunch of rows together. Pandas can hold on to a representation of the grouped DataFrame, but it can't return it to you as another DataFrame until you apply an aggregation function like .sum or .count. An aggregation function takes each group of rows and defines some way of turning those rows into a single row. Try applying one of those aggregation functions to band_gaps and see what happens.
For example:
df.groupby('column1').mean()
will return a DataFrame with the mean of every column after grouping all rows by column1.
df.groupby('column1')['column2'].sum()
will return a Series with the sum of the values in column2 after grouping by column1. Note that
df.groupby('column1').sum()['column2']
also works, but in that case you're selecting the column you're interested in after you've aggregated over all columns, which is slower than slicing before aggregating.
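Here is a minimal sketch of both points, again on made-up data (the "value" and "other" columns are assumptions, not names from your file). It aggregates the two-column grouping so that .info() works, then compares slicing before and after aggregating. The numeric_only=True argument is there because the toy frame also holds a string column; whether you need it depends on your data and pandas version.

```python
import pandas as pd

# Hypothetical toy data standing in for the large CSV file.
totals = pd.DataFrame({
    "column1": ["a", "a", "b", "b"],
    "column2": ["x", "y", "x", "x"],
    "value": [1.0, 2.0, 3.0, 4.0],
    "other": [10, 20, 30, 40],
})

# Aggregating turns the grouped object back into a regular DataFrame,
# so DataFrame methods such as .info() are available again.
band_gaps = totals.groupby(["column1", "column2"]).sum()
band_gaps.info()

# Slice first: only the "value" column is aggregated.
sliced_first = totals.groupby("column1")["value"].sum()

# Slice last: every numeric column is aggregated, then one is picked.
sliced_last = totals.groupby("column1").sum(numeric_only=True)["value"]

# Both routes give the same Series; slicing first just does less work.
print(sliced_first.equals(sliced_last))
```

On a very large file, the slice-first form is the one to prefer, since it avoids summing columns you then discard.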
Upvotes: 1