Vishnu

Reputation: 21

Why can't I assign groupby result back to the original DataFrame?

Filtering a DataFrame with apply() works as expected, but when I assign the groupby result to a new column, the new column contains NaN values (see the attached screenshot).

But if I comment out the apply() statement, I can see values in the violent_crime_count column. Why?

Data source: https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9/data

import pandas as pd

# Load data from CSV
crimes_2015_today_orig = pd.read_csv('/Users/vishnu/data/chicago_crime_dataset/Crimes_-_2015.csv', index_col='Date', parse_dates=True)

# Descriptions to filter on
various_drug_off =  ['POSS: CANNABIS 30GMS OR LESS', 'POSS: HEROIN(WHITE)']

crimes_2015_drug_possession = crimes_2015_today_orig.copy()
crimes_2015_drug_possession['drug_possession'] = ''
crimes_2015_drug_possession = crimes_2015_drug_possession[crimes_2015_drug_possession.Description.apply(lambda x : x in various_drug_off)]

crimes_2015_drug_possession['drug_possession'] = crimes_2015_drug_possession.groupby(pd.TimeGrouper('D')).count()

# Create another column holding the daily count of violent crimes, based on the Arrest column.
crimes_2015_drug_possession['violent_crime_count'] = ''
crimes_2015_drug_possession['violent_crime_count'] = crimes_2015_drug_possession[crimes_2015_drug_possession.Arrest == True].groupby(pd.TimeGrouper('D')).count()

[Screenshot not reproduced]

Upvotes: 0

Views: 497

Answers (1)

cs95

Reputation: 402613

Data taken from https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9/data

For the first part, I'd recommend Series.isin instead of apply; it is vectorized and much faster:

m = crimes_2015_drug_possession.Description.isin(various_drug_off)
m.head(5)
Date
2015-01-01 00:00:00    False
2015-11-24 17:30:00    False
2015-05-19 01:12:00    False
2015-01-01 00:00:00    False
2015-06-24 06:00:00     True
Name: Description, dtype: bool

crimes_2015_drug_possession['drug_possession'] = m
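
As a rough check that isin builds the same mask as the apply/lambda version, here is a minimal sketch on a made-up three-row frame. The toy data and the names mask_apply, mask_isin and drug_rows are illustrative assumptions, not part of the Chicago dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for the Description column.
df = pd.DataFrame(
    {
        "Description": [
            "POSS: HEROIN(WHITE)",
            "SIMPLE",
            "POSS: CANNABIS 30GMS OR LESS",
        ]
    }
)
various_drug_off = ["POSS: CANNABIS 30GMS OR LESS", "POSS: HEROIN(WHITE)"]

# Row-by-row membership test, as in the question.
mask_apply = df["Description"].apply(lambda x: x in various_drug_off)

# Vectorized membership test.
mask_isin = df["Description"].isin(various_drug_off)

# Either mask can be used to keep only the matching rows.
drug_rows = df[mask_isin]
```

Both masks select the same rows here; the difference is only in how the membership test is carried out.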

For the groupby operation, observe:

crimes_2015_drug_possession[crimes_2015_drug_possession.Arrest == True].groupby(pd.TimeGrouper('D')).count().shape
(365, 21)

Notice that the result is a DataFrame with 21 columns, yet you are trying to assign it to a single column. I believe what you wanted was to count the arrests per day:

c = crimes_2015_drug_possession.groupby(pd.TimeGrouper('D')).Arrest.count()
c.head(5)     
Date
2015-01-01    1092
2015-01-02     671
2015-01-03     648
2015-01-04     513
2015-01-05     520
Freq: D, Name: Arrest, dtype: int64

This is now a single column. However, compare the shapes:

c.shape
(365,)

crimes_2015_drug_possession.shape
(263447, 21)

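Their sizes are unequal: the grouped result has one row per day, while the original has one row per incident. When you assign a Series to a column, pandas aligns it on the index, and every row whose index value has no match in the Series gets NaN. The grouped result is indexed by midnight of each day, so only rows stamped exactly 00:00:00 can match. This is why the result of the groupby operation cannot be assigned directly back to the original.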
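
To make the alignment behaviour concrete, here is a small sketch on invented data (four incidents over two days). It assumes a pandas version that has pd.Grouper(freq='D'), which stands in for pd.TimeGrouper('D') since the latter is no longer available in later pandas releases. The column names aligned and arrests_that_day are hypothetical. The last step shows one possible way to broadcast the daily count to every row, by looking up each row's day in the grouped result:

```python
import pandas as pd

# Hypothetical toy data: four incidents over two days, indexed by timestamp.
idx = pd.to_datetime(
    ["2015-01-01 00:00", "2015-01-01 09:30", "2015-01-02 14:00", "2015-01-02 18:45"]
)
df = pd.DataFrame({"Arrest": [True, False, True, True]}, index=idx)
df.index.name = "Date"

# Daily count of rows where Arrest is True; the result is indexed by
# midnight of each day (2015-01-01 -> 1, 2015-01-02 -> 2).
daily = df[df["Arrest"]].groupby(pd.Grouper(freq="D"))["Arrest"].count()

# Direct assignment aligns on the index, so only the row stamped
# exactly 00:00 finds a match; the other rows get NaN.
df["aligned"] = daily

# Look up each row's day (its timestamp truncated to midnight) in the
# daily result, then assign positionally to bypass index alignment.
df["arrests_that_day"] = daily.reindex(df.index.normalize()).to_numpy()
```

Whether a per-row daily count is what you want depends on the analysis; if you only need one number per day, keeping the grouped result as its own DataFrame is usually simpler.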

Upvotes: 1
