Reputation: 21
Filtering a DataFrame with the apply() method works as expected, but when I assign the result to a new column, the new column has NaN values (screenshot attached). If I comment out the apply() statement, I can see the values in the violent_crime_count column. Why?
Data source: https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9/data
import pandas as pd

# Load data from CSV, using the Date column as a parsed datetime index
crimes_2015_today_orig = pd.read_csv('/Users/vishnu/data/chicago_crime_dataset/Crimes_-_2015.csv', index_col='Date', parse_dates=True)
# Create the list of Description values to filter on
various_drug_off = ['POSS: CANNABIS 30GMS OR LESS', 'POSS: HEROIN(WHITE)']
crimes_2015_drug_possession = crimes_2015_today_orig.copy()
crimes_2015_drug_possession['drug_possession'] = ''
crimes_2015_drug_possession = crimes_2015_drug_possession[crimes_2015_drug_possession.Description.apply(lambda x : x in various_drug_off)]
crimes_2015_drug_possession['drug_possession'] = crimes_2015_drug_possession.groupby(pd.TimeGrouper('D')).count()
# Create another column holding the daily count of violent crimes, based on the Arrest column.
crimes_2015_drug_possession['violent_crime_count'] = ''
crimes_2015_drug_possession['violent_crime_count'] = crimes_2015_drug_possession[crimes_2015_drug_possession.Arrest == True].groupby(pd.TimeGrouper('D')).count()
Upvotes: 0
Views: 497
Reputation: 402613
Data taken from https://data.cityofchicago.org/Public-Safety/Crimes-2015/vwwp-7yr9/data
For the first part, I'd recommend using Series.isin instead of apply() with a lambda; it's much faster:
m = crimes_2015_drug_possession.Description.isin(various_drug_off)
m.head(5)
Date
2015-01-01 00:00:00 False
2015-11-24 17:30:00 False
2015-05-19 01:12:00 False
2015-01-01 00:00:00 False
2015-06-24 06:00:00 True
Name: Description, dtype: bool
crimes_2015_drug_possession['drug_possession'] = m
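To make the equivalence concrete, here is a minimal sketch on a made-up four-row frame (the rows and timestamps are invented for illustration and are not taken from the Chicago data). It only checks that isin produces the same boolean mask as the apply() version, and that the mask can be used for filtering as in the question:

```python
import pandas as pd

various_drug_off = ['POSS: CANNABIS 30GMS OR LESS', 'POSS: HEROIN(WHITE)']

# Hypothetical stand-in for the real frame: a Description column on a datetime index.
df = pd.DataFrame(
    {'Description': ['SIMPLE',
                     'POSS: HEROIN(WHITE)',
                     'POSS: CANNABIS 30GMS OR LESS',
                     'RETAIL THEFT']},
    index=pd.to_datetime(['2015-01-01 00:00', '2015-01-01 09:30',
                          '2015-01-02 14:00', '2015-01-02 18:45']),
)

# The asker's approach: call a Python lambda once per row.
mask_apply = df.Description.apply(lambda x: x in various_drug_off)

# The vectorized equivalent.
mask_isin = df.Description.isin(various_drug_off)

# Both masks flag exactly the same rows.
same = mask_apply.equals(mask_isin)

# Either mask can be used to filter the frame.
filtered = df[mask_isin]
```

Storing the mask as a column (as above) keeps every row, whereas indexing with it, as the question does, drops the non-matching rows.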
For the groupby operation, observe:
# pd.Grouper(freq='D') replaces pd.TimeGrouper('D'), which was removed in pandas 1.0
crimes_2015_drug_possession[crimes_2015_drug_possession.Arrest == True].groupby(pd.Grouper(freq='D')).count().shape
(365, 21)
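Notice that the result is not a single column but a DataFrame with 21 columns, and you are trying to assign it to a single column. Now, I believe what you wanted was to count the number of Arrests per day:
# pd.Grouper(freq='D') replaces pd.TimeGrouper('D'), which was removed in pandas 1.0
c = crimes_2015_drug_possession.groupby(pd.Grouper(freq='D')).Arrest.count()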
c.head(5)
Date
2015-01-01 1092
2015-01-02 671
2015-01-03 648
2015-01-04 513
2015-01-05 520
Freq: D, Name: Arrest, dtype: int64
This is still one column, however...
c.shape
(365,)
crimes_2015_drug_possession.shape
(263447, 21)
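Their lengths are unequal. When you assign a Series to a column, pandas aligns it on the index rather than by position, and every row whose index label has no match in the Series gets NaN. Here the groupby result is indexed by day, while the original frame is indexed by full timestamps, so only rows stamped exactly at midnight can match. The result of the groupby operation therefore cannot be assigned directly back to the original.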
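If the goal is to have the daily figure repeated on every row of that day, one option is groupby(...).transform, which returns a result with the same index as its input. The sketch below uses a made-up four-row frame, and it assumes you want the number of rows with Arrest == True per day (hence sum on the flags cast to int, rather than count, which counts all non-null rows). The column names aligned and arrests_that_day are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the real frame: an Arrest flag on a datetime index.
df = pd.DataFrame(
    {'Arrest': [True, False, True, True]},
    index=pd.to_datetime(['2015-01-01 00:00', '2015-01-01 09:30',
                          '2015-01-02 14:00', '2015-01-02 18:45']),
)
df.index.name = 'Date'

# Cast the flags to int so that summing them counts the True values.
arrest_int = df['Arrest'].astype(int)

# One value per calendar day; the result is indexed by midnight timestamps.
daily = arrest_int.groupby(pd.Grouper(freq='D')).sum()

# Direct assignment aligns on index labels: only the row stamped exactly
# at midnight finds a match, and every other row becomes NaN.
df['aligned'] = daily

# transform broadcasts each day's total back onto that day's rows,
# so the result has the same index as the original frame.
df['arrests_that_day'] = arrest_int.groupby(pd.Grouper(freq='D')).transform('sum')
```

Alternatively, keep the daily counts in their own, smaller DataFrame instead of forcing them back into the row-level data.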
Upvotes: 1