Reputation: 83
I have a dataframe where one column is a categorical variable with the following labels: ['Short', 'Medium', 'Long', 'Very Long', 'Extremely Long']. I am trying to create a new dataframe that drops all the rows that are Extremely Long.
I have tried doing this in the following ways:
df2 = df.query('ride_type != "Extremely Long"')
df2 = df[df['ride_type'] != 'Extremely Long']
However, when I run .value_counts() I get the following:
>>> df2.ride_type.value_counts()
Short 130474
Long 129701
Medium 129607
Very Long 110988
Extremely Long 0
Name: ride_type, dtype: int64
In other words, Extremely Long is still listed (with a count of 0), so I can't plot charts with just the four categories I want.
Upvotes: 6
Views: 4131
Reputation: 7594
You could drop rows like this:
df = df.drop(df.index[df['A'] == 'cat'])
print(df['A'].value_counts())
dog 2
rabbit 2
Name: A, dtype: int64
Upvotes: 0
Reputation: 402493
This is a feature of categorical data. You may have something that looks like this:
import pandas as pd
df = pd.DataFrame({'ride_type': pd.Categorical(
    ['Long', 'Long'], categories=['Long', 'Short'])})
df
ride_type
0 Long
1 Long
Calling value_counts on a categorical column will report counts for all categories, not just the ones present.
df['ride_type'].value_counts()
Long 2
Short 0
Name: ride_type, dtype: int64
The solution is to either remove unused categories, or convert to string:
df['ride_type'].cat.remove_unused_categories().value_counts()
Long 2
Name: ride_type, dtype: int64
# or,
df['ride_type'].astype(str).value_counts()
Long 2
Name: ride_type, dtype: int64
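Applied to the dataframe in the question, this might look like the sketch below. The sample rows are made up for illustration (the real data is not shown in the question); only the column name ride_type and the five labels come from the original post.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's dataframe.
labels = ['Short', 'Medium', 'Long', 'Very Long', 'Extremely Long']
df = pd.DataFrame({
    'ride_type': pd.Categorical(
        ['Short', 'Long', 'Extremely Long', 'Medium', 'Very Long', 'Short'],
        categories=labels,
    )
})

# Filter the rows first; the 'Extremely Long' category is still registered
# on the column even though no rows use it any more.
df2 = df[df['ride_type'] != 'Extremely Long'].copy()

# Drop the categories that no longer occur in the column.
df2['ride_type'] = df2['ride_type'].cat.remove_unused_categories()

counts = df2['ride_type'].value_counts()
print(counts)
```

After the reassignment, value_counts() and most plotting calls should only see the four remaining categories. The .copy() is there to avoid modifying a slice of the original df.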
Upvotes: 11