Reputation: 23
I need to drop the values of the feature 'county' that are unique to a single row, since they have no value in my machine learning process.
However, the code below is not removing the rows with unique county values; they are still in my dataset. Help!
# counting unique values
n = len(pd.unique(data['county']))
print("No.of.unique values :", n)
data[data.groupby('county')['county'].transform('size') > 1]
data
state | county |
---|---|
AL | Barbour County |
AL | Barbour County |
WY | Sweetwater County |
I also tried
data = data[data.duplicated(subset=['county'], keep=False)]
but had no luck.
Upvotes: 1
Views: 4243
Reputation: 126
Both snippets that you posted should remove the rows with unique values for county:
data[data.groupby('county')['county'].transform('size') > 1]
I would make a small correction to your snippet by assigning the result to a variable:
non_unique_data = data[data.groupby('county')['county'].transform('size') > 1]
The same goes for your second snippet:
non_unique_data = data[data.duplicated(subset=['county'], keep=False)]
Now when you check the value of
non_unique_data
you'll see that it has kept only the rows with duplicate values in county.
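I think you're having issues because your first snippet never assigns its result: the filter returns a new DataFrame and leaves the original `data` unchanged, so printing `data` afterwards still shows every row.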
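Here is a minimal sketch of the difference, assuming a small made-up DataFrame built from the three rows shown in the question (the name `rows_before` and the sample frame are only for illustration):

```python
import pandas as pd

# Hypothetical sample data mirroring the rows in the question.
data = pd.DataFrame({
    "state": ["AL", "AL", "WY"],
    "county": ["Barbour County", "Barbour County", "Sweetwater County"],
})

# Boolean indexing returns a new DataFrame; without an assignment
# the result is discarded and `data` is left untouched.
data[data.groupby("county")["county"].transform("size") > 1]
rows_before = len(data)  # `data` still has all of its rows

# Assigning the result keeps the filtered rows.
non_unique_data = data[data.groupby("county")["county"].transform("size") > 1]

# The same filter with duplicated(); keep=False flags every row of a duplicate group.
non_unique_dup = data[data.duplicated(subset=["county"], keep=False)]

print(non_unique_data)
```

Both filters keep only the rows whose county appears more than once. If you want to overwrite the original, assign back to `data` instead of a new name.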
Upvotes: 2