bob

Reputation: 23

How to keep ONLY duplicated values in Pandas Dataframe?

I need to drop the rows where the feature 'county' has a unique value, since those rows have no value in my machine learning process.

However, the code below is not removing the rows with unique county values; they are still in my dataset. Help!

# counting unique values
n = len(pd.unique(data['county']))

print("No.of.unique values :", n)


data[data.groupby('county')['county'].transform('size') > 1]
data
state county
AL Barbour County
AL Barbour County
WY Sweetwater County

I also tried

data = data[data.duplicated(subset=['county'], keep=False)]

but no luck either.

Upvotes: 1

Views: 4243

Answers (1)

seoboss

Reputation: 126

Both snippets that you posted should remove the rows with unique values for county:

data[data.groupby('county')['county'].transform('size') > 1]

I would make a small correction to your snippet by assigning the result to a variable:

non_unique_data = data[data.groupby('county')['county'].transform('size') > 1]

The same goes for your second snippet:

non_unique_data = data[data.duplicated(subset=['county'], keep=False)]

Now when you check the value of

non_unique_data

you'll see that it has kept only the rows with duplicate values in county.

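I think you're having issues because your first snippet never stores its result. Boolean indexing returns a new DataFrame and leaves data unchanged, so printing data afterwards still shows the original rows. Assigning to a new variable also keeps the original DataFrame around for comparison.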
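To make this concrete, here is a minimal sketch of both approaches. The sample DataFrame is an assumption, rebuilt from the three rows shown in the question, and the names by_size and by_duplicated are only illustrative:

```python
import pandas as pd

# Assumed sample data, mirroring the three rows shown in the question
data = pd.DataFrame({
    "state": ["AL", "AL", "WY"],
    "county": ["Barbour County", "Barbour County", "Sweetwater County"],
})

# Each expression returns a NEW DataFrame; `data` itself is not modified
by_size = data[data.groupby("county")["county"].transform("size") > 1]
by_duplicated = data[data.duplicated(subset=["county"], keep=False)]

print(by_size)                   # only the two Barbour County rows remain
print(len(data), len(by_size))   # original row count vs. filtered row count
```

If the filtered frame still contains rows you expected to be dropped, comparing len(data) with len(by_size) is a quick way to confirm whether the filter ran on the DataFrame you think it did.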

Upvotes: 2
