Reputation: 45
I have a dataframe of the populations of cities. In that data I store the city name, county, and state. Unfortunately, my data contains duplicates of some entries. The duplicates are not exactly identical, as they may contain slightly different coordinates for the location of the city or a slightly different population, so the typical distinct()
won't work here. Is there a way to remove rows that share the same city, county, and state, regardless of whether the other variables are unique or shared?
Upvotes: 1
Views: 41
Reputation: 3791
To explain my comment and turn it into an answer, as per your request:
df[!duplicated(df[, c('city', 'county', 'state')]), ]
df[, c('city', 'county', 'state')] subsets all rows of the columns city, county, and state. A subset in R works as data_frame[row, column]. Leaving the row selector empty returns all rows in the dataframe, and since we're passing multiple columns we wrap them inside c().
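As a concrete illustration, here is a toy dataframe (the city/county/state names and population figures are made up for this sketch) and the column subset on its own:

```r
# Toy data: rows 1 and 2 describe the same city with slightly
# different populations (hypothetical numbers for illustration)
df <- data.frame(
  city       = c("Springfield", "Springfield", "Portland"),
  county     = c("Clark", "Clark", "Multnomah"),
  state      = c("OH", "OH", "OR"),
  population = c(58662, 58700, 652503)
)

# Empty row selector, three columns: every row, only those columns
df[, c('city', 'county', 'state')]
```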
!duplicated() returns a logical vector: TRUE if the combination of city, county, and state has not been seen before, and FALSE if it is a duplicate. We then use that vector inside df[ , ] as the row selector and return all columns.
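Continuing with the same made-up toy data, you can inspect the logical vector by itself before using it as a row selector:

```r
df <- data.frame(
  city       = c("Springfield", "Springfield", "Portland"),
  county     = c("Clark", "Clark", "Multnomah"),
  state      = c("OH", "OH", "OR"),
  population = c(58662, 58700, 652503)
)

# Row 2 repeats row 1's city/county/state combination, so it is
# the only FALSE: c(TRUE, FALSE, TRUE)
!duplicated(df[, c('city', 'county', 'state')])
```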
df[!duplicated(df[, c('city', 'county', 'state')]), ]
!duplicated(df[, c('city', 'county', 'state')]) serves as the row selector, and leaving the column selector empty returns all the columns. ! is the negation operator, by the way. In short, all rows where !duplicated(df[, c('city', 'county', 'state')]) is TRUE (rows which are not duplicates) will be returned.
Hope you can build on that.
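Putting it all together on the same toy data (made-up values, for illustration only):

```r
df <- data.frame(
  city       = c("Springfield", "Springfield", "Portland"),
  county     = c("Clark", "Clark", "Multnomah"),
  state      = c("OH", "OH", "OR"),
  population = c(58662, 58700, 652503)
)

# Keeps the first occurrence of each city/county/state combination;
# the second Springfield row (population 58700) is dropped, and all
# columns, including population, are retained.
df[!duplicated(df[, c('city', 'county', 'state')]), ]
```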
Upvotes: 1