Reputation: 45
I have a dataframe of the populations of cities. In that data I store the city name, county, and state. Unfortunately, my data contains duplicates of some entries. The duplicates are not exactly identical, as they may contain slightly different coordinates for the location of the city or a slightly different population, so the typical distinct()
won't work here. Is there a way to remove rows that share the same city, county, and state, regardless of whether the other variables are unique or shared?
Upvotes: 1
Views: 41
Reputation: 3791
To explain my comment and turn it into an answer, as per your request:
df[!duplicated(df[, c('city', 'county', 'state')]), ]
df[, c('city', 'county', 'state')] subsets all rows of the columns city, county, and state. A subset in R works as data_frame[row, column]. Leaving the row selector empty returns all rows in the dataframe, and since we're passing multiple columns we wrap them inside c().
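As a concrete illustration, here is a toy dataframe (the city/county/state names and population figures are made up for this sketch) and the column subset on its own:

```r
# Toy data: rows 1 and 2 describe the same city with slightly
# different populations (hypothetical numbers for illustration)
df <- data.frame(
  city       = c("Springfield", "Springfield", "Portland"),
  county     = c("Clark", "Clark", "Multnomah"),
  state      = c("OH", "OH", "OR"),
  population = c(58662, 58700, 652503)
)

# Empty row selector, three columns: every row, only those columns
df[, c('city', 'county', 'state')]
```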
!duplicated() returns a logical vector: TRUE if the combination of city, county, and state has not been seen before, and FALSE if it is a duplicate. We then use that vector inside df[ , ] as the row selector and return all columns.
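Continuing with the same made-up toy data, you can inspect the logical vector by itself before using it as a row selector:

```r
df <- data.frame(
  city       = c("Springfield", "Springfield", "Portland"),
  county     = c("Clark", "Clark", "Multnomah"),
  state      = c("OH", "OH", "OR"),
  population = c(58662, 58700, 652503)
)

# Row 2 repeats row 1's city/county/state combination, so it is
# the only FALSE: c(TRUE, FALSE, TRUE)
!duplicated(df[, c('city', 'county', 'state')])
```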
df[!duplicated(df[, c('city', 'county', 'state')]), ]
!duplicated(df[, c('city', 'county', 'state')]) serves as the row selector, and leaving the column selector empty returns all the columns. ! is the negation operator, by the way. In short, all rows where !duplicated(df[, c('city', 'county', 'state')]) is TRUE (rows which are not duplicates) will be returned.
Hope you can build on that.
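Putting it all together on the same toy data (made-up values, for illustration only):

```r
df <- data.frame(
  city       = c("Springfield", "Springfield", "Portland"),
  county     = c("Clark", "Clark", "Multnomah"),
  state      = c("OH", "OH", "OR"),
  population = c(58662, 58700, 652503)
)

# Keeps the first occurrence of each city/county/state combination;
# the second Springfield row (population 58700) is dropped, and all
# columns, including population, are retained.
df[!duplicated(df[, c('city', 'county', 'state')]), ]
```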
Upvotes: 1