How to remove duplicate rows in R?

Question

I'm working with a data frame in R (for anyone familiar with the tidyverse, it's the starwars sample dataset).

I'm trying to create a tibble with two columns: homeworld and shortest_5 (the average height of the 5 shortest people from that homeworld).

Below is my code:

df<-starwars %>%
  group_by(homeworld) %>%
  filter(!is.na(height), !is.na(homeworld)) %>%
  arrange(desc(height)) %>%
  mutate(last5mean = mean(tail(height, 5))) %>%
  summarize(shortest_5=last5mean, number=n()) %>%
  filter(number>=5, ) 
df

It seems to work, although it is quite messy. My problem is that while the tibble does list homeworld and shortest_5, the same homeworld appears in multiple rows.

Seems like a simple fix but I can't quite wrap my head around it! Any help would be really appreciated!

deschen · Accepted Answer

You can considerably shorten your code:

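library(dplyr)

df <- starwars %>%
  group_by(homeworld) %>%
  # n() is evaluated per homeworld on the rows present before this filter runs
  filter(!is.na(height), !is.na(homeworld), n() >= 5) %>%
  # keep only the heights ranked 1-5 within each homeworld, then average them
  summarize(shortest_5 = mean(if_else(rank(height) > 5, NA_integer_, height), na.rm = TRUE))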

df

# # A tibble: 2 x 2
#   homeworld shortest_5
#   <chr>          <dbl>
# 1 Naboo           151.
# 2 Tatooine        153.
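
A likely reason for the repeated homeworlds in the question: mutate() keeps every row, so a group-level value gets copied onto each row of the group, and those copies survive if the value is not collapsed afterwards. Here is a minimal sketch of that effect and two ways around it. The toy tibble and the names repeated, deduped and collapsed are made up for illustration and are not the starwars data.

```r
library(dplyr)

# Hypothetical toy data, not the real starwars values
toy <- tibble(
  homeworld = c("A", "A", "A", "B", "B"),
  height    = c(150, 160, 170, 180, 190)
)

# mutate() keeps all rows: the group mean is repeated once per row
repeated <- toy %>%
  group_by(homeworld) %>%
  mutate(mean_height = mean(height)) %>%
  ungroup() %>%
  select(homeworld, mean_height)

# Option 1: drop the repeated rows afterwards
deduped <- distinct(repeated)

# Option 2: compute the value inside summarize(), which returns one row per group
collapsed <- toy %>%
  group_by(homeworld) %>%
  summarize(mean_height = mean(height))
```

Option 2 is usually the tidier choice, since the duplicates never get created in the first place.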

Note:

I get different results from yours. For example, on Naboo the 5 shortest characters have heights 96, 157, 165, 165 and 170, and the mean of these 5 values is 150.6.
You shouldn't have values for, e.g., Coruscant, since there are only 3 characters from that homeworld. The only two homeworlds with at least 5 characters are Naboo and Tatooine.
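
If ties in height are a concern (rank() gives tied values an averaged rank by default, so a rank cutoff can keep more or fewer than 5 rows), one alternative is slice_min() with with_ties = FALSE, which keeps exactly 5 rows per group. This is only a sketch on a hypothetical toy tibble: the Naboo heights reuse the five values quoted above plus two invented taller ones, and the Coruscant rows are invented to show a group being dropped. It also counts rows after removing missing heights, which is an assumption about what "at least 5" should mean.

```r
library(dplyr)

# Hypothetical toy data: not the real starwars table
toy <- tibble(
  homeworld = c(rep("Naboo", 7), rep("Coruscant", 3)),
  height    = c(96, 157, 165, 165, 170, 183, 185, 167, 184, 170)
)

shortest <- toy %>%
  # drop incomplete rows first so the group size check only counts usable rows
  filter(!is.na(height), !is.na(homeworld)) %>%
  group_by(homeworld) %>%
  # keep homeworlds with at least 5 remaining characters
  filter(n() >= 5) %>%
  # exactly the 5 smallest heights per homeworld, ties broken by row order
  slice_min(height, n = 5, with_ties = FALSE) %>%
  # one row per homeworld
  summarize(shortest_5 = mean(height))
```

On this toy data only Naboo remains, with a shortest_5 of 150.6.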
