Reputation: 43
I have a data frame in R (for anyone familiar with the tidyverse, it's the starwars sample dataset).
I'm trying to create a tibble with two columns: homeworld and shortest_5 (the average height of the shortest 5 people from that homeworld).
Below is my code:
df<-starwars %>%
group_by(homeworld) %>%
filter(!is.na(height), !is.na(homeworld)) %>%
arrange(desc(height)) %>%
mutate(last5mean = mean(tail(height, 5))) %>%
summarize(shortest_5=last5mean, number=n()) %>%
filter(number>=5, )
df
It seems that I've successfully done so (though it is quite messy). My problem is that although my tibble does list homeworld and shortest_5, it repeats the same homeworld multiple times.
It seems like a simple fix, but I can't quite wrap my head around it! Any help would be really appreciated!
Upvotes: 2
Views: 1973
Reputation: 475
You can get rid of duplicate data using the duplicated() function.
For example:
df <- c(1,1,2,3,4,4,5,6,10,10,10)
Check which values are duplicated:
df[duplicated(df)] # returns 1, 4, 10, 10 (every repeat after the first occurrence)
Note: if your df has more than one column, add a comma so that you index rows, as in New_DF <- df[!duplicated(df), ]
Remove duplicates:
New_DF <- df[!duplicated(df)] # keeps only the first occurrence of each value
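The note about the comma can be illustrated with a small base R sketch. The data frame `scores` and its values are made up for this example; they only mimic the shape of the repeated output described in the question.

```r
# Hypothetical two-column data frame in which every row appears twice
scores <- data.frame(
  homeworld  = c("Naboo", "Naboo", "Tatooine", "Tatooine"),
  shortest_5 = c(10, 10, 20, 20)
)

# duplicated() returns one logical value per row; TRUE marks a row that
# repeats an earlier row. The trailing comma selects rows, keeping all columns.
unique_scores <- scores[!duplicated(scores), ]
print(unique_scores)
```

Without the comma, `scores[!duplicated(scores)]` would try to select columns rather than rows, which is not what is wanted here.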
Upvotes: 3
Reputation: 10996
You can considerably shorten your code:
df <- starwars %>%
  group_by(homeworld) %>%
  filter(!is.na(height), !is.na(homeworld), n() >= 5) %>%
  summarize(shortest_5 = mean(if_else(rank(height) > 5, NA_integer_, height), na.rm = TRUE))
df
# # A tibble: 2 x 2
#   homeworld shortest_5
#   <chr>          <dbl>
# 1 Naboo           151.
# 2 Tatooine        153.
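For context, the repeats in the question come from the mutate() step: it keeps one row per character, so last5mean holds one value per row, and summarize() then returns all of those rows for each homeworld. Another way to get one row per homeworld is to keep only the five shortest rows per group before averaging. This is a sketch, not the answerer's method. It assumes dplyr 1.0.0 or later for slice_min(), and the name `shortest` is introduced only for illustration.

```r
library(dplyr)

shortest <- starwars %>%
  # drop missing values first, so the group sizes below count usable rows only
  filter(!is.na(height), !is.na(homeworld)) %>%
  group_by(homeworld) %>%
  # keep homeworlds with at least 5 characters of known height
  filter(n() >= 5) %>%
  # keep exactly the 5 smallest heights per homeworld; ties are cut at 5 rows
  slice_min(height, n = 5, with_ties = FALSE) %>%
  # one value per group, so the result has one row per homeworld
  summarize(shortest_5 = mean(height))

shortest
```

Setting with_ties = FALSE is a deliberate choice here: it guarantees exactly five rows per group, whereas a rank()-based filter can keep more or fewer rows when heights are tied around the fifth position.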
Upvotes: 2