SEMson

Reputation: 1355

Use dplyr's filter and mutate to generate a new variable

I chose the hflights dataset as an example.

I am trying to create a variable/column that contains the "TailNum" of each plane, but only for the planes among the 10% with the longest air time.

install.packages("hflights")  # only needed once
library(hflights)
library(dplyr)
flights <- as_tibble(hflights)  # tbl_df() is deprecated in current dplyr
flights %>% filter(cume_dist(desc(AirTime)) < 0.1) %>% mutate(new_var = TailNum)

EDIT: The resulting data frame has only 22208 observations instead of 227496. Is there a way to keep the original data frame but add a new variable that holds the TailNum only for the planes in the top 10% by air time?

Upvotes: 2

Views: 4504

Answers (1)

r.bot

Reputation: 5424

You don't need to pass flights to mutate() after the pipe; %>% already supplies the data frame as the first argument.

flights %>% filter(cume_dist(desc(AirTime)) < 0.1) %>% mutate(new = TailNum)

Also, new is a function, so best avoid that as a variable name. See ?new. As an illustration:

flights <- as_tibble(hflights)  # tbl_df() is deprecated in current dplyr
flights %>% filter(cume_dist(desc(AirTime)) < 0.1) %>%
  mutate(new_var = TailNum, new = TailNum) %>%
  select(AirTime, TailNum, new_var)
Source: local data frame [22,208 x 3]

   AirTime TailNum new_var
1      255  N614AS  N614AS
2      257  N627AS  N627AS
3      260  N627AS  N627AS
4      268  N618AS  N618AS
5      273  N607AS  N607AS
6      278  N624AS  N624AS
7      274  N611AS  N611AS
8      269  N607AS  N607AS
9      253  N609AS  N609AS
10     315  N626AS  N626AS
..     ...     ...     ...

To retain all observations, drop the filter() and compute the condition inside mutate(). My usual approach is ifelse(), which sets new_var to NA for every row outside the top 10%. Others may be able to suggest a better solution.

f2 <- flights %>% mutate(cumdist = cume_dist(desc(AirTime)), 
                   new_var = ifelse(cumdist < 0.1, TailNum, NA)) %>%
  select(AirTime, TailNum, cumdist, new_var)

table(is.na(f2$new_var))

 FALSE   TRUE 
 22208 205288 
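For anyone who wants to try the idea without installing hflights, here is a minimal sketch on a made-up data frame. The toy values and the names toy and toy2 are assumptions for illustration only. It swaps in dplyr's if_else() with NA_character_ as a stricter variant of ifelse(), and includes one missing AirTime to show that such rows simply end up as NA.

```r
library(dplyr)

# Hypothetical stand-in for hflights: 25 flights with known air time, 1 missing
toy <- tibble(
  TailNum = paste0("N", 1:26),
  AirTime = c(seq(10, 250, by = 10), NA)
)

toy2 <- toy %>%
  mutate(
    # share of flights with an air time at least as long as this one
    cumdist = cume_dist(desc(AirTime)),
    # keep TailNum only for the top 10%; every other row gets NA
    new_var = if_else(cumdist < 0.1, TailNum, NA_character_)
  )

nrow(toy2)                  # all rows are kept
table(is.na(toy2$new_var))  # how many rows received a TailNum
```

if_else() requires both branches to have the same type, hence NA_character_ rather than a bare NA; whether that strictness is worth it over ifelse() is a matter of taste.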

Upvotes: 4
