alf10087
alf10087

Reputation: 25

Generate new variable with conditional means and NA's

I have been struggling for a while with what feels like should be a very simple operation, and have tried different methods but none of them appear to be fruitful.

I have a dataset that looks like this:

df <- data.frame(name = c("john", "paul", "ringo", "george", "john", "paul", "ringo", "george", "john", "paul", "ringo", "george"), 
                 year = c(2018, 2018, 2018, 2018, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016),
                 station1 = c(1, 2, 3, NA, 2, NA, 5, 6, 7, 8, 9, 0),
                 station2 = c(NA, 6, 8, 1, 2, 6, NA, 1, NA, 1, 5, 3),
                 station3 = c(NA, 2, 3, 5, 1, NA, 1, 5, 3, 1, 2, 3),
                 station4 = c(9, 8, 7, 6, NA, 8, 12, 8, 83, 4, 3, NA))

Now, what I need, is to create a new variable, let's call it new_station, that takes a value conditional on each name at every given year. For example:

I have tried several combinations of filter, select and mutate, along the lines of:

df %>%
  filter(name == "john") %>%
  select(station1, station3) %>%
  mutate(new_station = rowMeans(c(station1, station3)))

But it won't let me assign the value to only the values of a single row. Some other attempts lead each row in the new column, to be the mean of all of the 6 cells (2 stations x 3 years) when I just need the mean for that specific year. Other methods I have tried, are not able to deal with the fact that there are some missing values and I need those omitted.

I need a sort of loop that is scalable, just changing the conditions for each name, since in real life I have something like a dataset of 21 names and 30 stations.

Any thoughts?

Note: In case it illustrates what I'm trying to do, I know how to do this in Stata. In Stata, for the name john, it would look something like:

egen new_station = rowmean(station1 station3) if name == "john"

I just need to do something like that in R.

Thank you!

Upvotes: 2

Views: 135

Answers (2)

Cole
Cole

Reputation: 11255

Here's a solution. It relies on creating a lookup table and taking the rowMeans() of a subset of a subset of data. :

library(data.table)

dt <- as.data.table(DF)
dt[, name := as.character(name)]

lookup <- list(john = c('station1', 'station3'),
               paul = 'station4',
               ringo = c('station1','station2','station3'),
               george = 'station4')

dt[,
   new_station := .SD[, rowMeans(.SD), .SDcols = lookup[[unlist(.BY)]]],
   by = name]
dt

Based on OP comments, it is safer to subset the dt on the names of the lookup table:

dt <- as.data.table(DF)
dt[, name := as.character(name)]

lookup[[4]] <- NULL
setdiff(dt[, name], names(lookup))

# error
dt[,
   new_station := .SD[, rowMeans(.SD), .SDcols = lookup[[unlist(.BY)]]],
   by = name]
# OK
dt[name %in% names(lookup),
   new_station := .SD[, rowMeans(.SD), .SDcols = lookup[[unlist(.BY)]]],
   by = name]

dt

To better understand what is happening, I recommend running the following lines:

dt <- as.data.table(DF)
# what is .SD?
dt[, print(.SD), by = name]
dt[, .SD[,print(.SD) , .SDcols = lookup[[unlist(.BY)]]], by = name]

#what is .BY?
dt[, print(.BY), by = name]
dt[, print(unlist(.BY)), by = name]
dt[, name := as.character(name)]
dt[, print(unlist(.BY)), by = name]

References:

Great explanation of Subset of Data.table: What does .SD stand for in data.table in R

Upvotes: 0

Gregor Thomas
Gregor Thomas

Reputation: 146030

I would convert the data to a long format, then use case_when. You can convert it back to wide when you're done, if you like.

df$id = 1:nrow(df)

library(tidyr) 
df %>% pivot_longer(
    cols = starts_with("station"), 
    names_to = "station", names_prefix = "station",
    values_to = "value"
  ) %>%
  group_by(name, year) %>%
  mutate(result = case_when(
    name == "john" ~ mean(value[station %in% c(1, 3)], na.rm = TRUE),
    name %in% c("paul", "george") ~ value[station == 4],
    name == "ringo" ~ mean(value[station %in% c(1, 2, 3)], na.rm = TRUE)
  ))
# # A tibble: 48 x 6
# # Groups:   name, year [12]
#    name   year    id station value result
#    <fct> <dbl> <int> <chr>   <dbl>  <dbl>
#  1 john   2018     1 1           1   1   
#  2 john   2018     1 2          NA   1   
#  3 john   2018     1 3          NA   1   
#  4 john   2018     1 4           9   1   
#  5 paul   2018     2 1           2   8   
#  6 paul   2018     2 2           6   8   
#  7 paul   2018     2 3           2   8   
#  8 paul   2018     2 4           8   8   
#  9 ringo  2018     3 1           3   4.67
# 10 ringo  2018     3 2           8   4.67
# # ... with 38 more rows

Upvotes: 3

Related Questions