Reputation: 25
I have been struggling for a while with what feels like should be a very simple operation, and have tried different methods but none of them appear to be fruitful.
I have a dataset that looks like this:
df <- data.frame(name = c("john", "paul", "ringo", "george", "john", "paul", "ringo", "george", "john", "paul", "ringo", "george"),
year = c(2018, 2018, 2018, 2018, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016),
station1 = c(1, 2, 3, NA, 2, NA, 5, 6, 7, 8, 9, 0),
station2 = c(NA, 6, 8, 1, 2, 6, NA, 1, NA, 1, 5, 3),
station3 = c(NA, 2, 3, 5, 1, NA, 1, 5, 3, 1, 2, 3),
station4 = c(9, 8, 7, 6, NA, 8, 12, 8, 83, 4, 3, NA))
Now, what I need, is to create a new variable, let's call it new_station, that takes a value conditional on each name at every given year. For example:
I have tried several combinations of filter, select and mutate, along the lines of:
df %>%
filter(name == "john") %>%
select(station1, station3) %>%
mutate(new_station = rowMeans(c(station1, station3)))
But it won't let me assign the value to only the values of a single row. Some other attempts lead each row in the new column, to be the mean of all of the 6 cells (2 stations x 3 years) when I just need the mean for that specific year. Other methods I have tried, are not able to deal with the fact that there are some missing values and I need those omitted.
I need a sort of loop that is scalable, just changing the conditions for each name, since in real life I have something like a dataset of 21 names and 30 stations.
Any thoughts?
Note: In case it illustrates what I'm trying to do, I know how to do this in Stata. In Stata, for the name john, it would look something like:
egen new_station = rowmean(station1 station3) if name == "john"
I just need to do something like that in R.
Thank you!
Upvotes: 2
Views: 135
Reputation: 11255
Here's a data.table solution. It relies on creating a lookup table and taking the rowMeans()
of a subset of a subset of data. :
library(data.table)
dt <- as.data.table(DF)
dt[, name := as.character(name)]
lookup <- list(john = c('station1', 'station3'),
paul = 'station4',
ringo = c('station1','station2','station3'),
george = 'station4')
dt[,
new_station := .SD[, rowMeans(.SD), .SDcols = lookup[[unlist(.BY)]]],
by = name]
dt
Based on OP comments, it is safer to subset the dt
on the names of the lookup
table:
dt <- as.data.table(DF)
dt[, name := as.character(name)]
lookup[[4]] <- NULL
setdiff(dt[, name], names(lookup))
# error
dt[,
new_station := .SD[, rowMeans(.SD), .SDcols = lookup[[unlist(.BY)]]],
by = name]
# OK
dt[name %in% names(lookup),
new_station := .SD[, rowMeans(.SD), .SDcols = lookup[[unlist(.BY)]]],
by = name]
dt
To better understand what is happening, I recommend running the following lines:
dt <- as.data.table(DF)
# what is .SD?
dt[, print(.SD), by = name]
dt[, .SD[,print(.SD) , .SDcols = lookup[[unlist(.BY)]]], by = name]
#what is .BY?
dt[, print(.BY), by = name]
dt[, print(unlist(.BY)), by = name]
dt[, name := as.character(name)]
dt[, print(unlist(.BY)), by = name]
References:
Great explanation of Subset of Data.table: What does .SD stand for in data.table in R
Upvotes: 0
Reputation: 146030
I would convert the data to a long format, then use case_when
. You can convert it back to wide when you're done, if you like.
df$id = 1:nrow(df)
library(tidyr)
df %>% pivot_longer(
cols = starts_with("station"),
names_to = "station", names_prefix = "station",
values_to = "value"
) %>%
group_by(name, year) %>%
mutate(result = case_when(
name == "john" ~ mean(value[station %in% c(1, 3)], na.rm = TRUE),
name %in% c("paul", "george") ~ value[station == 4],
name == "ringo" ~ mean(value[station %in% c(1, 2, 3)], na.rm = TRUE)
))
# # A tibble: 48 x 6
# # Groups: name, year [12]
# name year id station value result
# <fct> <dbl> <int> <chr> <dbl> <dbl>
# 1 john 2018 1 1 1 1
# 2 john 2018 1 2 NA 1
# 3 john 2018 1 3 NA 1
# 4 john 2018 1 4 9 1
# 5 paul 2018 2 1 2 8
# 6 paul 2018 2 2 6 8
# 7 paul 2018 2 3 2 8
# 8 paul 2018 2 4 8 8
# 9 ringo 2018 3 1 3 4.67
# 10 ringo 2018 3 2 8 4.67
# # ... with 38 more rows
Upvotes: 3