Reputation: 671
I am working with a data set of changes over time and need to calculate the time at which the peak change occurs. I am running into a problem because some subjects have missing data (NA's).
Example:
library(dplyr)
Data <- structure(list(Subject = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L), .Label = c("1", "10", "11", "12", "13", "14", "16",
"17", "18", "19", "2", "20", "21", "22", "23", "24", "25", "26",
"27", "28", "29", "3", "31", "32", "4", "5", "7", "8", "9"), class = "factor"),
Close = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L
), .Label = c("High Predictability", "Low Predictability"
), class = "factor"), SOA = structure(c(2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L), .Label = c("Long SOA", "Short SOA"), class = "factor"),
Time = c(-66.68, -66.68, -66.68, -66.68, -33.34, -33.34,
-33.34, -33.34, 0, 0, 0, 0, 33.34, 33.34, 33.34, 33.34, 66.68,
66.68, 66.68, 66.68, -66.68, -66.68, -66.68, -66.68, -33.34,
-33.34, -33.34, -33.34, 0, 0, 0, 0, 33.34, 33.34, 33.34,
33.34, 66.68, 66.68, 66.68, 66.68), Pcent_Chng = c(0.12314,
0.048254, -0.098007, 0.023216, 0.20327, 0.08338, -0.15157,
0.030008, 0.26442, 0.12019, -0.22878, 0.035547, 0.31849,
0.15488, -0.26887, 0.038992, 0.39489, 0.15112, -0.31185,
0.02144, NA, 0.046474, NA, 0.17541, NA, 0.14975, NA, 0.3555,
NA, -0.1736, NA, 0.72211, NA, -0.32201, NA, 1.0926, NA, -0.39551,
0.72211, 1.4406)), class = "data.frame", row.names = c(NA, -40L
), .Names = c("Subject", "Close", "SOA", "Time", "Pcent_Chng"
))
I get an error with the following attempt:
Data %>%
  group_by(Subject, Close, SOA) %>%
  summarize(Peak_Pcent = max(Pcent_Chng),
            Peak_Latency = Time[which.max(Pcent_Chng)])
The error is:
Error in summarise_impl(.data, dots) :
Column `Peak_Latency` must be length 1 (a summary value), not 0
This seems to be due to the NA's, which are only in some SOA conditions. Using complete.cases() with my actual data is too aggressive and removes too much data.
Is there a workaround to ignore the NA's?
Upvotes: 3
Views: 1721
Reputation: 39154
You have one group in which Pcent_Chng is all NA, and another group with only one non-NA Pcent_Chng value. I think it is better to filter out the group where Pcent_Chng is all NA, and to set na.rm = TRUE when using the max function.
Data %>%
  group_by(Subject, Close, SOA) %>%
  filter(!all(is.na(Pcent_Chng))) %>%  # Drop groups where Pcent_Chng is all NA
  summarize(Peak_Pcent = max(Pcent_Chng, na.rm = TRUE),  # Ignore NAs when taking the max
            Peak_Latency = Time[which.max(Pcent_Chng)])
# # A tibble: 7 x 5
# # Groups: Subject, Close [?]
# Subject Close SOA Peak_Pcent Peak_Latency
# <fctr> <fctr> <fctr> <dbl> <dbl>
# 1 1 High Predictability Long SOA 0.154880 33.34
# 2 1 High Predictability Short SOA 0.394890 66.68
# 3 1 Low Predictability Long SOA 0.038992 33.34
# 4 1 Low Predictability Short SOA -0.098007 -66.68
# 5 14 High Predictability Long SOA 0.149750 -33.34
# 6 14 Low Predictability Long SOA 1.440600 66.68
# 7 14 Low Predictability Short SOA 0.722110 66.68
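To see why the original call fails and why both changes matter, here is a minimal base R sketch. The vectors are made-up stand-ins for a single group's Time and Pcent_Chng, not rows taken from the question's data.

```r
# Hypothetical stand-ins for one group's Time and Pcent_Chng columns
time    <- c(-33.34, 0, 33.34)
all_na  <- c(NA_real_, NA_real_, NA_real_)  # a group with no usable values
some_na <- c(NA, 0.72211, NA)               # a group with one usable value

idx_all  <- which.max(all_na)   # integer(0): there is no non-NA value to pick
idx_some <- which.max(some_na)  # which.max() skips NAs on its own

length(time[idx_all])           # 0, the "length 1 ... not 0" case summarize() rejects
time[idx_some]                  # time of the only non-NA value

max(some_na)                    # NA, because na.rm defaults to FALSE
max(some_na, na.rm = TRUE)      # the non-NA maximum
```

So the filter step guards against the zero-length result from which.max, while na.rm = TRUE is only needed for max itself.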
Upvotes: 1
Reputation: 400
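This should do the trick, with one caveat: max is called without na.rm = TRUE, so any group that contains an NA gets Peak_Pcent = NA and is dropped by the filter (here Subject 14, Low Predictability, Short SOA):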
Data %>%
  group_by(Subject, Close, SOA) %>%
  mutate(Peak_Pcent = max(Pcent_Chng)) %>%
  arrange(Subject, Close, SOA) %>%
  filter(Peak_Pcent == Pcent_Chng)
The output:
# A tibble: 6 x 6
# Groups: Subject, Close, SOA [6]
Subject Close SOA Time Pcent_Chng Peak_Pcent
<fctr> <fctr> <fctr> <dbl> <dbl> <dbl>
1 1 High Predictability Long SOA 33.34 0.154880 0.154880
2 1 High Predictability Short SOA 66.68 0.394890 0.394890
3 1 Low Predictability Long SOA 33.34 0.038992 0.038992
4 1 Low Predictability Short SOA -66.68 -0.098007 -0.098007
5 14 High Predictability Long SOA -33.34 0.149750 0.149750
6 14 Low Predictability Long SOA 66.68 1.440600 1.440600
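If groups with only some missing values should be kept, one possible variant (not the code above, and only a sketch) is to pick the peak row with slice and which.max. The toy data frame and its grp column are invented for illustration; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data: group "a" is complete, "b" has one non-NA value,
# and "c" is entirely NA.
toy <- data.frame(
  grp        = rep(c("a", "b", "c"), each = 3),
  Time       = rep(c(-33.34, 0, 33.34), times = 3),
  Pcent_Chng = c(0.1, 0.3, 0.2,  NA, NA, 0.7,  NA, NA, NA)
)

peaks <- toy %>%
  group_by(grp) %>%
  # which.max() ignores NAs and returns integer(0) for an all-NA group,
  # so slice() keeps the peak row where one exists and no row otherwise.
  slice(which.max(Pcent_Chng)) %>%
  ungroup() %>%
  rename(Peak_Latency = Time, Peak_Pcent = Pcent_Chng)

peaks
```

Groups "a" and "b" each contribute one row and "c" contributes none. Note that which.max returns only the first maximum, so ties within a group yield a single row, unlike the filter approach.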
Upvotes: 0