Reputation: 378
I am trying to filter IDs based on a specified conditions. For instance, I want to filter IDs which have a particular difference in questionnaire score from pre to post treatment. The idea is to get IDs that have improved in score, have stayed the same or have worsened. Here is a mock data set of what I'm trying to achieve:-
ID<-c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","aaa","bbb","ccc")
Condition<-c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Post","Post","Post","Post","Post","Post","Post","Post","Post","Pre","Pre","Pre","Post","Post", "Post")
Score<-c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,23,23,21,20,18,11)
df<-cbind(ID,Condition,Score)
df<-as.data.frame(df)
df$Condition<-as.factor(df$Condition)
The main issue here is that there are ID's who appear in the data, for both pre and post, twice.
I have tried using a dplyr
solution to select the appropriate columns from the main data frame, and then use tidyverse
, and the spread
function to convert to wide format, because from there I can work out the differences easily enough. However, there is a particular problem that I've experienced. It wouldn't work because there are duplicate instances where an ID has appeared in the data again (for example, IDs aaa, bbb & ccc).
df2<-df%>%
group_by(ID)%>%
spread(Condition, Score)
This gets me the following error message:-
Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 12 rows: * 10, 22 * 11, 23 * 12, 24 * 1, 19 * 2, 20 * 3, 21 Do you need to create unique ID with tibble::rowid_to_column()?
Ideally, the kind of result I would be looking for is like so:-
#improved
ID Pre Post Difference
aaa 23 17 -6
bbb 20 17 -3
ggg 20 14 -6
hhh 19 15 -4
iii 18 10 -8
aaa 23 20 -3
bbb 23 18 -5
ccc 21 11 -10
#no improvement
ID Pre Post Difference
ccc 19 19 0
eee 22 22 0
fff 22 22 0
#worsened
ID Pre Post Difference
ddd 15 20 +5
Or something to this effect. As long as it allows me to include the repeated IDs. Ideally, I would like to be able to conditionally filter further based on the magnitude of difference. So for instance, if I want to subset/filter IDs with an improved score of more than 5, or a worsened score of more than 5. Bare in mind, my actual data set will have far more IDs to work with than the example that I just made up and provided. Any help would be greatly appreciated, as always.
Thank you in advance :)
Upvotes: 2
Views: 964
Reputation: 11255
The other answers address Score
being a factor due to the cbind()
call. Here are solutions for Base R, data.table, and dplyr.
All of the solutions address the duplicate ID
by adding an additional Group
variable. This allows the spread
to be successful.
# Base R ------------------------------------------------------------------
df <- data.frame(ID, Condition, Score)
df$Group <- ave(seq_len(nrow(df)), df$Condition, FUN = seq_along)
df_wide <- reshape(df, timevar = 'Condition', idvar = c('ID', 'Group'), direction = 'wide')
df_wide$Difference <- df_wide$Score.Post - df_wide$Score.Pre
df_wide[order(df_wide$Difference),]
# data.table --------------------------------------------------------------
library(data.table)
dt <- data.table(ID, Condition, Score)
dt[, Group := seq_len(.N), by = Condition]
dt_wide <- dcast(dt, ID + Group ~ Condition, value.var = 'Score')
dt_wide[, Difference := Post - Pre]
dt_wide[order(Difference),]
# dplyr -------------------------------------------------------------------
library(tidyverse)
tib <- tibble(ID, Condition, Score)
tib%>%
group_by(Condition)%>%
mutate(Group = row_number())%>%
ungroup()%>%
spread(key = 'Condition', value = 'Score')%>%
mutate(Difference = Post - Pre)%>%
arrange(Difference)
For this very small dataset, base R is fastest and data.table is slowest.
Unit: milliseconds
expr min lq mean median uq max neval
base_r_way 2.7562 2.98075 3.103155 3.05140 3.12810 6.0653 100
data.table_way 6.6137 7.09705 8.216043 7.44250 8.01885 47.9138 100
dplyr_way 4.7334 5.15005 5.350857 5.25085 5.40395 9.5594 100
And data:
ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Post","Post","Post","Post","Post","Post","Post","Post","Post","Pre","Pre","Pre","Post","Post", "Post")
Score <- as.integer(c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,23,23,21,20,18,11))
Upvotes: 1
Reputation: 40011
A different tidyverse
possibility could be:
df %>%
mutate_if(is.factor, as.character) %>%
mutate(Score = as.numeric(Score)) %>%
group_by(Condition) %>%
mutate(ID = make.unique(ID)) %>%
group_by(ID) %>%
mutate(Difference = Score - lag(Score)) %>%
spread(Condition, Score) %>%
summarise_all(max, na.rm = TRUE) %>%
arrange(Difference)
ID Difference Post Pre
<chr> <dbl> <dbl> <dbl>
1 ccc.1 -10 11 21
2 iii -8 10 18
3 aaa -6 17 23
4 ggg -6 14 20
5 bbb.1 -5 18 23
6 hhh -4 15 19
7 aaa.1 -3 20 23
8 bbb -3 17 20
9 ccc 0 19 19
10 eee 0 22 22
11 fff 0 22 22
12 ddd 5 20 15
Here it, first, creates unique IDs. Second, it calculates the difference. Finally, it transforms it into a wide format and arranges it according the differences.
If for some reasons you need to split it according differences, you can add the last line from @akrun's code:
df %>%
mutate_if(is.factor, as.character) %>%
mutate(Score = as.numeric(Score)) %>%
group_by(Condition) %>%
mutate(ID = make.unique(ID)) %>%
group_by(ID) %>%
mutate(Difference = Score - lag(Score)) %>%
spread(Condition, Score) %>%
summarise_all(max, na.rm = TRUE) %>%
group_split(sign(Difference), keep = FALSE)
[[1]]
# A tibble: 8 x 4
ID Difference Post Pre
<chr> <dbl> <dbl> <dbl>
1 aaa -6 17 23
2 aaa.1 -3 20 23
3 bbb -3 17 20
4 bbb.1 -5 18 23
5 ccc.1 -10 11 21
6 ggg -6 14 20
7 hhh -4 15 19
8 iii -8 10 18
[[2]]
# A tibble: 3 x 4
ID Difference Post Pre
<chr> <dbl> <dbl> <dbl>
1 ccc 0 19 19
2 eee 0 22 22
3 fff 0 22 22
[[3]]
# A tibble: 1 x 4
ID Difference Post Pre
<chr> <dbl> <dbl> <dbl>
1 ddd 5 20 15
Upvotes: 2
Reputation: 887421
An option would be to first convert the 'Score' to numeric
from factor
, grouped by 'ID' 'Condition', create a sequence column ('rn'), spread
to 'wide' format, get the diifference of 'Post' and 'Pre' Scores, and split
by the sign
of 'Difference' column to create a list
of tibble
s
library(tidyverse)
df %>%
mutate(Score = as.numeric(as.character(Score))) %>%
group_by(ID, Condition) %>%
mutate(rn = row_number()) %>%
spread(Condition, Score) %>%
mutate(Difference = Post -Pre) %>%
ungroup %>%
select(-rn) %>%
group_split(grp = sign(Difference), keep = FALSE)
#[[1]]
# A tibble: 8 x 4
# ID Post Pre Difference
# <fct> <dbl> <dbl> <dbl>
#1 aaa 17 23 -6
#2 aaa 20 23 -3
#3 bbb 17 20 -3
#4 bbb 18 23 -5
#5 ccc 11 21 -10
#6 ggg 14 20 -6
#7 hhh 15 19 -4
#8 iii 10 18 -8
#[[2]]
# A tibble: 3 x 4
# ID Post Pre Difference
# <fct> <dbl> <dbl> <dbl>
#1 ccc 19 19 0
#2 eee 22 22 0
#3 fff 22 22 0
#[[3]]
# A tibble: 1 x 4
# ID Post Pre Difference
# <fct> <dbl> <dbl> <dbl>
#1 ddd 20 15 5
NOTE: It is not advisable to use as.data.frame(cbind
, as cbind
converts to a matrix
and matrix
can hold only a single class, i.e.if there is a character column, all the other columns get converted to character
and wrapping with as.data.frame
(default option is stringsAsFactors = TRUE
).
df <- data.frame(...) #directly create
Upvotes: 2