metaltoaster
metaltoaster

Reputation: 378

Filtering ID's based on specified difference in values

I am trying to filter IDs based on a specified conditions. For instance, I want to filter IDs which have a particular difference in questionnaire score from pre to post treatment. The idea is to get IDs that have improved in score, have stayed the same or have worsened. Here is a mock data set of what I'm trying to achieve:-

    ID<-c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","aaa","bbb","ccc")
    Condition<-c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Post","Post","Post","Post","Post","Post","Post","Post","Post","Pre","Pre","Pre","Post","Post", "Post")
    Score<-c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,23,23,21,20,18,11)
    df<-cbind(ID,Condition,Score)
    df<-as.data.frame(df)
    df$Condition<-as.factor(df$Condition)

The main issue here is that there are ID's who appear in the data, for both pre and post, twice.

I have tried using a dplyr solution to select the appropriate columns from the main data frame, and then use tidyverse, and the spread function to convert to wide format, because from there I can work out the differences easily enough. However, there is a particular problem that I've experienced. It wouldn't work because there are duplicate instances where an ID has appeared in the data again (for example, IDs aaa, bbb & ccc).

     df2<-df%>%
     group_by(ID)%>%
     spread(Condition, Score)

This gets me the following error message:-

Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 12 rows: * 10, 22 * 11, 23 * 12, 24 * 1, 19 * 2, 20 * 3, 21 Do you need to create unique ID with tibble::rowid_to_column()?

Ideally, the kind of result I would be looking for is like so:-

    #improved
    ID      Pre       Post     Difference
    aaa      23        17           -6
    bbb      20        17           -3
    ggg      20        14           -6
    hhh      19        15           -4
    iii      18        10           -8
    aaa      23        20           -3
    bbb      23        18           -5
    ccc      21        11           -10


    #no improvement
    ID      Pre       Post      Difference
    ccc      19         19          0
    eee      22         22          0
    fff      22         22          0


    #worsened
    ID      Pre       Post      Difference
    ddd      15         20          +5

Or something to this effect. As long as it allows me to include the repeated IDs. Ideally, I would like to be able to conditionally filter further based on the magnitude of difference. So for instance, if I want to subset/filter IDs with an improved score of more than 5, or a worsened score of more than 5. Bare in mind, my actual data set will have far more IDs to work with than the example that I just made up and provided. Any help would be greatly appreciated, as always.

Thank you in advance :)

Upvotes: 2

Views: 964

Answers (3)

Cole
Cole

Reputation: 11255

The other answers address Score being a factor due to the cbind() call. Here are solutions for Base R, data.table, and dplyr.

All of the solutions address the duplicate ID by adding an additional Group variable. This allows the spread to be successful.

# Base R ------------------------------------------------------------------

df <- data.frame(ID, Condition, Score)
df$Group <- ave(seq_len(nrow(df)), df$Condition, FUN = seq_along)

df_wide <- reshape(df, timevar = 'Condition', idvar = c('ID', 'Group'), direction = 'wide')
df_wide$Difference <- df_wide$Score.Post - df_wide$Score.Pre
df_wide[order(df_wide$Difference),]

# data.table --------------------------------------------------------------
library(data.table)

dt <- data.table(ID, Condition, Score)
dt[, Group := seq_len(.N), by = Condition]

dt_wide <- dcast(dt, ID + Group ~ Condition, value.var = 'Score')
dt_wide[, Difference := Post - Pre]
dt_wide[order(Difference),]

# dplyr -------------------------------------------------------------------
library(tidyverse)

tib <- tibble(ID, Condition, Score)

tib%>%
  group_by(Condition)%>%
  mutate(Group = row_number())%>%
  ungroup()%>%
  spread(key = 'Condition', value = 'Score')%>%
  mutate(Difference = Post - Pre)%>%
  arrange(Difference)

For this very small dataset, base R is fastest and data.table is slowest.

Unit: milliseconds
           expr    min      lq     mean  median      uq     max neval
     base_r_way 2.7562 2.98075 3.103155 3.05140 3.12810  6.0653   100
 data.table_way 6.6137 7.09705 8.216043 7.44250 8.01885 47.9138   100
      dplyr_way 4.7334 5.15005 5.350857 5.25085 5.40395  9.5594   100

And data:

ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Post","Post","Post","Post","Post","Post","Post","Post","Post","Pre","Pre","Pre","Post","Post", "Post")
Score <- as.integer(c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,23,23,21,20,18,11))

Upvotes: 1

tmfmnk
tmfmnk

Reputation: 40011

A different tidyverse possibility could be:

df %>%
 mutate_if(is.factor, as.character) %>%
 mutate(Score = as.numeric(Score)) %>%
 group_by(Condition) %>%
 mutate(ID = make.unique(ID)) %>%
 group_by(ID) %>%
 mutate(Difference = Score - lag(Score)) %>%
 spread(Condition, Score) %>%
 summarise_all(max, na.rm = TRUE) %>%
 arrange(Difference)

   ID    Difference  Post   Pre
   <chr>      <dbl> <dbl> <dbl>
 1 ccc.1        -10    11    21
 2 iii           -8    10    18
 3 aaa           -6    17    23
 4 ggg           -6    14    20
 5 bbb.1         -5    18    23
 6 hhh           -4    15    19
 7 aaa.1         -3    20    23
 8 bbb           -3    17    20
 9 ccc            0    19    19
10 eee            0    22    22
11 fff            0    22    22
12 ddd            5    20    15

Here it, first, creates unique IDs. Second, it calculates the difference. Finally, it transforms it into a wide format and arranges it according the differences.

If for some reasons you need to split it according differences, you can add the last line from @akrun's code:

df %>%
 mutate_if(is.factor, as.character) %>%
 mutate(Score = as.numeric(Score)) %>%
 group_by(Condition) %>%
 mutate(ID = make.unique(ID)) %>%
 group_by(ID) %>%
 mutate(Difference = Score - lag(Score)) %>%
 spread(Condition, Score) %>%
 summarise_all(max, na.rm = TRUE) %>%
 group_split(sign(Difference), keep = FALSE)

[[1]]
# A tibble: 8 x 4
  ID    Difference  Post   Pre
  <chr>      <dbl> <dbl> <dbl>
1 aaa           -6    17    23
2 aaa.1         -3    20    23
3 bbb           -3    17    20
4 bbb.1         -5    18    23
5 ccc.1        -10    11    21
6 ggg           -6    14    20
7 hhh           -4    15    19
8 iii           -8    10    18

[[2]]
# A tibble: 3 x 4
  ID    Difference  Post   Pre
  <chr>      <dbl> <dbl> <dbl>
1 ccc            0    19    19
2 eee            0    22    22
3 fff            0    22    22

[[3]]
# A tibble: 1 x 4
  ID    Difference  Post   Pre
  <chr>      <dbl> <dbl> <dbl>
1 ddd            5    20    15

Upvotes: 2

akrun
akrun

Reputation: 887421

An option would be to first convert the 'Score' to numeric from factor, grouped by 'ID' 'Condition', create a sequence column ('rn'), spread to 'wide' format, get the diifference of 'Post' and 'Pre' Scores, and split by the sign of 'Difference' column to create a list of tibbles

library(tidyverse)
df %>% 
   mutate(Score = as.numeric(as.character(Score))) %>% 
   group_by(ID, Condition) %>% 
   mutate(rn = row_number()) %>% 
   spread(Condition, Score) %>% 
   mutate(Difference = Post -Pre) %>% 
   ungroup %>% 
   select(-rn) %>%
   group_split(grp = sign(Difference), keep = FALSE)
#[[1]]
# A tibble: 8 x 4
#  ID     Post   Pre Difference
#  <fct> <dbl> <dbl>      <dbl>
#1 aaa      17    23         -6
#2 aaa      20    23         -3
#3 bbb      17    20         -3
#4 bbb      18    23         -5
#5 ccc      11    21        -10
#6 ggg      14    20         -6
#7 hhh      15    19         -4
#8 iii      10    18         -8

#[[2]]
# A tibble: 3 x 4
#  ID     Post   Pre Difference
#  <fct> <dbl> <dbl>      <dbl>
#1 ccc      19    19          0
#2 eee      22    22          0
#3 fff      22    22          0

#[[3]]
# A tibble: 1 x 4
#  ID     Post   Pre Difference
#  <fct> <dbl> <dbl>      <dbl>
#1 ddd      20    15          5

NOTE: It is not advisable to use as.data.frame(cbind, as cbind converts to a matrix and matrix can hold only a single class, i.e.if there is a character column, all the other columns get converted to character and wrapping with as.data.frame (default option is stringsAsFactors = TRUE).

df <- data.frame(...) #directly create

Upvotes: 2

Related Questions