Reputation: 1040

Using the lag function with two different columns R

I have data such as this.

data.input <-read_table2("user.id   problem.id  first.correct   cwa_prob    prob.seq    
540995  PRABNBZQ    0   1   problem.id.problem.1    
540995  PRABNBZ2    1   0   problem.id.problem.2    
540995  PRABNBZK    0   1   problem.id.problem.3    
540995  PRABNB2B    0   1   problem.id.problem.4    
540995  PRABNBWE    0   1   problem.id.problem.5    
565662  PRABNB2B    1   0   problem.id.problem.1    
565662  PRABNBZQ    1   1   problem.id.problem.2    
565662  PRABNBZ9    1   0   problem.id.problem.3    
565662  PRABNBZV    1   0   problem.id.problem.4    
565662  PRABNBWG    1   0   problem.id.problem.5    
")

I am trying to use the lag function to compare the occurrence of a flag in two separate columns. I want to see how a person fairs sequentially from one problem to the next. Specifically, if cwa_prob is 1 for problem.id.problem.1 (user 540995), is it the case that first.correct==1 in problem.id.problem.2 (user 540995).

I tried something like this, but it throws me an error.

input.data %>% 
  group_by(user.id) %>% 
  mutate(post.cwa.correct=ifelse(cwa_prob==1 & lag(first.correct==1),1,0))

I would like my output to look something like this:

data.output <-read_table2("user.id  problem.id  first.correct   cwa_prob    prob.seq    post.cwa.correct
540995  PRABNBZQ    0   1   problem.id.problem.1    NA
540995  PRABNBZ2    1   0   problem.id.problem.2    1
540995  PRABNBZK    0   1   problem.id.problem.3    NA
540995  PRABNB2B    0   1   problem.id.problem.4    0
540995  PRABNBWE    0   1   problem.id.problem.5    0
565662  PRABNB2B    1   0   problem.id.problem.1    1
565662  PRABNBZQ    1   1   problem.id.problem.2    NA
565662  PRABNBZ9    1   0   problem.id.problem.3    1
565662  PRABNBZV    1   0   problem.id.problem.4    NA
565662  PRABNBWG    1   0   problem.id.problem.5    NA
")

Maybe the lag function isn't the best function for this? Any help is appreciated!

Upvotes: 0

Answers (3)

Ronak Shah

Reputation: 389325

Based on your description I think this is what you want :

library(dplyr)

data.input %>%
  group_by(user.id) %>%
  mutate(post.cwa.correct = as.integer(first.correct == 1 & lag(cwa_prob) == 1))

This will give a value of 1 when the current row has first.correct = 1 and the previous row has cwa_prob = 1 .

Note that you don't need ifelse(condition, 1, 0) since you can convert TRUE/FALSE values to 1/0 by converting them to integer. Also in your attempt you have lag(first.correct==1) which is different from lag(first.correct) ==1.

Upvotes: 1

James

Reputation: 589

I ran everything, and it worked fine. However the output you list as wanted doesn't actually meet your criteria...you'd want

data.input %>% 
  group_by(user.id) %>% 
  mutate(post.cwa.correct=ifelse(cwa_prob==0, NA, ifelse(
    cwa_prob==1 & lead(first.correct==1),1,0)
  ))

However, you list the dataset as "data.input," but then the dataset you use is listed as "input.data."

Upvotes: 0

Simon.S.A.

Reputation: 6941

Take a look at the lag documentation. There is an order_by argument that determines the sort order of your records.

Unless you sent this argument, R uses the existing/default order for your dataset. So different ordering will produce different answers. This is a likely explanation for why your code has not worked, but @James founds that the code ran fine.

input.data %>% 
  group_by(user.id) %>% 
  mutate(post.cwa.correct=ifelse(cwa_prob==1 & lag(first.correct==1, order_by = "prob.seq"),1,0))

Note that as the format of prob_seq is most likely character, the order_by clause will sort alphabetically. This means 'problem.10' will come before 'problem.2'. So you'll need to extract and convert the problem number to numeric if this occurs in your data.

Upvotes: 0

Using the lag function with two different columns R

Answers (3)

Related Questions