Reputation: 87
I've been trying to solve this for weeks but can't seem to get it.
I have the following data frame:
post_id user_id
1 post-1 user1
2 post-2 user2
3 comment-1 user1
4 comment-2 user3
5 comment-3 user4
6 post-3 user2
7 comment-4 user2
I want to create a new variable, parent_id. For each observation, it should perform the following steps:
1. Determine whether post_id is a post or a comment.
2. If post_id is a post, then parent_id should equal the earliest post_id of the whole data frame.
3. If post_id is the first post, then parent_id should equal NA.
4. If post_id is a comment, then parent_id should equal the first post_id it encounters above it, i.e. the nearest preceding post.
The output should look something like:
post_id user_id parent_id_man
1 post-1 user1 NA
2 post-2 user2 post-1
3 comment-1 user1 post-2
4 comment-2 user3 post-2
5 comment-3 user4 post-2
6 post-3 user2 post-1
7 comment-4 user2 post-3
I have tried the following:
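# Prepare data
library(dplyr)
library(tidyr)
# Split post_id into its type ("post" or "comment") and its number
df <- df %>% separate(post_id, into = c("type", "number"), sep = "-", remove = FALSE)
df$number <- as.numeric(df$number)
# 99999 is a sentinel so that min() only picks up real post/comment numbers
df <- df %>% mutate(comment_number = ifelse(type == "comment", number, 99999))
df <- df %>% mutate(post_number = ifelse(type == "post", number, 99999))
# Create parent_id column: posts get the earliest post, comments get a placeholder 0
df <- df %>% mutate(parent_id = ifelse(type == "post", paste("post-", min(post_number), sep = ""), 0))
# The first post has no parent (note: this stores the string "NA", not a missing value)
df <- df %>% mutate(parent_id = ifelse(parent_id == post_id, "NA", parent_id))
df <- df %>% select(-comment_number, -post_number)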
With that code I am able to perform steps 1, 2 and 3, but step 4 is beyond me. I get the feeling that some kind of conditional lag should be able to solve it, but I can't work out how to do it.
Any ideas would be very much appreciated!
Upvotes: 4
Views: 1757
Reputation: 51582
Building on your solution,
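# Row indices of the posts and of the comments
x <- which(df$type == 'post')
z <- which(df$type == 'comment')
# findInterval(z, x) gives, for each comment row, the position in x of the last post at or before it
df$parent_id[df$parent_id == 0] <- df$post_id[x[findInterval(z, x)]]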
df$parent_id
#[1] "NA" "post-1" "post-2" "post-2" "post-2" "post-1" "post-3"
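If you would rather not depend on the intermediate columns, the same idea can be sketched in base R starting from the raw data frame. This is only a sketch under a few assumptions: the columns are character vectors, post ids start with "post-", and "earliest post" means the first post in row order (which matches the example, but is not the same as taking the lowest post number if your rows are unordered). The names is_post, post_rows and nearest are just illustrative. Unlike the code above, it stores a real missing value rather than the string "NA".

```r
# Example data from the question (columns assumed to be character vectors)
df <- data.frame(
  post_id = c("post-1", "post-2", "comment-1", "comment-2",
              "comment-3", "post-3", "comment-4"),
  user_id = c("user1", "user2", "user1", "user3", "user4", "user2", "user2"),
  stringsAsFactors = FALSE
)

# Assumption: every post id starts with "post-"
is_post   <- startsWith(df$post_id, "post-")
post_rows <- which(is_post)

# For every row, the position in post_rows of the last post at or before it
nearest <- findInterval(seq_len(nrow(df)), post_rows)
# A comment that appears before any post has no parent
nearest[nearest == 0] <- NA

# Step 4: comments get the nearest preceding post
parent_id <- df$post_id[post_rows[nearest]]

# Step 2: posts get the earliest post (assumed: first post in row order)
parent_id[is_post] <- df$post_id[post_rows[1]]

# Step 3: the first post itself has no parent
parent_id[post_rows[1]] <- NA

df$parent_id <- parent_id
df
```

If the rows are not in chronological order, sort the data frame first or derive the earliest post from the numeric part of post_id instead.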
Upvotes: 1