just_rookie

Reputation: 893

Add a column using dplyr in R based on whether a value is duplicated in other rows

I would like to add a column to a data frame based on whether a value is duplicated in other rows. My data frame looks like this:

group label value   newColumn
1     1     3
1     2     4
1     3     3
1     4     5
1     5     4
2     1     6
2     2     3
2     3     9
2     4     6
2     5     1
2     6     3

I want to add a column:

if df$value[i] is duplicated and df$value[i] is the original, set newColumn[i] to 0; 
if df$value[i] is duplicated and df$value[i] is the duplicate, set newColumn[i] to the label of the original;
if df$value[i] is not duplicated, set df$newColumn[i] to 0.

For example:

df$value[1] = 3 is duplicated, but it is the original, so we set newColumn[1] = 0;
df$value[3] = 3 is duplicated, and it is the duplicate, so we set newColumn[3] = 1 (=df$label[1]);

Here is my code:

library(dplyr)

df <- df %>%
  group_by(group) %>%
  mutate(
    newColumn = ifelse(row_number() == min(which(duplicated(value) | duplicated(value, fromLast = TRUE))),
                       label[max(which(duplicated(value) | duplicated(value, fromLast = TRUE)))],
                       0)
  )

but it does not produce the expected result. Any suggestions? Thank you in advance!

Upvotes: 1

Views: 849

Answers (2)

akrun

Reputation: 887048

We can also use data.table:

library(data.table)
# Group by value: the first row of each group gets 0, the rest get the first label
setDT(df)[, newColumn := c(0, rep(label[1L], .N - 1L)), by = value]

Upvotes: 2

bgoldst

Reputation: 35314

Here's a solution using ave():

df$newColumn <- ave(df$label,df$value,FUN=function(x) c(0L,rep(x[1L],length(x)-1L)))
df;
##    group label value newColumn
## 1      1     1     3         0
## 2      1     2     4         0
## 3      1     3     3         1
## 4      1     4     5         0
## 5      1     5     4         2
## 6      2     1     6         0
## 7      2     2     3         1
## 8      2     3     9         0
## 9      2     4     6         1
## 10     2     5     1         0
## 11     2     6     3         1

ave() breaks up the first argument into groups according to the second argument and calls the lambda once for each group. So, for example, for all rows where df$value is equal to 3, ave() will construct a vector consisting of all values of df$label from those rows, and call the lambda with x equal to that vector. The return value of the lambda call is expected to contain the same number of elements as the argument x (or it will be recycled as necessary to make it so).
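
To make the grouping concrete, here is a small sketch using the label and value columns from the question. The name first_or_zero is introduced here only for illustration, and split() is used just to show roughly which subvector each call of the function receives:

```r
# Sample data copied from the question (label and value columns only).
label <- c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L)
value <- c(3L, 4L, 3L, 5L, 4L, 6L, 3L, 9L, 6L, 1L, 3L)

# One label subvector per distinct value, similar to what ave() passes to FUN.
groups <- split(label, value)
groups[["3"]]  # labels of the rows where value == 3: 1 3 2 6

# The same function as in the ave() call, given a name for illustration.
first_or_zero <- function(x) c(0L, rep(x[1L], length(x) - 1L))
first_or_zero(groups[["3"]])  # 0 1 1 1

# ave() puts each result back at the positions of its input rows.
new_column <- ave(label, value, FUN = first_or_zero)
```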

The return values of all calls of the lambda are then combined into one final vector, with each element of each return value placed into the position corresponding to its counterpart from the input. This allows us to build the final column vector by group. Since your problem requires returning zero for the first element in each group and the original label value for all subsequent elements in each group, we can build that subvector easily in the lambda by combining zero with the original label value repeated sufficiently to cover the remainder of the group vector.
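
Since the question asked for dplyr, the same idea might be written as below. This is a sketch and not part of the approach above: it assumes the columns are numeric, rebuilds the sample data from the question, and uses res and res_within as illustrative names. Grouping by value alone should reproduce the ave() output, while grouping by group and value only looks for the original within the same group, which may or may not be what is wanted.

```r
library(dplyr)

# Sample data rebuilt from the question.
df <- data.frame(
  group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  label = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6),
  value = c(3, 4, 3, 5, 4, 6, 3, 9, 6, 1, 3)
)

# Duplicates detected across the whole data frame (same grouping as ave()).
res <- df %>%
  group_by(value) %>%
  mutate(newColumn = if_else(row_number() == 1L, 0, first(label))) %>%
  ungroup()

# Duplicates detected only within each group (an assumption about the intent).
res_within <- df %>%
  group_by(group, value) %>%
  mutate(newColumn = if_else(row_number() == 1L, 0, first(label))) %>%
  ungroup()
```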

Upvotes: 2
