Fred Poole

Reputation: 35

Sequencing and Evaluating Duplicates in a vector

I'm trying to create a variable that identifies whether a string within a vector is appearing for the first time, is within its first three occurrences, or has already appeared more than three times.

In the data set below, I have name (there will be more names), text, and a dup variable. I want dup to identify whether the text is appearing for the first time (origin), is the second or third occurrence (FirstThree), or has appeared more than three times (MoreThanThree). I will also need to do that for each person... but I think I can figure that part out. Thanks in advance for any help!

name =c("T","T","T","T","T","T","T","T","T","T")
text =c("a","b","a","a","b","c","a","a","b","a")
dup =c("origin","origin","FirstThree","FirstThree","FirstThree","origin","MoreThanThree","MoreThanThree","FirstThree","MoreThanThree")
dfA = data.frame(name,text,dup)

   name text           dup
1     T    a        origin
2     T    b        origin
3     T    a    FirstThree
4     T    a    FirstThree
5     T    b    FirstThree
6     T    c        origin
7     T    a MoreThanThree
8     T    a MoreThanThree
9     T    b    FirstThree
10    T    a MoreThanThree

Upvotes: 2

Views: 41

Answers (2)

Ronak Shah

Reputation: 388817

With dplyr, we can group by text, number the rows within each group with row_number(), and map that count to a label in a case_when statement.

library(dplyr)

dfA %>%
  group_by(text) %>%
  mutate(row = row_number(), 
         dup = case_when(row == 1 ~ "origin", 
                         row <= 3 ~ "FirstThree", 
                         TRUE ~ "MoreThanThree"))

#   name  text    row dup          
#   <fct> <fct> <int> <chr>        
# 1 T     a         1 origin       
# 2 T     b         1 origin       
# 3 T     a         2 FirstThree   
# 4 T     a         3 FirstThree   
# 5 T     b         2 FirstThree   
# 6 T     c         1 origin       
# 7 T     a         4 MoreThanThree
# 8 T     a         5 MoreThanThree
# 9 T     b         3 FirstThree   
#10 T     a         6 MoreThanThree

We can remove the row column later if not needed.
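
Since the question mentions doing this for each person, here is a minimal sketch of that variation, assuming the counts should restart for every name. The two-person data frame dfB is made up for illustration and is not from the question. It groups by both name and text, then drops the helper row column with select().

```r
library(dplyr)

# Hypothetical two-person data (not from the question)
dfB <- data.frame(
  name = c("T", "T", "T", "T", "T", "S", "S"),
  text = c("a", "a", "a", "a", "b", "a", "a"),
  stringsAsFactors = FALSE
)

res <- dfB %>%
  group_by(name, text) %>%              # count occurrences per person and text
  mutate(row = row_number(),
         dup = case_when(row == 1 ~ "origin",
                         row <= 3 ~ "FirstThree",
                         TRUE ~ "MoreThanThree")) %>%
  ungroup() %>%
  select(-row)                          # drop the helper column

res
```

With this grouping, the first "a" for S is labelled origin even though T has already used "a" several times.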

Upvotes: 0

IceCreamToucan

Reputation: 28675

You can use data.table::rowid with two ifelse checks

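library(data.table)
setDT(dfA)  # convert the data.frame to a data.table by reference

dfA[, ict := {
        r <- rowid(text)  # running count of each text value within the group
        ifelse(r == 1, 'origin', 
        ifelse(r <= 3, 'FirstThree', 
               'MoreThanThree'))}
    , by = name]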

dfA
#     name text           dup           ict
#  1:    T    a        origin        origin
#  2:    T    b        origin        origin
#  3:    T    a    FirstThree    FirstThree
#  4:    T    a    FirstThree    FirstThree
#  5:    T    b    FirstThree    FirstThree
#  6:    T    c        origin        origin
#  7:    T    a MoreThanThree MoreThanThree
#  8:    T    a MoreThanThree MoreThanThree
#  9:    T    b    FirstThree    FirstThree
# 10:    T    a MoreThanThree MoreThanThree

You could also use cut. The only difference is that it produces a factor rather than a character vector. This may be useful if you have more than three categories.

dfA[, ict := cut(rowid(text), c(0, 1, 3, Inf), 
                 labels = c('origin', 'FirstThree', 'MoreThanThree'))
    , by = name]
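
To illustrate the point about more categories, here is a rough sketch with four labels. The extra break at 5 and the labels FourOrFive and MoreThanFive are invented for this example, as is the column name ict4. Adding a category only means adding one break and one label.

```r
library(data.table)

# Same text values as in the question, rebuilt so the snippet is self-contained
dt <- data.table(name = "T",
                 text = c("a", "b", "a", "a", "b", "c", "a", "a", "b", "a"))

# Intervals are (0,1], (1,3], (3,5], (5,Inf]; the labels are hypothetical
dt[, ict4 := cut(rowid(text), c(0, 1, 3, 5, Inf),
                 labels = c('origin', 'FirstThree', 'FourOrFive', 'MoreThanFive'))
   , by = name]

dt
```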

Upvotes: 2
