Reputation: 35
I'm trying to create a variable that identifies whether a string within a vector is appearing for the first time, is among its first three occurrences, or has already appeared more than three times. For example:
In the data set below, I have name (there will be more names), text, and a dup variable. I want dup to identify whether the text is appearing for the first time (origin), is its second or third occurrence (FirstThree), or has appeared more than three times (MoreThanThree). I will also need to do this for each person, but I think I can figure that part out. Thanks in advance for any help!
name =c("T","T","T","T","T","T","T","T","T","T")
text =c("a","b","a","a","b","c","a","a","b","a")
dup =c("origin","origin","FirstThree","FirstThree","FirstThree","origin","MoreThanThree","MoreThanThree","FirstThree","MoreThanThree")
dfA = data.frame(name,text,dup)
   name text           dup
1     T    a        origin
2     T    b        origin
3     T    a    FirstThree
4     T    a    FirstThree
5     T    b    FirstThree
6     T    c        origin
7     T    a MoreThanThree
8     T    a MoreThanThree
9     T    b    FirstThree
10    T    a MoreThanThree
Upvotes: 2
Views: 41
Reputation: 388817
In dplyr, we can compare the row_number() within each text group in a case_when statement.
library(dplyr)
dfA %>%
  group_by(text) %>%
  # row is the running occurrence count of each text value
  mutate(row = row_number(),
         dup = case_when(row == 1 ~ "origin",
                         row <= 3 ~ "FirstThree",
                         TRUE ~ "MoreThanThree"))
#    name  text    row dup
#    <fct> <fct> <int> <chr>
#  1 T     a         1 origin
#  2 T     b         1 origin
#  3 T     a         2 FirstThree
#  4 T     a         3 FirstThree
#  5 T     b         2 FirstThree
#  6 T     c         1 origin
#  7 T     a         4 MoreThanThree
#  8 T     a         5 MoreThanThree
#  9 T     b         3 FirstThree
# 10 T     a         6 MoreThanThree
We can remove the row column later if it is not needed.
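Since the question also mentions doing this for each person, here is a minimal sketch of how that might look: it groups by both name and text and drops the helper column at the end. The second person "U" and the names dfB and result are made up for illustration and are not part of the original data.

```r
library(dplyr)

# Data from the question plus a hypothetical second person "U"
dfB <- data.frame(
  name = c(rep("T", 10), rep("U", 3)),
  text = c("a", "b", "a", "a", "b", "c", "a", "a", "b", "a", "a", "a", "b"),
  stringsAsFactors = FALSE
)

result <- dfB %>%
  group_by(name, text) %>%   # count occurrences separately for each person
  mutate(row = row_number(),
         dup = case_when(row == 1 ~ "origin",
                         row <= 3 ~ "FirstThree",
                         TRUE     ~ "MoreThanThree")) %>%
  ungroup() %>%
  select(-row)               # drop the helper column once dup is built
```

Grouping by name as well restarts the count for every person, so the first "a" for U is labelled origin even though T has already used "a" six times.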
Upvotes: 0
Reputation: 28675
You can use data.table::rowid with two ifelse checks:
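library(data.table)
setDT(dfA)  # convert the data.frame to a data.table by reference
dfA[, ict := {
  r <- rowid(text)  # running occurrence count of each text value
  ifelse(r == 1, 'origin',
         ifelse(r <= 3, 'FirstThree',
                'MoreThanThree'))}
  , by = name]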
dfA
# name text dup ict
# 1: T a origin origin
# 2: T b origin origin
# 3: T a FirstThree FirstThree
# 4: T a FirstThree FirstThree
# 5: T b FirstThree FirstThree
# 6: T c origin origin
# 7: T a MoreThanThree MoreThanThree
# 8: T a MoreThanThree MoreThanThree
# 9: T b FirstThree FirstThree
# 10: T a MoreThanThree MoreThanThree
You could also use cut. The only difference is that this produces a factor rather than a character vector. It may be useful if you have more than three categories.
dfA[, ict := cut(rowid(text), c(0, 1, 3, Inf),
labels = c('origin', 'FirstThree', 'MoreThanThree'))
, by = name]
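As a rough sketch of the "more than three categories" case, the breaks and labels can simply be extended. The extra break at 5 and the labels FourOrFive and MoreThanFive are invented here for illustration, as are the names dt and ict_chr.

```r
library(data.table)

# Same text values as in the question, built directly as a data.table
dt <- data.table(
  name = "T",
  text = c("a", "b", "a", "a", "b", "c", "a", "a", "b", "a")
)

# cut() intervals are right-closed by default: (0,1], (1,3], (3,5], (5,Inf]
dt[, ict := cut(rowid(text),
                breaks = c(0, 1, 3, 5, Inf),
                labels = c("origin", "FirstThree", "FourOrFive", "MoreThanFive")),
   by = name]

# Convert to character if a factor is not wanted
dt[, ict_chr := as.character(ict)]
```

Adding a category then only means adding one break and one label, rather than nesting another ifelse.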
Upvotes: 2