Reputation: 343
I want to number the letters in a large dataset. Some letters occur multiple times and are numbered ("A1", "A2"), others also occur multiple times but are not numbered. There are also letters that occur only once... but maybe it's easier to look at the example data below.
The numbers in df$nr are the desired result. How can I get df$nr from df$word and df$letter?
library(tibble)
df <- tibble(word = c(rep("Amamam", 17), rep("Bobob", 14)),
             letter = c("A1", "A1", "A1", "A1", "A2", "A2", "m", "m", "m", "a", "a", "m", "m", "a", "a", "m", "m",
                        "B1", "B1", "B2", "B2", "B3", "B3", "o", "b", "b", "b", "o", "o", "o", "b"),
             nr = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
                    1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 5))
Upvotes: 6
Views: 62
Reputation: 887481
We can group by 'word', remove the numeric part from the 'letter' column, and convert the result to a run-length id (rleid from data.table), which increments each time the value changes within a group.
library(dplyr)
library(stringr)
library(data.table)
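df1 <- df %>%
  group_by(word) %>%
  # strip the digits ("A1" -> "A"), then number each run of identical values
  mutate(nr1 = rleid(str_remove(letter, "\\d+")))
# compare with the desired result
all.equal(df1$nr, df1$nr1)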
#[1] TRUE
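If loading data.table is not an option, the same idea can be sketched in base R. This is only a sketch under assumptions: `run_id` is a hypothetical helper name (not part of any package), and the short `word` and `letter` vectors are a toy stand-in for the columns in the question.

```r
# Toy stand-in for df$word and df$letter (assumed, shortened data)
word   <- c("Amamam", "Amamam", "Amamam", "Amamam", "Bobob", "Bobob", "Bobob")
letter <- c("A1", "A2", "m", "a", "B1", "o", "o")

# Remove the numeric part so "A1" and "A2" both become "A"
base_letter <- sub("\\d+", "", letter)

# Hypothetical helper: number consecutive runs of identical values using rle()
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), times = r$lengths)
}

# Apply the helper within each word; ave() keeps the original row order
nr <- ave(seq_along(base_letter), word,
          FUN = function(i) run_id(base_letter[i]))
nr
#[1] 1 1 2 3 1 2 2
```

The numbering restarts for each word because `ave()` applies the helper to each group separately. With dplyr 1.1.0 or later, `consecutive_id()` should serve the same purpose as `rleid()` inside `mutate()`.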
Upvotes: 3