Reputation: 1285
I have a dataset of the following structure (dummy data, but similar to what I have):
data <- data.frame(msg = c("this is sample 1", "another text", "cats are cute", "another text", "", "...", "another text", "missing example case", "cats are cute"),
no = c(1, 15, 23, 9, 7, 5, 35, 67, 35),
pat = c(0.11, 0.45, 0.3, 0.2, 0.6, 0.890, 0.66, 0.01, 0))
I'm interested in the column msg. I need to label each row with TRUE or FALSE in a new column (namely, usable). This labelling has to be done on the following conditions:
- The msg cell is empty (NA or empty string) => FALSE
- The msg cell only has symbols (no letters, no numbers) => FALSE
- The msg value has already appeared in an earlier row (assuming rows are in ascending order) => FALSE. Note that the first occurrence will be TRUE, and the repeats will be FALSE.

I don't care about the other columns (they are irrelevant to the comparison), but I need to keep all of the columns in my end result. I did a very lengthy approach with a for loop, but I am looking for something shorter and better performing since the original dataset is long.
Upvotes: 0
Views: 51
Reputation: 1972
A tidyverse option. Note that map2_lgl is for convenience rather than speed.
library(dplyr)
library(purrr)
library(stringr)
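data %>%
  mutate(id = row_number(),
         usable = map2_lgl(msg, id,
                           ~ case_when(is.na(.x) | .x == '' ~ FALSE,            # missing or empty
                                       !str_detect(.x, '\\w') ~ FALSE,          # no letters, digits, or underscore
                                       .x %in% msg[seq_len(.y - 1)] ~ FALSE,    # already seen in an earlier row
                                       TRUE ~ TRUE))) %>%
  select(-id)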
#                    msg no  pat usable
# 1     this is sample 1  1 0.11   TRUE
# 2         another text 15 0.45   TRUE
# 3        cats are cute 23 0.30   TRUE
# 4         another text  9 0.20  FALSE
# 5                       7 0.60  FALSE
# 6                  ...  5 0.89  FALSE
# 7         another text 35 0.66  FALSE
# 8 missing example case 67 0.01   TRUE
# 9        cats are cute 35 0.00  FALSE
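Since the question asks for something that performs well on a long dataset, here is a rough, fully vectorized base R sketch of the same three rules. It is not the approach above, just an alternative under a couple of assumptions: it uses duplicated() for the "already seen in an earlier row" rule, and it uses the POSIX class [[:alnum:]] for "letters or numbers", which (unlike \\w) does not count an underscore as a usable character.

```r
# Same dummy data as in the question.
data <- data.frame(
  msg = c("this is sample 1", "another text", "cats are cute", "another text", "",
          "...", "another text", "missing example case", "cats are cute"),
  no  = c(1, 15, 23, 9, 7, 5, 35, 67, 35),
  pat = c(0.11, 0.45, 0.3, 0.2, 0.6, 0.890, 0.66, 0.01, 0)
)

# A row is usable when msg is not NA, contains at least one letter or digit
# (which also rules out empty and symbol-only strings), and has not appeared
# in an earlier row. duplicated() is FALSE for the first occurrence only.
data$usable <- !is.na(data$msg) &
  grepl("[[:alnum:]]", data$msg) &
  !duplicated(data$msg)

data$usable
```

On the dummy data this should give the same usable column as the output above. All other columns are kept, because the result is simply assigned as a new column.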
Upvotes: 2