Reputation: 23
I have a dataframe with multiple columns of strings. I want to
regex
My question has two parts.
First, I am using sapply
and grepl
and have managed to create a dataframe of TRUE/FALSE values for all of the columns in question, but am having trouble figuring out how to get to the "create a new single column" step. I tried rowwise
mutate but that's throwing errors that I honestly don't understand!
df <- data.frame(
idx = 1:5,
column_b = letters[1:5],
column_c = c('abc', 'abc', 'def', 'def', 'ghi'),
column_d = c('def', 'def', 'def', 'def', 'def'),
column_e = c('ghi', 'ghi', 'ghi', 'abc', 'ghi')
)
apply_factor <- function(df, factor, col_low, col_high, pattern) {
df %>%
rowwise() %>%
mutate(factor = sum(c_across(as.data.frame(sapply(select(df, {{col_low}}:{{col_high}}), grepl, pattern={{pattern}})))), na.rm = TRUE)
}
apply_factor(df, factor = 'abc', 'column_c', 'column_e', pattern = "^abc")
(double curly braces {{}}
in the function due to something or other about dplyr
I saw in another question...)
Console responds:
Error in `mutate()`:
! Problem while computing `factor = sum(...)`.
i The error occurred in row 1.
Caused by error in `as_indices_impl()`:
! Must subset columns with a valid subscript vector.
x Subscript has the wrong type `data.frame<
column_c: logical
column_d: logical
column_e: logical
>`.
i It must be numeric or character.
If I run just the as.data.frame() part it gives me something that looks pretty good. I just need to make it into a new column!
> as.data.frame(sapply(select(df, 'column_c':'column_e'), grepl, pattern = "^abc"))
column_c column_d column_e
1 TRUE FALSE FALSE
2 TRUE FALSE FALSE
3 FALSE FALSE FALSE
4 FALSE FALSE TRUE
5 FALSE FALSE FALSE
Second, I need to extend this function to a much larger dataset with many millions of rows and between 30 and 100 string columns. The large and variable number of columns, by the way, is why I need this function to be able to take any range of columns instead of listing out every possible column. Since processing time will be somewhat of a factor: is what I am doing here going to be massively inefficient? Is there an easier, faster way to accomplish this?
I've tried variations on sum, summarise
, and any, but honestly it's very likely I did it wrong.
I also looked at this post, but I can't limit to using %in%
as I need regex (I'm pretty sure I can't use regex with %in%...)
(I am relearning R after a long stint away in Stata-land, so please forgive any really obvious blunders or terminology weirdness!)
Upvotes: 2
Views: 101
Reputation: 35594
grepl
is vectorized, so you can use it with if_any()
, which is much faster than rowwise
.
df %>%
mutate(abc = if_any(column_c:column_e, ~ grepl("^abc", .x)))
# idx column_b column_c column_d column_e abc
# 1 1 a abc def ghi TRUE
# 2 2 b abc def ghi TRUE
# 3 3 c def def ghi FALSE
# 4 4 d def def abc TRUE
# 5 5 e ghi def ghi FALSE
To wrap the code in a function:
ifany_grepl <- function(data, col, lower, upper, pattern) {
data %>%
mutate(!!sym(col) := if_any({{lower}}:{{upper}}, ~ grepl(pattern, .x)))
}
ifany_grepl(df, 'abc', 'column_c', 'column_e', "^abc")
Upvotes: 1