Reputation: 2111
I know that in R, for loops should be avoided and vectorized operations used instead.
I want to solve this with a for loop first, then try the apply family, and then Rcpp.
I load a dataset containing one column of passwords (alphanumeric).
Once loaded (a sample, for speed), I want to create new columns with values 0 or 1 based on conditions such as "contains_lower_chars", "contains_numbers" and so on.
Here is what I tried, but it doesn't work: every column I create ends up with the same values.
library(tidyverse)
set.seed(123)
# load dataset from url, skip the first 16 rows
df <- read.csv('http://datashaping.com/passwords.txt', header = F, skip = 16) %>%
  sample_frac(.001) %>%
  rename(password = V1)
patterns = c("[a-z]","[A-Z]","[0-9]+")
df$has_lower <- 0
df$has_upper <- 0
df$has_numeric <- 0
for(i in 1:nrow(df)){
  for(j in patterns){
    n <- ifelse(grepl(j, df$password[i]),1,0)
  }
  df$has_lower[i] <- n
  df$has_upper[i] <- n
  df$has_numeric[i] <- n
}
Output I have in mind is:
password    has_lower  has_upper  has_numeric
Bigmaccas           1          1            0
0127515559          0          0            1
dbqky73p            1          0            1
Upvotes: 3
Views: 1351
Reputation: 680
First, you need to update has_lower, has_upper and has_numeric inside the j loop. Otherwise n is overwritten on every pass, so after the inner loop it only holds the result for the last pattern, and that same value is written to all three columns.
To do so, you need to loop over the names of the columns has_lower, has_upper and has_numeric:
names <- c("has_lower","has_upper","has_numeric")
for(i in 1:nrow(df)){
  for(j in seq_along(patterns)){
    # use patterns[j] (the regex), not j (the index), as the pattern
    df[i, names[j]] <- as.numeric(grepl(patterns[j], df$password[i]))
  }
}
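To check the corrected loop without downloading the full file, here is a minimal sketch on the three passwords from the question's expected output. The small `pw` data frame and the `cols` vector are made-up names for illustration, standing in for `df` and `names` above.

```r
# Hypothetical three-row sample taken from the question's expected output
pw <- data.frame(password = c("Bigmaccas", "0127515559", "dbqky73p"),
                 stringsAsFactors = FALSE)
patterns <- c("[a-z]", "[A-Z]", "[0-9]+")
cols <- c("has_lower", "has_upper", "has_numeric")

# Pre-allocate the indicator columns with zeros
pw[cols] <- 0

for (i in seq_len(nrow(pw))) {
  for (j in seq_along(patterns)) {
    # Each pattern's result is written to its own column inside the inner loop
    pw[i, cols[j]] <- as.numeric(grepl(patterns[j], pw$password[i]))
  }
}

print(pw)
```

Because the assignment happens inside the inner loop, each column receives the result of its own pattern instead of the last value of n.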
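A quicker, more compact alternative uses lapply and the fact that grepl is already vectorized.
The := syntax belongs to the data.table package, so df has to be converted to a data.table first:
setDT(df)  # requires library(data.table)
df[, c("has_lower","has_upper","has_numeric") := lapply(patterns, function(x) grepl(x, df$password))]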
Note (nothing to do with your question):
I advise you to use the fread function from the data.table package to read your dataset, since it is quite large.
df = fread('http://datashaping.com/passwords.txt', header = F, skip = 16) %>%
  sample_frac(.001) %>%
  rename(password = V1)
Upvotes: 0
Reputation: 11738
A data frame is above all a list.
So, you can simply do:
df[c("has_lower", "has_upper", "has_numeric")] <-
  lapply(patterns, function(pattern) grepl(pattern, df$password) + 0)
Use + 0L instead of + 0 if you want integers instead of doubles (I would recommend doing nothing and keeping the logicals).
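To make the difference between the three options concrete, here is a minimal sketch on the three sample passwords from the question. The `pw` data frame and the intermediate names (`as_logical`, `as_double`, `as_integer`) are assumptions for illustration only.

```r
# Hypothetical three-row sample taken from the question's expected output
pw <- data.frame(password = c("Bigmaccas", "0127515559", "dbqky73p"),
                 stringsAsFactors = FALSE)
patterns <- c("[a-z]", "[A-Z]", "[0-9]+")
cols <- c("has_lower", "has_upper", "has_numeric")

# grepl is vectorized over the passwords, so each call returns a full column
as_logical <- lapply(patterns, function(pattern) grepl(pattern, pw$password))
as_double  <- lapply(as_logical, function(x) x + 0)   # TRUE/FALSE -> 1/0 (double)
as_integer <- lapply(as_logical, function(x) x + 0L)  # TRUE/FALSE -> 1L/0L (integer)

# A data frame is a list of columns, so a list of vectors can be assigned directly
pw[cols] <- as_integer

str(pw)
```

Assigning `as_logical` or `as_double` instead works the same way; only the column type changes.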
Upvotes: 0
Reputation: 206606
We can simplify things if we just name your pattern vector. For example
patterns = c(has_lower="[a-z]",
             has_upper="[A-Z]",
             has_numeric="[0-9]+")
for(pattern in names(patterns)) {
  df[, pattern] = as.numeric(grepl(patterns[pattern], df$password))
}
Basically we just loop through each of the names, grab the regular expression corresponding to that name, then do the matching and add the column.
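Since the question also mentions the apply family, the same named vector can be used without an explicit loop. This is only a sketch of one possible variant, assuming a made-up `pw` data frame with the three sample passwords in place of the real data; `hits` is likewise a hypothetical name.

```r
# Hypothetical three-row sample taken from the question's expected output
pw <- data.frame(password = c("Bigmaccas", "0127515559", "dbqky73p"),
                 stringsAsFactors = FALSE)
patterns <- c(has_lower = "[a-z]",
              has_upper = "[A-Z]",
              has_numeric = "[0-9]+")

# One logical column per pattern; vapply keeps the names of `patterns`
# as the column names of the resulting matrix
hits <- vapply(patterns,
               function(p) grepl(p, pw$password),
               logical(nrow(pw)))

# Adding 0 converts TRUE/FALSE to 1/0; cbind attaches the named columns
pw <- cbind(pw, hits + 0)

print(pw)
```

The names of the pattern vector become the column names, so no separate vector of column names is needed.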
Upvotes: 1