vaaaaaal

Reputation: 29

Finding data frame rows that contain a certain character only once

Sorry if this is a duplicate, but I don't really know how to phrase my request. I work in R and would like to identify data frame cells that contain a certain character exactly once.

In my df I have a column a that contains formulas stored as strings, e.g.

#  a  
1  y~x1+x2  
2  y~x2+x3  
3  y~x1+x2+x3  
4  y~x2+x4 
5  y~x1+x3+x4

and I would like to keep the rows whose formulas in column a have 2 explanatory variables, i.e. that contain exactly one "+". The idea would be to filter and to add a kind of dummy, so that the output would look like

#  a           b   
1  y~x1+x2     1   
2  y~x2+x3     1   
3  y~x1+x2+x3  0   
4  y~x2+x4     1   
5  y~x1+x3+x4  0 

Hope that's clear enough. Thanks for helping,
Val

Upvotes: 1

Views: 235

Answers (3)

lroha

Reputation: 34441

A third base alternative, assuming there are always at least two predictors in the formula.

df$b <- +(!grepl("\\+.*\\+", df$a))

df
           a b
1    y~x1+x2 1
2    y~x2+x3 1
3 y~x1+x2+x3 0
4    y~x2+x4 1
5 y~x1+x3+x4 0
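
The assumption matters: a formula with a single predictor contains no "+" at all, so the pattern above does not match and the row is still flagged 1. A minimal sketch of a stricter variant that also requires at least one "+", using a hypothetical vector f that is not part of the question's data:

```r
# Hypothetical input; "y~x1" has one predictor and therefore no "+"
f <- c("y~x1", "y~x1+x2", "y~x1+x2+x3")

# Original check: 1 unless two or more "+" are present
loose <- +(!grepl("\\+.*\\+", f))

# Stricter check: at least one "+" and not two or more
strict <- +(grepl("+", f, fixed = TRUE) & !grepl("\\+.*\\+", f))

loose   # 1 1 0
strict  # 0 1 0
```

If every formula is known to have two or more predictors, as in the question, both checks agree and the shorter one is enough.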

Upvotes: 1

GKi

Reputation: 39657

You can use gsub with [^+] to remove every character except + and nchar to count the ones that remain.

x$b <- +(nchar(gsub("[^+]", "", x$a)) == 1)
x
#           a b
#1    y~x1+x2 1
#2    y~x2+x3 1
#3 y~x1+x2+x3 0
#4    y~x2+x4 1
#5 y~x1+x3+x4 0

Or use gregexpr:

lapply(gregexpr("\\+", x$a), length) == 1
#[1]  TRUE  TRUE FALSE  TRUE FALSE

Or using it with lengths as suggested by @ThomasIsCoding:

lengths(gregexpr("\\+", x$a)) == 1
#[1]  TRUE  TRUE FALSE  TRUE FALSE

Or using grepl:

grepl("^[^+]*\\+[^+]*$", x$a)
#[1]  TRUE  TRUE FALSE  TRUE FALSE

Or with strsplit:

sapply(strsplit(x$a, ""), function(y) sum(y == "+")==1)
#[1]  TRUE  TRUE FALSE  TRUE FALSE

Data:

x <- read.table(header=TRUE, text="a
1  y~x1+x2
2  y~x2+x3
3  y~x1+x2+x3
4  y~x2+x4
5  y~x1+x3+x4", stringsAsFactors = FALSE)
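
One caveat for the gregexpr variants: when a string has no match, gregexpr returns -1, which still has length 1, so a formula without any "+" would be counted as having one. A sketch of a count that avoids this by passing the result through regmatches; the helper name count_plus and the vector f are made up for illustration:

```r
# Hypothetical helper: number of "+" characters in each string
count_plus <- function(s) {
  lengths(regmatches(s, gregexpr("+", s, fixed = TRUE)))
}

# Hypothetical input; the first formula contains no "+"
f <- c("y~x1", "y~x1+x2", "y~x1+x2+x3")

# The -1 "no match" marker has length 1, so the first element is TRUE
lengths(gregexpr("\\+", f)) == 1

# regmatches returns character(0) for no match, so the first element is FALSE
count_plus(f) == 1
```

With the data in the question every formula contains at least one "+", so both expressions give the same result there.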

Upvotes: 3

ThomasIsCoding

Reputation: 101335

Another base R solution is using gregexpr, i.e.,

df$b <- +(lengths(gregexpr("\\+",df$a))==1)

which gives

> df
           a b
1    y~x1+x2 1
2    y~x2+x3 1
3 y~x1+x2+x3 0
4    y~x2+x4 1
5 y~x1+x3+x4 0
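
Since the question also asks to keep only the matching rows, the dummy can then be used to subset. A minimal sketch on the question's data; the name df_two is made up here:

```r
df <- data.frame(
  a = c("y~x1+x2", "y~x2+x3", "y~x1+x2+x3", "y~x2+x4", "y~x1+x3+x4"),
  stringsAsFactors = FALSE
)

# Flag formulas containing exactly one "+" (every row here has at least one)
df$b <- +(lengths(gregexpr("\\+", df$a)) == 1)

# Keep only the flagged rows; drop = FALSE keeps the result a data frame
df_two <- df[df$b == 1, , drop = FALSE]
df_two$a
```

This leaves the three two-predictor formulas, rows 1, 2 and 4.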

DATA

df <- structure(list(a = c("y~x1+x2", "y~x2+x3", "y~x1+x2+x3", "y~x2+x4", 
"y~x1+x3+x4")), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5"))

Upvotes: 1
