R dataframe regular expression

Question

In the following example data frame:

# generate example data frame
data <- data.frame(matrix(data=c("a","b","c","d","e","f"), nrow=70, ncol=5))
data <- apply(data,1, function(x) {paste(x, collapse = " > ")})
data <- data.frame(id=1:length(data), x = data)
data$x <- as.character(data$x)

> head(data)
  id                 x
1  1 a > e > c > a > e
2  2 b > f > d > b > f
3  3 c > a > e > c > a
4  4 d > b > f > d > b
5  5 e > c > a > e > c
6  6 f > d > b > f > d

Some of the attributes in column x are known in advance, but not all of them.

The attributes which are known will be replaced with individual names. In the example the set of known attributes is {"a","c","f"}.

All attributes that do not belong to this set are not known in advance and should be replaced by NA.

Step 1: Replace attributes {"a","c","f"}

# substitute each known attribute with its corresponding name
data$x <- gsub("a", "Anton",data$x)
data$x <- gsub("c", "Chris",data$x)
data$x <- gsub("f", "Flo",data$x)

The data frame now looks like this:

> head(data)
  id                                 x
1  1     Anton > e > Chris > Anton > e
2  2             b > Flo > d > b > Flo
3  3 Chris > Anton > e > Chris > Anton
4  4               d > b > Flo > d > b
5  5     e > Chris > Anton > e > Chris
6  6             Flo > d > b > Flo > d

Step 2: Replace all attributes other than {"Anton", "Chris", "Flo"} with NA

This is where I need help.

My idea is to make use of regular expressions and replace every value/character string that is not in {"Anton", "Chris", "Flo", ">"} with "NA".

In my real problem I don't know the values {"b","d","e"} in advance, and the attributes can be any value or word with length greater than 1. Moreover, the set of unknown values can change over time, so if the function is executed again later there may be new unknown values.
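The regular-expression idea described above could be sketched in base R roughly as follows. This is only an illustration under the assumption that every attribute consists of word characters (letters, digits, underscore); the names `known` and `pattern` are made up for the example.

```r
# Names that must be kept; every other word is treated as unknown.
# Assumption: attributes contain only word characters.
known <- c("Anton", "Chris", "Flo")

# Negative lookahead: match a whole word only if it is not one of the known names.
pattern <- paste0("\\b(?!(?:", paste(known, collapse = "|"), ")\\b)\\w+")

x <- c("Anton > e > Chris > Anton > e", "b > Flo > d > b > Flo")

# perl = TRUE is needed because lookaheads are a PCRE feature.
result <- gsub(pattern, "NA", x, perl = TRUE)
print(result)
```

Because the pattern never names the unknown values, it should not need to change when new unknown values show up later.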

Result: The resulting data frame should look like:

> head(data)
  id                                  x
1  1    Anton > NA > Chris > Anton > NA
2  2           NA > Flo > NA > NA > Flo
3  3 Chris > Anton > NA > Chris > Anton
4  4            NA > NA > Flo > NA > NA
5  5    NA > Chris > Anton > NA > Chris
6  6           Flo > NA > NA > Flo > NA

Any help is appreciated!

akrun · Accepted Answer

You could try mgsub from the qdap package:

library(qdap)
data$x <- mgsub(c('a', 'c', 'f', 'd', 'e', 'b'),
      c('Anton', 'Chris', 'Flo', 'NA', 'NA', 'NA'), data$x)
head(data,3)
#  id                                  x
#1  1    Anton > NA > Chris > Anton > NA
#2  2           NA > Flo > NA > NA > Flo
#3  3 Chris > Anton > NA > Chris > Anton

Update

If only the elements to be replaced ("v1") and their replacements ("v3") are known, the remaining elements ("v2") can be obtained by removing the "v1" elements and the "punct" characters from the "x" column with gsub and collecting what is left. These vectors can then be fed into mgsub:

v1 <-  c('a', 'c', 'f')
v2 <- unique(scan(text=gsub(paste(c(v1,"[[:punct:]]+"),
    collapse="|"), "", data$x), what='', quiet=TRUE))

v3 <- c('Anton', 'Chris', 'Flo')
data$x <- mgsub(c(v1, v2), c(v3, rep("NA", length(v2))), data$x)
head(data,3)
#  id                                  x
#1  1    Anton > NA > Chris > Anton > NA
#2  2           NA > Flo > NA > NA > Flo
#3  3 Chris > Anton > NA > Chris > Anton
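The step that derives "v2" in this update can also be sketched by splitting on the separator instead of deleting characters. This is an alternative illustration, not the code from the answer; the sample vector `x` and the name `tokens` are assumptions for the example. Splitting may be safer once attributes are whole words, because removing "v1" with gsub would also remove matching characters inside longer attributes.

```r
x  <- c("a > e > c > a > e", "b > f > d > b > f")  # sample rows from the question
v1 <- c("a", "c", "f")                             # known attributes

# Split on the literal separator and collect the distinct attributes.
tokens <- unique(unlist(strsplit(x, " > ", fixed = TRUE)))

# Everything that is not in the known set is unknown.
v2 <- setdiff(tokens, v1)
print(v2)
```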

Update2

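You could also do this without any external packages. Starting again from the original data$x, name the replacement vector by the known elements and index it with the split values; anything without a match becomes NA:

names(v3) <- v1
data$x <- sapply(strsplit(data$x, ' > '), function(x)
                paste(v3[x], collapse=" > "))
head(data,3)
#  id                                  x
#1  1    Anton > NA > Chris > Anton > NA
#2  2           NA > Flo > NA > NA > Flo
#3  3 Chris > Anton > NA > Chris > Anton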
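Since the unknown values can change between runs, the lookup approach above could be wrapped in a small function. The helper name `relabel` and its arguments are hypothetical, not part of the original answer; this is a minimal sketch that assumes the separator is always " > ".

```r
# Hypothetical helper: maps known attributes to names and turns
# every other attribute into the string "NA".
relabel <- function(x, lookup, sep = " > ") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) {
    out <- unname(lookup[p])    # attributes missing from lookup yield NA
    paste(out, collapse = sep)  # paste() writes NA as the string "NA"
  }, character(1))
}

lookup <- c(a = "Anton", c = "Chris", f = "Flo")
res <- relabel(c("a > e > c", "zzz > f > new_value"), lookup)
print(res)
```

Only the known attributes appear in `lookup`, so values that were never seen before fall through to NA without any change to the code.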
