Reputation: 675
In the following example data frame:
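# generate example data frame
# rep_len() recycles the six letters explicitly, which avoids matrix()'s length-mismatch warning
data <- data.frame(matrix(rep_len(c("a","b","c","d","e","f"), 70 * 5), nrow=70, ncol=5))
# collapse each row into a single " > "-separated string
data <- apply(data, 1, function(x) paste(x, collapse = " > "))
data <- data.frame(id = seq_along(data), x = data)
data$x <- as.character(data$x)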
> head(data)
id x
1 1 a > e > c > a > e
2 2 b > f > d > b > f
3 3 c > a > e > c > a
4 4 d > b > f > d > b
5 5 e > c > a > e > c
6 6 f > d > b > f > d
Some of the attributes in column x are known in advance, but not all of them.
The known attributes will be replaced with individual names. In the example, the set of known attributes is {"a","c","f"}.
All attributes that do not belong to this set are not known in advance and should be replaced by NA.
Step 1: Replace attributes {"a","c","f"}
# substitute all known attributes with the corresponding names
data$x <- gsub("a", "Anton",data$x)
data$x <- gsub("c", "Chris",data$x)
data$x <- gsub("f", "Flo",data$x)
The data frame now looks like this:
> head(data)
id x
1 1 Anton > e > Chris > Anton > e
2 2 b > Flo > d > b > Flo
3 3 Chris > Anton > e > Chris > Anton
4 4 d > b > Flo > d > b
5 5 e > Chris > Anton > e > Chris
6 6 Flo > d > b > Flo > d
Step 2: Replace all attributes other than {"Anton", "Chris", "Flo"} with NA
This is where I need help.
My idea is to use regular expressions and replace every value/character string that is not in {"Anton", "Chris", "Flo", ">"} with "NA".
In my real problem I don't know the values {"b","d","e"} in advance, and the attributes can be any value or word longer than one character. Moreover, the set of unknown values can change over time, so when the function is run again later there may be new unknown values.
Result: The resulting data frame should look like:
> head(data)
id x
1 1 Anton > NA > Chris > Anton > NA
2 2 NA > Flo > NA > NA > Flo
3 3 Chris > Anton > NA > Chris > Anton
4 4 NA > NA > Flo > NA > NA
5 5 NA > Chris > Anton > NA > Chris
6 6 Flo > NA > NA > Flo > NA
Any help is appreciated!
Upvotes: 1
Views: 728
Reputation: 887148
You could try mgsub from qdap:
library(qdap)
data$x <- mgsub(c('a', 'c', 'f', 'd', 'e', 'b'),
c('Anton', 'Chris', 'Flo', 'NA', 'NA', 'NA'), data$x)
head(data,3)
# id x
#1 1 Anton > NA > Chris > Anton > NA
#2 2 NA > Flo > NA > NA > Flo
#3 3 Chris > Anton > NA > Chris > Anton
Suppose we know only the elements to be replaced ("v1") and their replacements ("v3"). We can then get the remaining elements ("v2") by using gsub to remove the "v1" elements and the "punct" characters from the "x" column, and feed all of this into mgsub. This assumes data$x still holds the original values:
v1 <- c('a', 'c', 'f')
v2 <- unique(scan(text=gsub(paste(c(v1,"[[:punct:]]+"),
collapse="|"), "", data$x), what='', quiet=TRUE))
v3 <- c('Anton', 'Chris', 'Flo')
data$x <- mgsub(c(v1, v2), c(v3, rep("NA", length(v2))), data$x)
head(data,3)
# id x
#1 1 Anton > NA > Chris > Anton > NA
#2 2 NA > Flo > NA > NA > Flo
#3 3 Chris > Anton > NA > Chris > Anton
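As a cross-check on "v2", the unknown values could also be collected by splitting on the separator instead of deleting characters with a regex. This is only a sketch: it assumes the tokens are always separated by " > ", and `x_raw`, `known` and `unknown` are illustrative names, with `x_raw` standing in for the column before any replacement.

```r
# x_raw is a hypothetical stand-in for data$x before any replacement
x_raw <- c("a > e > c > a > e", "b > f > d > b > f")
known <- c("a", "c", "f")

# split every string into its tokens and pool them into one vector
tokens <- unlist(strsplit(x_raw, " > ", fixed = TRUE))

# every distinct token that is not a known attribute counts as unknown
unknown <- setdiff(unique(tokens), known)
unknown
```

Because whole tokens are compared, this should keep working when a known attribute happens to be a substring of an unknown word, which the character-deleting gsub would mangle.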
You could also do this without using any external packages, again starting from the original data$x and the "v1" and "v3" defined above:
names(v3) <- v1
data$x <- sapply(strsplit(data$x, ' > '), function(x)
paste(v3[x], collapse=" > "))
head(data,3)
# id x
#1 1 Anton > NA > Chris > Anton > NA
#2 2 NA > Flo > NA > NA > Flo
#3 3 Chris > Anton > NA > Chris > Anton
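Since the set of unknown values can change between runs, the named-vector lookup might be wrapped in a small helper that never needs to know them. This is a minimal sketch, not a tested solution for the real data; `replace_known`, its arguments and the sample input are made-up names for illustration.

```r
replace_known <- function(x, lookup, sep = " > ") {
  # lookup: named character vector (names = known attributes, values = replacements)
  vapply(strsplit(x, sep, fixed = TRUE), function(tokens) {
    # indexing by name yields NA for every token that is not a name of lookup;
    # paste() then turns that NA into the string "NA"
    paste(unname(lookup[tokens]), collapse = sep)
  }, character(1))
}

lookup <- c(a = "Anton", c = "Chris", f = "Flo")
res <- replace_known(c("a > e > c > a > e", "b > f > d > b > f", "zz > f"), lookup)
res
```

Tokens are compared as whole strings, so multi-character attributes such as "zz" are handled the same way as single letters.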
Upvotes: 3
Reputation: 269654
This one-liner matches each word character against the names of the indicated list and replaces matches with the values associated with that name. If there is no match, then NA is used as the replacement value:
library(gsubfn)
data$x <- gsubfn("\\w", list(a = "Anton", c = "Chris", f = "Flo", NA), data$x)
giving:
> head(data)
id x
1 1 Anton > NA > Chris > Anton > NA
2 2 NA > Flo > NA > NA > Flo
3 3 Chris > Anton > NA > Chris > Anton
4 4 NA > NA > Flo > NA > NA
5 5 NA > Chris > Anton > NA > Chris
6 6 Flo > NA > NA > Flo > NA
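Note that `\\w` matches a single word character, which fits the one-letter example; for longer words a quantified pattern would presumably be needed. For the situation described in Step 2 of the question, where the known names are already substituted and the unknown words may be longer, a base-R sketch of the regex idea is a negative lookahead. It assumes every attribute is a run of word characters and that the known names contain no regex metacharacters; the sample input is made up.

```r
known <- c("Anton", "Chris", "Flo")

# builds \b(?!(?:Anton|Chris|Flo)\b)\w+ : a whole word that is not exactly a known name
pattern <- paste0("\\b(?!(?:", paste(known, collapse = "|"), ")\\b)\\w+")

x <- c("Anton > e > Chris > Anton > e", "bob > Flo > dave > Antonio")

# perl = TRUE is required because lookaheads are a PCRE feature
res <- gsub(pattern, "NA", x, perl = TRUE)
res
```

The trailing `\\b` inside the lookahead is what keeps a longer word such as "Antonio" from being mistaken for the known name "Anton".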
Upvotes: 1