Reputation: 681
I have a data frame with a factor column and one value column per factor level, including columns for levels that no longer occur in the data frame. Example:
x <- data.frame(f = toupper(sample(letters[1:3], 5, replace = TRUE)),
                x.A = 1:5,
                x.B = 1:5,
                x.C = 1:5,
                x.D = 1:5,
                x.E = 1:5)
Resulting in:
  f x.A x.B x.C x.D x.E
1 B   1   1   1   1   1
2 B   2   2   2   2   2
3 A   3   3   3   3   3
4 C   4   4   4   4   4
5 A   5   5   5   5   5
Now I want to remove all columns that do not represent a current level in column f, resulting in a data frame:
  f x.A x.B x.C
1 B   1   1   1
2 B   2   2   2
3 A   3   3   3
4 C   4   4   4
5 A   5   5   5
Naming convention is consistent among levels and column names, and names always take the form somevariable.FACTORLEVEL. I would type all the names in a list to choose from, but it gets long and unwieldy. I tried using grep as follows:
subX <- x[x$f == 'B', grep('B', names(x))]
But I don't quite get what I want, and I don't know how to extend that over all levels even if it did work. I also looked at previous questions here and here, but they don't go as far as I need. Any help would be appreciated. Thanks.
Upvotes: 1
Views: 116
Reputation: 887128
We use sub to remove the prefix x. from the column names of 'x' and check whether each result is %in% the 'f' column, which gives a logical vector for subsetting the columns of 'x'. The first column name is dropped before matching (as it is 'f'), and TRUE is then concatenated at the front so that this column is also kept in the subset.
x[c(TRUE,sub('.*\\.', '', names(x)[-1]) %in% x$f)]
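As a minimal check of the sub approach, the sketch below uses a fixed 'f' column (the values from the question's printed output) instead of sample, so the result is reproducible. The names suffix, keep and result are introduced here for illustration only.

```r
# Fixed version of the question's data (no sampling), matching its printed output
x <- data.frame(f = c("B", "B", "A", "C", "A"),
                x.A = 1:5, x.B = 1:5, x.C = 1:5, x.D = 1:5, x.E = 1:5)

# Strip everything up to the last dot, leaving only the level suffix
suffix <- sub(".*\\.", "", names(x)[-1])   # "A" "B" "C" "D" "E"

# Keep 'f' itself (TRUE) plus every column whose suffix occurs in f
keep <- c(TRUE, suffix %in% x$f)
result <- x[keep]

names(result)   # "f" "x.A" "x.B" "x.C"
```

Because the pattern removes everything up to the last dot, this should also work when the part before the dot differs between columns, as long as the names follow the somevariable.FACTORLEVEL convention.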
Or we can use grepl with a pattern built by pasting the values of the 'f' column together with '|', which returns a logical index as before.
x[c(TRUE,grepl(paste(x$f, collapse='|'), names(x)[-1]))]
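One caveat with the grepl version: an alternation such as B|A|C matches those letters anywhere in a column name, so a variable part that happens to contain a level would be kept by mistake. A possible tightening, assuming the question's somevariable.FACTORLEVEL convention and levels without regex metacharacters, is to anchor the alternation to the final dot and the end of the name. The column Abc.D and the names loose, strict and pattern are hypothetical, added only to show the difference.

```r
x <- data.frame(f = c("B", "B", "A", "C", "A"),
                x.A = 1:5, x.B = 1:5, x.C = 1:5, x.D = 1:5,
                Abc.D = 1:5)   # hypothetical column: contains "A" but belongs to level D

# Unanchored: "Abc.D" is matched because of the "A" in its variable part
loose <- grepl(paste(unique(x$f), collapse = "|"), names(x)[-1])

# Anchored: the level must directly follow a dot and end the name
pattern <- paste0("\\.(", paste(unique(x$f), collapse = "|"), ")$")
strict <- grepl(pattern, names(x)[-1])

names(x)[c(TRUE, loose)]    # includes "Abc.D"
names(x)[c(TRUE, strict)]   # "f" "x.A" "x.B" "x.C"
```

Using unique(x$f) only shortens the pattern; the matches are the same as with the full column.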
Upvotes: 2
Reputation: 29
This would also work.
x[c(TRUE, (sub("^x\\.", "", names(x)) %in% x$f)[-1])]
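A note on the pattern: in a regular expression an unescaped dot matches any character, so a pattern like "x." removes an x plus whatever character follows it, wherever that occurs in the name. The sketch below compares it with an escaped, anchored pattern; the column names max.A and xy.B are hypothetical and chosen only to show the difference.

```r
nms <- c("x.A", "max.A", "xy.B")   # hypothetical column names

# Unescaped dot: removes "x" plus any one following character, anywhere in the name
loose <- gsub("x.", "", nms)

# Escaped and anchored: removes only a literal leading "x."
strict <- sub("^x\\.", "", nms)

loose    # "A" "maA" ".B"
strict   # "A" "max.A" "xy.B"
```

On the question's example data both patterns give the same result, since every value column starts with a literal "x.".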
Upvotes: 2