Reputation: 1445
I am making a bioinformatics Shiny app that reads user-supplied group names from an Excel file. As these names may not be syntactically valid R names, I would like to represent them internally as valid names.
As an example, I can have this input:
(grps <- as.factor(c("T=0","T=0","T=4-","T=4+","T=4+")))
[1] T=0 T=0 T=4- T=4+ T=4+
Levels: T=0 T=4- T=4+
Ideally, I would like R to make valid names, but keep the groups/levels the same, for instance the following would be fine: "T.0" "T.0" "T.4minus" "T.4plus" "T.4plus"
When using make.names(), however, all invalid characters are converted to the same character:
(grps2 <- as.factor(make.names(grps)))
[1] T.0 T.0 T.4. T.4. T.4.
Levels: T.0 T.4.
So both T=4- and T=4+ are given the same name and a level is lost (which causes problems in subsequent analyses). Also, setting unique=TRUE does not solve the problem, because
(grps3 <- as.factor(make.names(grps,unique=TRUE)))
[1] T.0 T.0.1 T.4. T.4..1 T.4..2
Levels: T.0 T.0.1 T.4. T.4..1 T.4..2
and each repeated value gets its own suffix, so T=0 and T=4+ are each split into two groups and levels are gained (five instead of three).
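The effect on the number of levels can be checked directly. This is a minimal sketch using only base R and the example vector from above; it only counts levels and is not a solution:

```r
grps <- factor(c("T=0", "T=0", "T=4-", "T=4+", "T=4+"))

# Number of groups in the original input
n_original <- nlevels(grps)

# make.names() replaces every invalid character with ".", so distinct labels can collide
n_collapsed <- nlevels(factor(make.names(grps)))

# unique = TRUE de-duplicates element by element, so repeated values are split apart
n_split <- nlevels(factor(make.names(grps, unique = TRUE)))

c(original = n_original, collapsed = n_collapsed, split = n_split)
```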
Does anybody know a general way to convert the levels of a factor into valid names while keeping the same set of levels? Please keep in mind that user input can vary widely, so manually replacing "-" with "minus" is not an option here.
Thanks in advance for your help!
Upvotes: 0
Views: 1392
Reputation: 17289
The labels associated with the levels of a factor are not required to meet the same rules as object names. Consider the following example, where I rename the gear column of the mtcars data set, make it a factor, and give it the same level labels as in your example.
library(magrittr)
library(dplyr)
library(broom)
D <- mtcars[c("mpg", "gear")] %>%
  setNames(c("y", "grps")) %>%
  mutate(grps = factor(grps, levels = 3:5, labels = c("T=0", "T=4-", "T=4+")))
Notice that I am able to fit a linear model, get a summary, and coerce it to a data frame, all while the level names have the =, -, and + symbols in them.
fit <- lm(y ~ grps, data = D)
fit
Call:
lm(formula = y ~ grps, data = D)
Coefficients:
(Intercept) grpsT=4- grpsT=4+
16.107 8.427 5.273
summary(fit)
Call:
lm(formula = y ~ grps, data = D)
Residuals:
Min 1Q Median 3Q Max
-6.7333 -3.2333 -0.9067 2.8483 9.3667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.107 1.216 13.250 7.87e-14 ***
grpsT=4- 8.427 1.823 4.621 7.26e-05 ***
grpsT=4+ 5.273 2.431 2.169 0.0384 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.708 on 29 degrees of freedom
Multiple R-squared: 0.4292, Adjusted R-squared: 0.3898
F-statistic: 10.9 on 2 and 29 DF, p-value: 0.0002948
tidy(fit)
term estimate std.error statistic p.value
1 (Intercept) 16.106667 1.215611 13.249852 7.867272e-14
2 grpsT=4- 8.426667 1.823417 4.621361 7.257382e-05
3 grpsT=4+ 5.273333 2.431222 2.169005 3.842222e-02
So I'm left thinking that either
Upvotes: 1
Reputation: 2535
With the mapvalues function from plyr you can do:
require("plyr")
mapvalues(grps, levels(grps), make.names(levels(grps), unique=TRUE))
Since this works directly on the levels instead of on the individual values of the factor, each original level maps to exactly one valid name and the number of levels stays the same.
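The same idea can probably be written in base R without plyr by assigning to levels() directly. The sketch below is only an illustration: the names lookup, safe and original are hypothetical, and the lookup vector is an added convenience for translating back to the user's labels when displaying results. Note that the generated names depend on the order of the levels and are not descriptive (no "minus"/"plus").

```r
grps <- factor(c("T=0", "T=0", "T=4-", "T=4+", "T=4+"))

# Hypothetical lookup: names are the original labels, values are the valid names.
# unique = TRUE is applied to the levels only, so each level gets one distinct name.
lookup <- setNames(make.names(levels(grps), unique = TRUE), levels(grps))

# Rename the levels in place; the underlying grouping is untouched
safe <- grps
levels(safe) <- unname(lookup[levels(safe)])

# Same number of levels, and every level is now a syntactically valid name
same_count <- nlevels(safe) == nlevels(grps)
all_valid  <- all(levels(safe) == make.names(levels(safe)))

# Translate back to the original labels, e.g. for display in the app
original <- factor(safe, levels = unname(lookup), labels = names(lookup))
round_trip <- identical(as.character(original), as.character(grps))
```

Keeping the lookup next to the data means the internal names never have to be shown to the user.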
Upvotes: 2