MartijnVanAttekum

Reputation: 1445

Creating syntactically valid names from a factor in R while retaining levels

I am making a bioinformatics Shiny app that reads user-supplied group names from an Excel file. As these names are not necessarily syntactically valid, I would like to represent them internally as valid names.

As an example, I can have this input:

(grps <- as.factor(c("T=0","T=0","T=4-","T=4+","T=4+")))
[1] T=0  T=0  T=4- T=4+ T=4+
Levels: T=0 T=4- T=4+

Ideally, I would like R to make valid names but keep the groups/levels the same. For instance, the following would be fine: "T.0" "T.0" "T.4minus" "T.4plus" "T.4plus"

When using make.names(), however, all invalid characters are converted to the same character:

(grps2 <- as.factor(make.names(grps)))
[1] T.0  T.0  T.4. T.4. T.4.
Levels: T.0 T.4.

So both T=4- and T=4+ are given the same name and a level is lost (which causes problems in subsequent analyses). Also, setting unique=TRUE does not solve the problem, because

(grps3 <- as.factor(make.names(grps,unique=TRUE)))
[1] T.0    T.0.1  T.4.   T.4..1 T.4..2
Levels: T.0 T.0.1 T.4. T.4..1 T.4..2

Now every element gets its own name, so groups are split (T=4+, for example, becomes 2 different groups) and levels are gained.

Does anybody know a general way to convert the levels of a factor into valid names while keeping the same grouping? Please keep in mind that user input can vary widely, so manually replacing "-" with "minus" does not work here.

Thanks in advance for your help!

Upvotes: 0

Views: 1392

Answers (2)

Benjamin

Reputation: 17289

The labels associated with the levels of a factor are not required to be syntactically valid object names. Consider the following example, where I take the mpg and gear columns of the mtcars data set, rename them to y and grps, turn grps into a factor, and give it the same levels as in your example.

library(magrittr)
library(dplyr)
library(broom)
D <- mtcars[c("mpg", "gear")] %>%
  setNames(c("y", "grps")) %>%
  mutate(grps = factor(grps, 3:5, c("T=0", "T=4-", "T=4+")))

Notice that I am able to fit a linear model, get a summary, and convert it to a data frame, all while the level names have the =, -, and + symbols in them.

fit <- lm(y ~ grps, data = D)

fit
Call:
lm(formula = y ~ grps, data = D)

Coefficients:
(Intercept)     grpsT=4-     grpsT=4+  
     16.107        8.427        5.273  


summary(fit)

Call:
lm(formula = y ~ grps, data = D)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7333 -3.2333 -0.9067  2.8483  9.3667 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   16.107      1.216  13.250 7.87e-14 ***
grpsT=4-       8.427      1.823   4.621 7.26e-05 ***
grpsT=4+       5.273      2.431   2.169   0.0384 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.708 on 29 degrees of freedom
Multiple R-squared:  0.4292,    Adjusted R-squared:  0.3898 
F-statistic:  10.9 on 2 and 29 DF,  p-value: 0.0002948



tidy(fit)

         term  estimate std.error statistic      p.value
1 (Intercept) 16.106667  1.215611 13.249852 7.867272e-14
2    grpsT=4-  8.426667  1.823417  4.621361 7.257382e-05
3    grpsT=4+  5.273333  2.431222  2.169005 3.842222e-02

So I'm left thinking that either

  1. You're making things harder on yourself than you need to, or
  2. It isn't clear why you need to make the levels syntactically valid object names.

Upvotes: 1

snaut

Reputation: 2535

With the mapvalues function from plyr you can do:

require("plyr")
mapvalues(grps, levels(grps), make.names(levels(grps), unique=TRUE))

Since this works on the levels rather than on every element of the factor, each original level maps to exactly one new name, so the number of levels stays the same.
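The same idea can probably be done in base R without plyr, by assigning to levels() directly. Below is a minimal sketch under that assumption; `valid_levels`, `lookup`, `grps_valid` and `original` are hypothetical names introduced here for illustration, and the lookup vector is only a suggestion for getting back to the user-supplied labels when displaying results. Note that the generated names are mechanical (a suffix is appended to disambiguate clashes), not readable ones like "T.4minus".

```r
# Example input from the question
grps <- as.factor(c("T=0", "T=0", "T=4-", "T=4+", "T=4+"))

# Build valid, unique names from the levels only (not from every element)
valid_levels <- make.names(levels(grps), unique = TRUE)

# Hypothetical lookup: valid name -> original user-supplied label
lookup <- setNames(levels(grps), valid_levels)

# Relabel a copy of the factor; the underlying integer codes are untouched
grps_valid <- grps
levels(grps_valid) <- valid_levels

# The grouping is preserved: same number of levels as the input
nlevels(grps_valid) == nlevels(grps)

# Map back to the original labels, e.g. for plot legends or table headers
original <- unname(lookup[as.character(grps_valid)])
```

Because the level order from as.factor() depends on the locale's sort order, which of "T=4-" and "T=4+" receives the disambiguating suffix may differ between machines; the lookup vector avoids relying on that.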

Upvotes: 2
