Soren Christensen

Reputation: 376

datatypes are factors in R

I have a problem understanding data structures in R.

key_stats <- data.frame(X = character(),
                        Y = character())

I want to make a data frame and fill it with data. Here I try to make a data frame called key_stats, which I want to populate with text strings.

key_stats[1,1] <- "test"
key_stats[1,2] <- "test"

But no, it gives me a warning and does not fill the data.frame with text:

key_stats[1,2] <- "test" 
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "test") :
  invalid factor level, NA generated

What strikes me is that even though I have made it explicit that the columns of key_stats are character, R is changing the data type to factor.

The workaround is simple:

key_stats[, 1] <- as.character(key_stats[, 1])
key_stats[, 2] <- as.character(key_stats[, 2])

But what is going on? Why does R change the data type of the object?

Upvotes: 1

Views: 69

Answers (2)

Antonios

Reputation: 1939

@Tim Biegeleisen gave the most straightforward answer.

You might also want to consider moving from data frames to tibbles, which, among other things, do not convert character variables to factors by default:

library(dplyr)
key_stats <- tribble(~X,~Y,"test","test")

> str(key_stats)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  2 variables:
 $ X: chr "test"
 $ Y: chr "test"

Upvotes: 1

Tim Biegeleisen

Reputation: 520938

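Try creating the data frame with the stringsAsFactors option set to FALSE. In R versions before 4.0.0, data.frame() converts character columns to factors by default, which is why your columns ended up as factors: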

key_stats <- data.frame(X=character(),
                        Y=character(),
                        stringsAsFactors=FALSE)
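
As a quick check, here is a minimal sketch that repeats the assignments from the question on a data frame built this way and then inspects the column classes. The helper variable col_classes is only there for illustration.

```r
# Empty data frame with two character columns; passing stringsAsFactors = FALSE
# explicitly keeps them as character regardless of the R version's default.
key_stats <- data.frame(X = character(),
                        Y = character(),
                        stringsAsFactors = FALSE)

# The assignments from the question now store the strings directly.
key_stats[1, 1] <- "test"
key_stats[1, 2] <- "test"

# Both columns are still plain character vectors.
col_classes <- sapply(key_stats, class)
print(col_classes)
```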

Dealing with factors can be a big headache if you are just starting out with R. If you're wondering why factors even exist, it is a matter of storage efficiency and normalization of your data. Imagine you have a character column with a lot of repeated data. It is wasteful to store repetitive information. Factors help here because the column stores only an integer code for each value, with the actual text (the levels) being stored just once as an attribute of the factor.

Many other languages also have this concept, e.g. the enum type in Java or MySQL.
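
To make the codes-and-levels idea concrete, here is a small sketch using a made-up vector called colours (not part of the question). It shows where the text and the integer codes live, and why assigning an unknown string produces the warning from the question.

```r
# A character vector with repeated values, converted to a factor.
colours <- factor(c("red", "blue", "red", "red", "blue"))

# The distinct strings are stored once, as the levels (sorted alphabetically).
lvls <- levels(colours)        # "blue" "red"

# The vector itself holds integer codes that index into the levels.
codes <- as.integer(colours)   # 2 1 2 2 1

# "green" is not one of the levels, so this assignment stores NA and raises
# the "invalid factor level, NA generated" warning.
colours[1] <- "green"
first_is_na <- is.na(colours[1])
```

An empty factor column has no levels at all, so any string assigned to it counts as an invalid level.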

Upvotes: 3
