Reputation: 1677
On creating a column whose contents contain duplicate values, I notice the following with regard to factors.
1. If a column with duplicate character values is part of a data frame when the data frame is created, it is of class factor, but if the same column is appended later, it is of class character, even though the values are the same in both cases. Why is this?
#creating a data frame
name = c('waugh','waugh','smith')
age = c(21,21,27)
df = data.frame(name,age)
#adding a new column which has the same values as the 'name' column above, to the data frame
df$newcol = c('waugh','waugh','smith')
#you can see that the classes of the two are different though the values are the same
class(df$name)
## [1] "factor"
class(df$newcol)
## [1] "character"
2. Only the column that has duplicate alphabetic contents becomes a factor; if a column contains duplicate numeric values, it is not treated as a factor. Why is that? The numbers could very well be codes, such as 1 = Male and 0 = Female, in which case shouldn't the column be a factor?
class(df$name)
## [1] "factor"
class(df$age)
## [1] "numeric"
Upvotes: 0
Views: 1191
Reputation: 206401
This was basically answered in the comments, but I'll put the answer here to close out the question.
When you use data.frame() to create a data frame, the function processes the arguments you pass in before building the object. Specifically, it has a parameter named stringsAsFactors, which defaulted to TRUE in R versions before 4.0.0 (from R 4.0.0 onward the default is FALSE). When it is TRUE, every character vector you pass in is converted to a factor, since such values are normally treated as categorical variables in statistical tests, and a factor can be more efficient to store when many values are repeated. Numeric vectors are never converted, which is why age stays numeric; if a numeric column actually holds category codes, you have to convert it to a factor yourself.
df <- data.frame(name,age)
class(df$name)
# [1] "factor"
df <- data.frame(name,age, stringsAsFactors=FALSE)
class(df$name)
# [1] "character"
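Because the default differs between R versions, it can help to pass stringsAsFactors explicitly and check every column at once. The following is a minimal sketch, not part of the original answer; the names df_fac and df_chr are made up for illustration.

```r
name <- c('waugh', 'waugh', 'smith')
age <- c(21, 21, 27)

# Passing stringsAsFactors explicitly makes the result independent of the
# default of the R version in use
df_fac <- data.frame(name, age, stringsAsFactors = TRUE)
df_chr <- data.frame(name, age, stringsAsFactors = FALSE)

# sapply() applies class() to each column and returns a named character vector
classes_fac <- sapply(df_fac, class)  # name is "factor", age is "numeric"
classes_chr <- sapply(df_chr, class)  # name is "character", age is "numeric"
print(classes_fac)
print(classes_chr)
```

In both cases the age column stays numeric, since the argument only affects character vectors.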
Note that the data frame itself doesn't remember the stringsAsFactors value used during its construction; the value only matters while data.frame() is running. So if you add a column later by assigning it with the $<- syntax, no coercion happens. cbind() is different: when one of its arguments is a data frame, it calls data.frame() internally, so the default for stringsAsFactors applies to the new column as well.
df1 <- data.frame(name,age)
df2 <- data.frame(name,age, stringsAsFactors=FALSE)
df1$name2 <- name
df2$name2 <- name
df3 <- cbind(data.frame(name,age), name2=name)
class(df1$name2)
# [1] "character"
class(df2$name2)
# [1] "character"
# cbind() goes through data.frame(), so with the default stringsAsFactors=TRUE the new column is converted
class(df3$name2)
# [1] "factor"
If you want the added column to be a factor, you need to convert it to a factor yourself:
df = data.frame(name,age)
df$name2 <- factor(name)
class(df$name2)
# [1] "factor"
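The same explicit conversion covers the numeric case from the question, because data.frame() has no way of knowing that a number is a category code. Here is a minimal sketch, assuming a hypothetical sex_code column with the coding 1 = Male, 0 = Female (the column names and values are illustrative, not from the original data):

```r
age <- c(21, 21, 27)
sex_code <- c(1, 0, 1)  # hypothetical coding: 1 = Male, 0 = Female

df <- data.frame(age, sex_code)
class(df$sex_code)
# [1] "numeric"

# levels gives the codes in the desired order; labels gives the name for each code
df$sex <- factor(df$sex_code, levels = c(0, 1), labels = c("Female", "Male"))
class(df$sex)
# [1] "factor"
levels(df$sex)
# [1] "Female" "Male"
```

Spelling out levels and labels keeps the mapping from code to label explicit instead of relying on the sort order of the values.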
Upvotes: 1