Reputation: 69
I am dealing with a dataset containing US States FIPS codes coded as characters, where codes from 1 to 9 sometimes have a 0 prefix (01, 02,...). While trying to clean it up I came across the following issue:
library(dplyr)  # provides %>% and mutate()
test <- data.frame(fips = c(1,"01")) %>%
  mutate(fips = as.numeric(fips))
> test
fips
1 2
2 1
Here 1 is converted to 2, and "01" to 1. This annoying behavior disappears with a tibble:
test <- tibble(fips = c(1,"01")) %>%
mutate(fips = as.numeric(fips))
> test
# A tibble: 2 x 1
fips
<dbl>
1 1
2 1
Does anyone know what is going on? Thanks
Upvotes: 1
Views: 284
Reputation: 37641
This is a difference in the defaults for tibbles and data.frames. When you mix together strings and numbers as in c(1, "01"), R converts everything to a string.
c(1, "01")
[1] "1" "01"
The default behavior for data.frame (in R versions before 4.0.0) is to make strings into factors. If you look at the help page for data.frame, you will see the argument:
stringsAsFactors: ... The ‘factory-fresh’ default is TRUE
So data.frame makes c(1, "01") into a factor with two levels, "01" and "1":
T1 = data.frame(fips = c(1,"01"))
str(T1)
'data.frame': 2 obs. of 1 variable:
$ fips: Factor w/ 2 levels "01","1": 2 1
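Factors are stored as integer codes that index into the sorted levels. That is why you see 2 1 at the end of the above output of str(T1): "1" is the second level and "01" is the first. So if you convert the factor directly to a number, you get the codes 2 and 1 rather than the values 1 and 1.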
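To see the integer codes without involving data.frame at all, here is a minimal sketch that builds the factor explicitly with factor(), so the result does not depend on the R version or on the stringsAsFactors default. The variable name fips is just taken from the question.

```r
# Build the factor by hand; c(1, "01") is first coerced to c("1", "01").
fips <- factor(c(1, "01"))

levels(fips)      # "01" "1"  (levels are sorted as strings)
as.integer(fips)  # 2 1       (position of each element in levels)
as.numeric(fips)  # 2 1       (same codes, returned as doubles)
```

The last line is effectively what mutate(fips = as.numeric(fips)) does to the factor column in the question.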
You can get the behavior that you want either by making the data.frame more carefully with
T1 = data.frame(fips = c(1,"01"), stringsAsFactors=FALSE)
or by converting the factor to a string before converting to a number
fips = as.numeric(as.character(fips))
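Both fixes can be checked side by side in base R. This is only a sketch: the names T2, fips_factor, from_chr and from_levels are made up for illustration, and the factor is created explicitly so the example behaves the same on any R version.

```r
# Fix 1: keep the column as character when building the data.frame.
T2 <- data.frame(fips = c(1, "01"), stringsAsFactors = FALSE)
as.numeric(T2$fips)                       # 1 1

# Fix 2: the column is already a factor, so go through character first.
fips_factor <- factor(c(1, "01"))
from_chr <- as.numeric(as.character(fips_factor))
from_chr                                  # 1 1

# Equivalent idiom: convert the levels once, then index by the factor codes.
from_levels <- as.numeric(levels(fips_factor))[fips_factor]
from_levels                               # 1 1
```

The levels-based form converts each distinct level only once, which can matter on long columns with few distinct codes.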
Tibbles do not have this problem because they do not convert the strings to factors.
Upvotes: 5