Reputation: 925
I inherited some legacy R code that recodes the values in one column based on the value of another column in the same row. That other column was mistakenly thought to hold boolean values when, in reality, its values were (strings being converted to) factors, like so:
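# stringsAsFactors = TRUE was the default before R 4.0.0; it is set explicitly
# here so the example reproduces the factor column on current versions.
df <- data.frame(value = c(1, 2, 3, 4, 5, 6),
                 reversed = c("true", "false",
                              "true", "true",
                              "false", "false"),
                 stringsAsFactors = TRUE)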
str(df)
#> 'data.frame': 6 obs. of 2 variables:
#> $ value : num 1 2 3 4 5 6
#> $ reversed: Factor w/ 2 levels "false","true": 2 1 2 2 1 1
df$recoded_value <- df$value
df$recoded_value[df$reversed] <- 7 - df$recoded_value[df$reversed]
Inspecting the result shows that this does not do what was intended. `df[2, "recoded_value"]` is 5, but the intent is for it to be 2. Moreover, `df[3, "recoded_value"]` is 3, but the intent is for it to be 4.
I would like to understand what is going on here. My first hypothesis was that R was treating one factor level as `TRUE` and the other as `FALSE`. But this is obviously not the case, because identical factor levels are not being treated identically:
df[c(1,3), ]
#>   value reversed recoded_value
#> 1     1     true             6
#> 3     3     true             3
df[c(2,5), ]
#>   value reversed recoded_value
#> 2     2    false             5
#> 5     5    false             5
What is going on here?
To clarify: I'm not interested in solutions to the problem. I know how to fix the code to produce the intended results. I would like to understand: what is `[` doing to even allow this?
Upvotes: 1
Views: 60
Reputation: 887118
As mentioned in the post, `reversed` is a `factor` and not a `logical` vector. In R, `TRUE`/`FALSE` values are logical, so convert the column to a `logical` vector:
df$reversed <- df$reversed=="true"
As for why the OP's code produces the unexpected output, look at the column itself:
df$reversed
#[1] true false true true false false
#Levels: false true
The levels are in alphabetical order and the storage mode of a `factor` is integer, i.e.
as.integer(df$reversed)
#[1] 2 1 2 2 1 1
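That indexing by a factor behaves like indexing by its integer codes can be checked directly. The following is a minimal sketch on throwaway data; the names `x`, `f`, `codes`, `by_factor`, `by_codes` and `by_label` are illustrative and not part of the original code.

```r
# Hypothetical example data: x and f are illustrative names only.
x <- c(10, 20, 30, 40, 50, 60)
f <- factor(c("true", "false", "true", "true", "false", "false"))

# Levels are sorted alphabetically, so "false" is code 1 and "true" is code 2.
levels(f)
codes <- as.integer(f)

# Indexing by the factor gives the same elements as indexing by its codes.
by_factor <- x[f]
by_codes  <- x[codes]
identical(by_factor, by_codes)

# Selecting by label needs an explicit comparison, which yields a logical index.
by_label <- x[f == "true"]
```

Only the last form selects the elements whose label is "true"; the first two pick positions 1 and 2 repeatedly.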
So when we subset 'recoded_value' using 'reversed', it is subset by those integer codes, used as positions (shown here for 'recoded_value' before the assignment, when it still equals 'value'):
df$recoded_value[df$reversed]
#[1] 2 1 2 2 1 1
That is, the first value in the output is the second observation of 'recoded_value', the second value is the first observation, and so on. If we instead use the correct logical index:
df$recoded_value[df$reversed=="true"]
#[1] 1 3 4
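This also accounts for the exact values the OP saw, where rows with the same level ended up with different results. The sketch below replays the original assignment on plain vectors; `value`, `idx`, `rhs` and `recoded` are illustrative names, with `idx` standing in for `as.integer(df$reversed)`.

```r
# Illustrative stand-ins for df$value and as.integer(df$reversed).
value <- c(1, 2, 3, 4, 5, 6)
idx   <- c(2L, 1L, 2L, 2L, 1L, 1L)

recoded <- value

# The right-hand side is computed from the untouched vector:
# recoded[idx] is 2 1 2 2 1 1, so rhs is 5 6 5 5 6 6.
rhs <- 7 - recoded[idx]

# Only positions 1 and 2 are ever written. Each is written three times,
# always with the same value (6 for position 1, 5 for position 2).
recoded[idx] <- rhs

# Positions 3 to 6 are never touched and keep their original values.
recoded
```

So rows 1 and 2 are overwritten regardless of their own level, while rows 3 to 6 are left alone, which is why the "true" rows 1 and 3 were treated differently.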
Let's check how this behaves once 'reversed' is logical, starting again from the original data (with 'reversed' still a factor) and resetting 'recoded_value':
df$reversed <- df$reversed == "true"
df$recoded_value <- df$value
df$recoded_value[df$reversed] <- 7 - df$recoded_value[df$reversed]
df[c(1,3), ]
#  value reversed recoded_value
#1     1     TRUE             6
#3     3     TRUE             4
df[c(2,5), ]
#  value reversed recoded_value
#2     2    FALSE             2
#5     5    FALSE             5
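Because `[` accepts a factor index without complaint, one way to surface this kind of mistake early is to check the index type before using it. This is only a sketch: `recode_if()` and its `top` argument are hypothetical names introduced here, not something from the OP's code or from base R.

```r
# Hypothetical helper: reverse-code x where flag is TRUE, and stop with an
# error if flag is anything other than a logical vector of matching length.
recode_if <- function(x, flag, top = 7) {
  stopifnot(is.logical(flag), length(flag) == length(x))
  x[flag] <- top - x[flag]
  x
}

value    <- c(1, 2, 3, 4, 5, 6)
reversed <- factor(c("true", "false", "true", "true", "false", "false"))

# Compare the labels explicitly to get a logical index.
flag   <- as.character(reversed) == "true"
result <- recode_if(value, flag)

# Passing the factor itself now fails instead of silently indexing by code.
outcome <- tryCatch(recode_if(value, reversed),
                    error = function(e) "rejected")
```

The check costs one line and turns a silent miscalculation into an immediate error.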
Upvotes: 1