Adam Liter

Reputation: 925

R: Accidentally subsetting a data frame using a factor column as if it were logical

I inherited some legacy R code that recodes the values in one column based on the value of another column in the same row. That other column was mistakenly assumed to be logical when, in reality, its values were strings that had been converted to a factor, like so:

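df <- data.frame(value = c(1, 2, 3, 4, 5, 6),
                 reversed = c("true", "false",
                              "true", "true",
                              "false", "false"),
                 # strings became factors by default before R 4.0.0;
                 # request it explicitly so the example reproduces
                 stringsAsFactors = TRUE)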

str(df)
#> 'data.frame':    6 obs. of  2 variables:
#>  $ value   : num  1 2 3 4 5 6
#>  $ reversed: Factor w/ 2 levels "false","true": 2 1 2 2 1 1

df$recoded_value <- df$value
df$recoded_value[df$reversed] <- 7 - df$recoded_value[df$reversed]

Inspecting the result shows that this does not do what was intended. df[2, "recoded_value"] is 5, but the intent is for it to be 2. Moreover, df[3, "recoded_value"] is 3, but the intent is for it to be 4.

I would like to understand what is going on here. My first hypothesis was that R was treating one factor level as TRUE and the other as FALSE. But this is obviously not the case because identical factor levels are not being treated identically:

df[c(1,3), ]
#>   value reversed recoded_value
#> 1     1     true             6
#> 3     3     true             3

df[c(2,5), ]
#>   value reversed recoded_value
#> 2     2    false             5
#> 5     5    false             5

What is going on here?

To clarify: I'm not interested in solutions to the problem. I know how to fix the code to produce the intended results. I would like to understand:

  1. Why does this code work at all? How can you subset on the basis of a factor column? What is `[` doing to even allow this?
  2. Why are the things that are the same value (i.e., same level of a factor) being treated differently?

Upvotes: 1

Views: 60

Answers (1)

akrun

Reputation: 887118

As mentioned in the post, 'reversed' is a factor, not a logical vector. In R, only TRUE/FALSE values are logical, so convert the column to a logical vector:

df$reversed <- df$reversed=="true"

As for why the OP's code gives unexpected output, look at the original factor column (before the conversion above):

df$reversed
#[1] true  false true  true  false false
#Levels: false true

The levels are sorted alphabetically, and a factor is stored as integer codes that point into those levels, i.e.

as.integer(df$reversed)
#[1] 2 1 2 2 1 1

So when we subset 'recoded_value' using 'reversed', `[` treats the factor as its integer codes and uses them as positions, not as TRUE/FALSE values:

df$recoded_value[df$reversed]
#[1] 2 1 2 2 1 1
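
A minimal sketch of the same behaviour on a plain vector, in case it helps. The names x and f and the values 10 to 60 are made up for illustration (chosen so that values and positions cannot be confused); they are not from the OP's data.

```r
# Hypothetical data: x and f are illustrative names, not from the post.
x <- c(10, 20, 30, 40, 50, 60)
f <- factor(c("true", "false", "true", "true", "false", "false"))

typeof(f)       # "integer": a factor is stored as integer codes
as.integer(f)   # 2 1 2 2 1 1, codes into levels(f) = "false", "true"

by_factor <- x[f]                # what indexing by a factor does
by_codes  <- x[as.integer(f)]    # explicit positional indexing
by_label  <- x[as.character(f)]  # lookup by name; x has no names, so all NA
by_logic  <- x[f == "true"]      # the logical mask that was intended

by_factor                        # 20 10 20 20 10 10
identical(by_factor, by_codes)   # TRUE
by_logic                         # 10 30 40
```

So the factor labels never take part in the subsetting; only the codes 1 and 2 do.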

i.e. the first value in the output is the second observation of 'recoded_value', the second value is the first observation, and so on. Only positions 1 and 2 are ever selected. If we instead use the correct logical index:

df$recoded_value[df$reversed=="true"]
#[1] 1 3 4
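
To see why rows with the same factor level end up with different results, the OP's assignment can be replayed with the integer codes written out. This is only a sketch: value, codes, rhs and recoded are stand-in names for the columns in the post.

```r
# Stand-ins for df$value and as.integer(df$reversed) from the question.
value <- c(1, 2, 3, 4, 5, 6)
codes <- c(2L, 1L, 2L, 2L, 1L, 1L)

recoded <- value
rhs <- 7 - recoded[codes]   # 7 - c(2, 1, 2, 2, 1, 1) gives 5 6 5 5 6 6
recoded[codes] <- rhs       # writes only to positions 1 and 2, three times each

recoded                     # 6 5 3 4 5 6
```

Position 1 receives 6 and position 2 receives 5 (every repeated write to a position carries the same value here, so the order does not matter), while positions 3 to 6 are never touched. That is why rows 1 and 3 differ even though both are "true", and why row 2 changed even though it is "false".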

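Let's check how this behaves once 'reversed' is logical, starting again from the original data frame (with 'reversed' still a factor and 'recoded_value' a fresh copy of 'value'):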

df$reversed <- df$reversed=="true"
df$recoded_value[df$reversed] <- 7 - df$recoded_value[df$reversed]
df[c(1,3), ]
#  value reversed recoded_value
#1     1     TRUE             6
#3     3     TRUE             4
df[c(2,5),]
#  value reversed recoded_value
#2     2    FALSE             2
#5     5    FALSE             5
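
If the aim is to make this kind of mistake fail loudly instead of silently, one option is to check the index type before using it. The helper below is a hypothetical sketch (recode_reversed and its scale_top argument are invented names, not part of the OP's code or of any package), assuming the recoding is always "scale_top minus value".

```r
# Hypothetical helper: refuses anything that is not a logical mask.
recode_reversed <- function(value, reversed, scale_top = 7) {
  stopifnot(is.logical(reversed), length(reversed) == length(value))
  out <- value
  out[reversed] <- scale_top - out[reversed]
  out
}

value        <- c(1, 2, 3, 4, 5, 6)
reversed_chr <- c("true", "false", "true", "true", "false", "false")

recoded <- recode_reversed(value, reversed_chr == "true")
recoded   # 6 2 4 3 5 6

# A factor index is rejected instead of being used as positions.
bad <- try(recode_reversed(value, factor(reversed_chr)), silent = TRUE)
inherits(bad, "try-error")   # TRUE
```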

Upvotes: 1
