Sam

Reputation: 518

R: Subsetting using a nonexistent variable gives no error

While creating lead variables, I accidentally omitted the lead of the variable by which my data is grouped. I was using bracket assignment to insert NA, and no error was reported. As a sanity check, I did the same with ifelse, and R raised an error. My concern is that without careful review, and some luck, I might never have noticed my mistake.

How have others coded differently to make such mistakes less likely in the future (at minimal cost in time)? Also, are there other similar pitfalls I should be aware of? Thanks; a reproducible example is below.

dt <- data.frame(
group_name = c("D44", "D44","D44", "D45", "D45", "D47", "D47", "D47", "D47", "D48"),
order_number = sample(1:10))

dt$group_name <- as.character(dt$group_name) # so not a factor

dt <- dt[order(dt$group_name, dt$order_number),] # sort data

dt$lead1order_number <- c(dt$order_number[-1], NA)

# COMMENT OUT NEXT LINE AND RUN, no error with brackets, but one with ifelse
dt$lead1group_name <- c(dt$group_name[-1], NA) 

# done two different ways below
    # if group_name doesn't match lead1group_name, then lead1order_number NA
dt$lead1order_number[dt$group_name != dt$lead1group_name] <- NA  

dt$lead1order_number <- ifelse(dt$group_name != dt$lead1group_name, NA, dt$lead1order_number)

Upvotes: 1

Views: 126

Answers (1)

Ekatef

Reputation: 1061

Your question is a deep one. The behaviour of brackets (subsetting) is one of the key features of R, and it is difficult to answer your question comprehensively. I will just propose one of the simplest possible solutions:

# `stringsAsFactors = FALSE` ensures that strings will not be transformed to factors
dt <- data.frame(group_name = c("D44", "D44","D44", "D45", 
    "D45", "D47", "D47", "D47", "D47", "D48"),
    order_number = sample(1:10), stringsAsFactors = FALSE)
dt <- dt[order(dt$group_name, dt$order_number),] # sort data
dt$lead1order_number <- c(dt$order_number[-1], NA)
# the example was slightly modified to demonstrate subsetting with NA
dt$lead1group_name <- c(dt$group_name[-c(1:2)], NA, "D")

Let's suppose we need a column "lead2group_name" that is missing from our data frame. The key point I propose to exploit is that different subsetting methods give different results:

Simplifying subsetting with $ or [[ returns NULL and raises no error:

print(dt$lead2group_name)
> NULL

Preserving subsetting with [ results in an error:

print(dt[ ,"lead2group_name", drop = FALSE])

Error in [.data.frame(dt, , "lead2group_name") : undefined columns selected
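To illustrate why the bracket assignment in the question stays silent while ifelse fails, here is a minimal sketch on a small made-up data frame. The three-row data and the names missing_col, cmp, before and res are only for illustration and are not part of the original example:

```r
# Small illustrative data frame (hypothetical, not the data from the question)
dt <- data.frame(group_name = c("D44", "D44", "D45"),
                 order_number = 1:3, stringsAsFactors = FALSE)

missing_col <- dt$lead2group_name     # NULL: `$` does not complain
cmp <- dt$group_name != missing_col   # comparing with NULL gives logical(0)

before <- dt$order_number
dt$order_number[cmp] <- NA            # zero-length index: nothing is assigned
unchanged <- identical(before, dt$order_number)

# ifelse() on logical(0) returns logical(0); assigning a zero-length
# vector as a data frame column is what finally raises the error
res <- tryCatch({
  dt$order_number <- ifelse(cmp, NA, dt$order_number)
  "no error"
}, error = function(e) "error")
```

So, as far as I can tell, the bracket version is not really "accepting" the missing column; it is just given an empty index and therefore has nothing to do.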

I would use this behaviour to make sure that the requested column exists in the data.frame:

ind_of_non_match <- which(dt[ ,"group_name", drop = FALSE] != dt[ ,"lead1group_name", drop = FALSE])
ind_of_na <- which(is.na(dt[ , "lead1group_name", drop = FALSE]))
dt$lead1order_number[c(ind_of_non_match, ind_of_na)] <- NA
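If you prefer to keep using $, another possible way to get the same safety is to check the column names explicitly before touching the data. The sketch below assumes a small helper, require_cols, which is a hypothetical name of my own and not a base R function:

```r
# Hypothetical helper: stop early if any expected column is absent
require_cols <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  invisible(df)
}

# Illustrative data (not the data from the question)
dt_check <- data.frame(group_name = c("D44", "D45"),
                       order_number = 1:2, stringsAsFactors = FALSE)

ok <- tryCatch({ require_cols(dt_check, "group_name"); TRUE },
               error = function(e) FALSE)
msg <- tryCatch({ require_cols(dt_check, c("group_name", "lead1group_name")); "" },
                error = function(e) conditionMessage(e))
```

Calling such a check once at the top of a script costs very little time, and the error message names the column that was forgotten.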

Please note that a one-step approach

dt$lead1order_number[(dt[ ,"group_name", drop = FALSE] != dt[ ,"lead1group_name", drop = FALSE])] <- NA

silently ignores NA values of "lead1group_name". That doesn't seem to be the safest way, which is why I would rather use which() to handle rows where "lead1group_name" differs from "group_name" separately from rows where "lead1group_name" is NA.
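A minimal sketch of this NA behaviour on a plain vector may help. The vector x and the condition cond are made up for illustration:

```r
x <- c(10, 20, 30, 40)
cond <- c(TRUE, NA, FALSE, TRUE)   # second comparison result is unknown

# Logical index with an NA: the NA position is skipped without any warning
y <- x
y[cond] <- NA

# which() drops NA as well, but then the NA rows can be added explicitly
w <- x
w[c(which(cond), which(is.na(cond)))] <- NA
```

Here y keeps the value 20 in the second position, while w has it replaced by NA, which mirrors the difference between the one-step approach and the which() version.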

I hope this is useful for your current work. As for your general concerns about subsetting and assignment, you may find it useful to look at ?Extract in the R help and to study the subsetting methods in more detail using R tutorials.

Upvotes: 1
