Reputation: 4481

Legal column names in R and consequences of syntactically invalid column names

df <- read.csv(
  text = '"2019-Jan","2019-Feb",
  "3","1"', 
  check.names = FALSE
  )

OK, so I use check.names = FALSE and now my column names are not syntactically valid. What are the practical consequences?

df
#>   2019-Jan 2019-Feb   
#> 1        3        1 NA

And why is this NA appearing in my data frame? I didn't put that in my code. Or did I?

Here's the description of check.names from the read.csv man page, for reference:

check.names logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names) so that they are, and also to ensure that there are no duplicates.

Upvotes: 1

Answers (3)

Konrad Rudolph

Reputation: 545488

The only consequence is that you need to escape or quote the names to work with them. You either string-quote and use standard evaluation with the [[ column subsetting operator:

df[['2019-Jan']]

… or you escape the identifier name with backticks (R confusingly also calls this quoting), and use the $ subsetting:

df$`2019-Jan`

Both work, and can be used freely (as long as they don’t lead to exceedingly unreadable code).
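As a minimal sketch of both forms (the data frame is built by hand here rather than read from CSV, and `total` is just an illustrative name), the same backtick rule applies anywhere the column name appears as code, for example inside with():

```r
# Build a data frame with syntactically invalid names (illustrative data)
df <- data.frame(`2019-Jan` = 3, `2019-Feb` = 1, check.names = FALSE)

# String-quoted name with [[ and backtick-escaped name with $
jan_by_string   <- df[["2019-Jan"]]
jan_by_backtick <- df$`2019-Jan`

# Inside expressions, the names must be backtick-escaped as well;
# unescaped, 2019-Jan would be parsed as the subtraction 2019 - Jan
total <- with(df, `2019-Jan` + `2019-Feb`)
```

Both accessors return the same column, so the choice is mostly about readability.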

To make matters more confusing, R allows using '…' and "…" instead of `…` in certain contexts:

df$'2019-Jan'

Here, '2019-Jan' is not a character string as far as R is concerned! It’s an escaped identifier name.¹

This last one is a really bad idea because it confuses names² with character strings, which are fundamentally different. The R documentation advises against this. Personally I’d go further: writing 'foo' instead of `foo` to refer to a name should become a syntax error in future versions of R.

¹ Kind of. The R parser treats it as a character string. In particular, both ' and " can be used, and are treated identically. But during the subsequent evaluation of the expression, it is treated as a name.

² “Names”, or “symbols”, in R refer to identifiers in code that denote a variable or function parameter. As such, a name is either (a) a function name, (b) a non-function variable name, (c) a parameter name in a function declaration, or (d) an argument name in a function call.
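To illustrate footnote ¹, here is a small sketch that inspects the unevaluated calls. The variable names `string_form` and `backtick_form` are only illustrative, and the data frame is built by hand:

```r
# Capture both spellings without evaluating them
string_form   <- quote(df$'2019-Jan')
backtick_form <- quote(df$`2019-Jan`)

# The third element of each call is the part after the $
parsed_as_string <- is.character(string_form[[3]])
parsed_as_name   <- is.name(backtick_form[[3]])

# Evaluating either call against a matching data frame extracts the same column
df <- data.frame(`2019-Jan` = 3, check.names = FALSE)
same_result <- identical(eval(string_form), eval(backtick_form))
```

The parsed objects differ (a string constant versus a name), yet `$` evaluates both to the same column, which is why the two are so easy to confuse.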

Upvotes: 3

IceCreamToucan

Reputation: 28675

The NA issue is unrelated to the names. read.csv expects no comma after the last column. Your header line ends with a comma, so read.csv treats the empty field after "2019-Feb" as the name of a third column. The data line has no value for this column, so it is filled with NA.

Remove the extra comma and it reads properly. Of course, it may be easier to just remove the last column after using read.csv.

df <- read.csv(
  text = '"2019-Jan","2019-Feb"
  "3","1"', 
  check.names = FALSE
  )

df
#   2019-Jan 2019-Feb
# 1        3        1
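If editing the input is not practical, the other option mentioned above is to drop the extra column after reading. A minimal sketch, assuming the unwanted column is always the last one (the text is written on one line with \n here purely for compactness):

```r
# Read the original input, trailing comma in the header included
df <- read.csv(
  text = '"2019-Jan","2019-Feb",\n"3","1"',
  check.names = FALSE
)

# Drop the last column, which holds only the NA from the empty header field
df <- df[-ncol(df)]
```

Dropping by position only makes sense when the stray column is known to be last; otherwise it is safer to check names(df) first.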

Upvotes: 3

Hugh

Reputation: 16090

Consider df$foo where foo is a column name. Syntactically invalid names will not work in this form unless they are quoted with backticks.

As for the NA, it’s a consequence of there being three fields in your first line (the header) and only two in your second.

Upvotes: 2
