aeongrail
aeongrail

Reputation: 1384

read.table automatic column names

I noticed that when reading a large csv file via

output <- read.table( ..., header = TRUE, sep = ",")

The data frame which was created had some blank columns. These columns followed the naming pattern

 colnames(output)
     "Factor.1"   "Factor.2"   "etc"        "Stuff"      "X"          "X.1"        "X.2"        "X.3"        "X.4"        "X.5"       
     "X.6"        "X.7"        "X.8"        "X.9"        "X.10"       "X.11"       "X.12"       "X.13"      
     "X.14"       "X.15"       "X.16"       "X.17"       "X.18"       "X.19"       "X.20"       "X.21"      
     "X.22"       "X.23"       "X.24"       "X.25"       "X.26"       "X.27"       "X.28"       "X.29"      
     "X.30"       "X.31"       "X.32"       "X.33"

I noticed that in ?read.table it states

col.names: a vector of optional names for the variables. The default is to use "V" followed by the column number.

Why is it using X for me instead of V?

Edit: This is what the csv file looks like

Date,Duration,Count,Factor 1,Factor 2,Factor 3,Hour,Day,Month,Year,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 0:00,9.99,10,GC,LS,FT,0,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 1:00,9.63125,8,GC,LS,FT,1,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 2:00,7.388888889,3,GC,LS,FT,2,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 3:00,7.087037037,9,GC,LS,FT,3,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

...

Upvotes: 2

Views: 3412

Answers (1)

Rich Scriven
Rich Scriven

Reputation: 99331

Here's the relevant code snippet from read.table()

if (header) {
    .External(C_readtablehead, file, 1L, comment.char, 
              blank.lines.skip, quote, sep, skipNul)
    if (missing(col.names)) 
        col.names <- first
    else if (length(first) != length(col.names)) 
        warning("header and 'col.names' are of different lengths")
}

It's if (missing(col.names)) col.names <- first that's important. From there, we can go back and get first, defined for this situation as

first <- scan(textConnection(file), what = "", sep = ",", 
    nlines = 1, quiet = TRUE, skip = 0, strip.white = TRUE)

which results in

#  [1] "Date"     "Duration" "Count"    "Factor 1" "Factor 2" "Factor 3" "Hour"     "Day"      "Month"   
# [10] "Year"     ""         ""         ""         ""         ""         ""         ""         ""        
# [19] ""         ""         ""         ""         ""         ""         ""         ""         ""        
# [28] ""         ""         ""         ""         ""         ""         ""         ""         ""        
# [37] ""         ""         ""         ""         ""         ""         ""         ""        

Then later on, make.names() is called on col.names, resulting in your names

make.names(first, unique = TRUE)
#  [1] "Date"     "Duration" "Count"    "Factor.1" "Factor.2" "Factor.3" "Hour"     "Day"      "Month"   
# [10] "Year"     "X"        "X.1"      "X.2"      "X.3"      "X.4"      "X.5"      "X.6"      "X.7"     
# [19] "X.8"      "X.9"      "X.10"     "X.11"     "X.12"     "X.13"     "X.14"     "X.15"     "X.16"    
# [28] "X.17"     "X.18"     "X.19"     "X.20"     "X.21"     "X.22"     "X.23"     "X.24"     "X.25"    
# [37] "X.26"     "X.27"     "X.28"     "X.29"     "X.30"     "X.31"     "X.32"     "X.33"    

The reason why we got X and not V as noted in the docs is because the next condition after if(header) is

else if (missing(col.names)) 
    col.names <- paste0("V", 1L:cols) 

But we never made it to that statement, and make.names() concatenates to X by default. There's a bit more to it than just this explanation. The best thing to do would be to go though the read.table source (it's complicated).

Data:

file <- "Date,Duration,Count,Factor 1,Factor 2,Factor 3,Hour,Day,Month,Year,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 0:00,9.99,10,GC,LS,FT,0,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 1:00,9.63125,8,GC,LS,FT,1,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 2:00,7.388888889,3,GC,LS,FT,2,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 3:00,7.087037037,9,GC,LS,FT,3,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"

Upvotes: 5

Related Questions