Reputation: 1384
I noticed that when reading a large CSV file via
output <- read.table( ..., header = TRUE, sep = ",")
the resulting data frame had some blank columns. These columns followed the naming pattern
colnames(output)
"Factor.1" "Factor.2" "etc" "Stuff" "X" "X.1" "X.2" "X.3" "X.4" "X.5"
"X.6" "X.7" "X.8" "X.9" "X.10" "X.11" "X.12" "X.13"
"X.14" "X.15" "X.16" "X.17" "X.18" "X.19" "X.20" "X.21"
"X.22" "X.23" "X.24" "X.25" "X.26" "X.27" "X.28" "X.29"
"X.30" "X.31" "X.32" "X.33"
I noticed that ?read.table states:
col.names: a vector of optional names for the variables. The default is to use "V" followed by the column number.
Why is it using X for me instead of V?
Edit: This is what the csv file looks like
Date,Duration,Count,Factor 1,Factor 2,Factor 3,Hour,Day,Month,Year,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 0:00,9.99,10,GC,LS,FT,0,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 1:00,9.63125,8,GC,LS,FT,1,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 2:00,7.388888889,3,GC,LS,FT,2,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 3:00,7.087037037,9,GC,LS,FT,3,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...
Upvotes: 2
Views: 3412
Reputation: 99331
Here's the relevant code snippet from read.table()
if (header) {
    .External(C_readtablehead, file, 1L, comment.char,
              blank.lines.skip, quote, sep, skipNul)
    if (missing(col.names))
        col.names <- first
    else if (length(first) != length(col.names))
        warning("header and 'col.names' are of different lengths")
}
The important part is the line
if (missing(col.names)) col.names <- first
From there, we can go back and see how first is obtained. For this input it is equivalent to
first <- scan(textConnection(file), what = "", sep = ",",
              nlines = 1, quiet = TRUE, skip = 0, strip.white = TRUE)
which results in
# [1] "Date" "Duration" "Count" "Factor 1" "Factor 2" "Factor 3" "Hour" "Day" "Month"
# [10] "Year" "" "" "" "" "" "" "" ""
# [19] "" "" "" "" "" "" "" "" ""
# [28] "" "" "" "" "" "" "" "" ""
# [37] "" "" "" "" "" "" "" ""
Later on, make.names() is called on col.names, which produces your names:
make.names(first, unique = TRUE)
# [1] "Date" "Duration" "Count" "Factor.1" "Factor.2" "Factor.3" "Hour" "Day" "Month"
# [10] "Year" "X" "X.1" "X.2" "X.3" "X.4" "X.5" "X.6" "X.7"
# [19] "X.8" "X.9" "X.10" "X.11" "X.12" "X.13" "X.14" "X.15" "X.16"
# [28] "X.17" "X.18" "X.19" "X.20" "X.21" "X.22" "X.23" "X.24" "X.25"
# [37] "X.26" "X.27" "X.28" "X.29" "X.30" "X.31" "X.32" "X.33"
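To see the two effects of make.names() separately, here is a minimal sketch on a short made-up header vector. The name hdr and its contents are hypothetical and not taken from the file in the question; only base R is assumed.

```r
# Hypothetical header fields: one valid name, one with a space, two empty.
hdr <- c("Date", "Factor 1", "", "")

# Without unique = TRUE, every empty string maps to the same name.
plain <- make.names(hdr)
print(plain)
# [1] "Date"     "Factor.1" "X"        "X"

# unique = TRUE appends .1, .2, ... to later duplicates.
uniq <- make.names(hdr, unique = TRUE)
print(uniq)
# [1] "Date"     "Factor.1" "X"        "X.1"
```

The space in "Factor 1" becomes a dot for the same reason the empty strings become X: neither is a syntactically valid R name as it stands.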
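We got X rather than the V mentioned in the docs because the V default lives in the else branch that follows if (header):
else if (missing(col.names))
    col.names <- paste0("V", 1L:cols)
With header = TRUE we never reach that statement. Instead, the header fields are used as names, and make.names() turns each empty string into X because an empty string is not a syntactically valid name; unique = TRUE then appends .1, .2, and so on to tell the duplicates apart. There's a bit more to it than this explanation. The best thing to do would be to go through the read.table source (it's complicated).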
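The difference between the two branches can be checked directly. This is only a small sketch on a hypothetical three-line CSV (the string csv is made up for illustration, with two trailing empty fields per line like the file in the question), reading it once with and once without a header:

```r
# Hypothetical three-line CSV with two trailing empty fields per line.
csv <- "a,b,,\n1,2,,\n3,4,,"

# header = TRUE: names come from the first line, then pass through make.names().
with_header <- read.table(text = csv, header = TRUE, sep = ",")
print(names(with_header))
# [1] "a"   "b"   "X"   "X.1"

# header = FALSE: there is no header to borrow from, so the documented
# "V" default applies and the first line is treated as data.
no_header <- read.table(text = csv, header = FALSE, sep = ",")
print(names(no_header))
# [1] "V1" "V2" "V3" "V4"
```

So the docs are describing the header = FALSE case; with a header, the "V" names never come into play.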
Data:
file <- "Date,Duration,Count,Factor 1,Factor 2,Factor 3,Hour,Day,Month,Year,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 0:00,9.99,10,GC,LS,FT,0,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 1:00,9.63125,8,GC,LS,FT,1,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 2:00,7.388888889,3,GC,LS,FT,2,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1/1/2012 3:00,7.087037037,9,GC,LS,FT,3,7,1,2012,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"
Upvotes: 5