Reputation: 1332
I am reading in a number of text files that contain lines of data with a few header rows at the top containing information for the data, like so:
Test file
#
File information
1 2 3 4
#
a 2
b 4
c 6
d 8
I would like to read in the various pieces of information individually from this file. I can accomplish this just fine like so:
file <- read.table(txt, nrow = 1)
name <- read.table(txt, nrow = 1, skip = 2)
vals <- read.table(txt, nrow = 1, skip = 3)
data <- read.table(txt, skip = 5)
Due to the two blank comment lines, I could also have read in the data like this:
file <- read.table(txt, nrow = 1)
name <- read.table(txt, nrow = 1, skip = 1) # Skip changed from 2
vals <- read.table(txt, nrow = 1, skip = 3)
data <- read.table(txt, skip = 4) # Skip changed from 5
This is well and good, but the text files do not always have the same number of blank comment lines; sometimes they are present, sometimes they aren't. If I lose either (or both) of the comment lines in my example text file, neither of my solutions continues to work.
Is there a more robust way to read in a text file where the skip argument will never count a comment line?
Upvotes: 0
Views: 1523
Reputation: 160437
(Assumption: after the file metadata at the top, once the data starts, there are no more comments.)
(The use of textConnection(...) is to trick functions expecting file connections into processing a character string. When reading a real file, replace the textConnection(...) call with the filename.)
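To make that substitution concrete, here is a small sketch that reads the same text once through textConnection() and once from a file on disk. The temporary file and the names example_text and path are only for illustration and are not part of the original answer.

```r
# Illustrative two-line input; `example_text` and `path` are made-up names.
example_text <- "a 2\nb 4"

# Reading from a character string via a text connection ...
from_string <- read.table(textConnection(example_text))

# ... and reading the same content from a real file by passing its name.
path <- tempfile(fileext = ".txt")
writeLines(example_text, path)
from_file <- read.table(path)

# Both calls should produce equivalent data frames.
all.equal(from_string, from_file)

unlink(path)  # clean up the temporary file
```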
One technique is to read the first n lines of a file (some number "guaranteed" to include all of the commented/non-data rows), find the last comment line among them, and then deal with all-before and all-after accordingly:
txt <- "Test file
#
File information
1 2 3 4
#
a 2
b 4
c 6
d 8"
max_comment_lines <- 8
(dat <- readLines(textConnection(txt), n = max_comment_lines))
# [1] "Test file" "#" "File information" "1 2 3 4"
# [5] "#" "a 2" "b 4" "c 6"
(skip <- max(grep("^\\s*#", dat)))
# [1] 5
(BTW: you should probably check that there are in fact comments. If there are none, grep returns integer(0) and max() of that gives -Inf with a warning, and the read* functions don't accept that as a skip argument.)
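One possible way to write that check is sketched below. The helper name last_comment_line is hypothetical (it is not part of base R), and falling back to 0 when no comment is found is an assumption: it means "skip nothing", which is only right if a file without comments also has no header lines.

```r
# Hypothetical helper: index of the last comment line, or 0 if there is none.
last_comment_line <- function(lines, pattern = "^\\s*#") {
  idx <- grep(pattern, lines)
  if (length(idx) == 0L) 0L else max(idx)
}

with_comments    <- c("Test file", "#", "File information", "1 2 3 4", "#", "a 2")
without_comments <- c("a 2", "b 4")

skip_with    <- last_comment_line(with_comments)     # last "#" is on line 5
skip_without <- last_comment_line(without_comments)  # no comments, so skip nothing
```

Either value can then be passed as skip to read.table() without triggering the -Inf warning.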
Now that we "know" that the last found comment is on line 5, we can use the first 4 lines to get header info ...
meta <- readLines(textConnection(txt), n = skip - 1)
meta <- meta[! grepl("^\\s*#", meta) ] # remove the comment rows themselves
meta
# [1] "Test file" "File information" "1 2 3 4"
... and skip 5 lines to get to the data.
dat <- read.table(textConnection(txt), skip = skip)
str(dat)
# 'data.frame': 4 obs. of 2 variables:
# $ V1: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
# (in R >= 4.0.0, V1 is chr instead, since stringsAsFactors defaults to FALSE)
# $ V2: int 2 4 6 8
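If the comment lines themselves may come and go, as in the question, a different approach from the one above is to drop every comment line first and then pick out the remaining lines by position. The sketch below is not the method shown in this answer. It assumes the non-comment layout is fixed (title, name, one line of values, then data) and that the whole file fits in memory. read_sections is a hypothetical helper name.

```r
# Hypothetical helper: takes the lines of a file (e.g. from readLines(filename)),
# removes comment rows, and splits what is left by position.
read_sections <- function(lines, comment_pattern = "^\\s*#") {
  lines <- lines[!grepl(comment_pattern, lines)]   # drop comment rows
  list(
    file = lines[1],
    name = lines[2],
    vals = scan(text = lines[3], quiet = TRUE),    # numeric vector of values
    data = read.table(text = lines[-(1:3)])        # everything after the header
  )
}

# The example input, with and without its "#" lines.
with_comments <- c("Test file", "#", "File information", "1 2 3 4", "#",
                   "a 2", "b 4", "c 6", "d 8")
no_comments   <- with_comments[with_comments != "#"]

res_with    <- read_sections(with_comments)
res_without <- read_sections(no_comments)

# The parsed pieces should not depend on whether the comment lines were present.
identical(res_with, res_without)
```

Because the comment rows are removed before any indexing happens, no skip value is needed at all, at the cost of reading the whole file into memory first.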
Upvotes: 2