hfisch

Reputation: 1332

Reading lines near comments in R with read.table

I am reading in a number of text files that contain lines of data with a few header rows at the top containing information for the data, like so:

Test file
#
File information
1 2 3 4
#
a 2
b 4
c 6
d 8

I would like to read in the various pieces of information individually from this file. I can accomplish this just fine like so:

file <- read.table(txt, nrow = 1)
name <- read.table(txt, nrow = 1, skip = 2)
vals <- read.table(txt, nrow = 1, skip = 3)
data <- read.table(txt,           skip = 5)

Due to the two blank comment lines, I could also have read in the data like this:

file <- read.table(txt, nrow = 1)
name <- read.table(txt, nrow = 1, skip = 1)  # Skip changed from 2
vals <- read.table(txt, nrow = 1, skip = 3)
data <- read.table(txt,           skip = 4)  # Skip changed from 5

This is well and good, but the text files do not always have the same number of blank comment lines; sometimes they are present, sometimes they aren't. If I lose either (or both) of the comment lines in my example text file, neither of my solutions continues to work.

Is there a more robust way to read in a text file, one where the skip argument never counts a comment line?

Upvotes: 0

Views: 1523

Answers (1)

r2evans

Reputation: 160437

(Assumption: after the file metadata at the top, once the data starts, there are no more comments.)

(textConnection(...) is used here so that functions expecting a file connection will read from a character string instead. With a real file, replace the textConnection(txt) call with the filename.)

One technique is to read the first n lines of a file (some number "guaranteed" to include all of the commented/non-data rows), find the last comment line among them, and then deal with everything before it and everything after it accordingly:

txt <- "Test file
#
File information
1 2 3 4
#
a 2
b 4
c 6
d 8"
max_comment_lines <- 8
(dat <- readLines(textConnection(txt), n = max_comment_lines))
# [1] "Test file"        "#"                "File information" "1 2 3 4"         
# [5] "#"                "a 2"              "b 4"              "c 6"             
(skip <- max(grep("^\\s*#", dat)))
# [1] 5

(BTW: you should probably check that there are in fact comments. If there are none, grep() returns integer(0), max() of that gives -Inf with a warning, and the read* functions don't accept that as a skip argument.)
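The check mentioned above could look something like this. It is only a sketch: `last_comment` and `comment_rows` are names made up for illustration, and falling back to 0 (skip nothing) when no comment line is found is an assumption about what you want in that case.

```r
# Same sample lines as above, plus a variant without any comment lines.
dat <- c("Test file", "#", "File information", "1 2 3 4",
         "#", "a 2", "b 4", "c 6")
no_comments <- c("a 2", "b 4", "c 6")

# Hypothetical helper: index of the last comment line, or 0 if there is none.
last_comment <- function(lines) {
  comment_rows <- grep("^\\s*#", lines)
  if (length(comment_rows) > 0) max(comment_rows) else 0L
}

last_comment(dat)
# [1] 5
last_comment(no_comments)
# [1] 0
```

With a result of 0, read.table(..., skip = 0) treats the whole file as data, so this only makes sense if a file without comments also has no header rows.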

Now that we "know" that the last found comment is on line 5, we can use the first 4 lines to get header info ...

meta <- readLines(textConnection(txt), n = skip - 1)
meta <- meta[! grepl("^\\s*#", meta) ] # remove the comment rows themselves
meta
# [1] "Test file"        "File information" "1 2 3 4"         
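If you need the individual pieces rather than one character vector, something along these lines might do. It assumes the metadata always arrives in the same order (title, description, values) once the comment rows are dropped; `file_title`, `info` and `vals` are names chosen here for illustration.

```r
# The metadata lines as recovered above, comment rows already removed.
meta <- c("Test file", "File information", "1 2 3 4")

file_title <- meta[1]   # assumed: first line is the file title
info       <- meta[2]   # assumed: second line is the description
# scan() parses the whitespace-separated numbers on the third line.
vals <- scan(text = meta[3], quiet = TRUE)

vals
# [1] 1 2 3 4
```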

... and skip 5 lines to get to the data.

dat <- read.table(textConnection(txt), skip = skip)
str(dat)
# 'data.frame': 4 obs. of  2 variables:
#  $ V1: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
#  $ V2: int  2 4 6 8
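Putting the steps together for real files, a rough sketch might look like the following. `read_commented` is a hypothetical helper, not something from base R, and the default of 8 for `max_comment_lines` is just the value used above. It still relies on the assumption that the last comment line sits directly before the data; if that particular line is missing, the header rows cannot be told apart from the data this way.

```r
# Hypothetical helper: split a file into metadata lines and a data frame,
# using the last comment line found in the first `max_comment_lines` lines.
read_commented <- function(path, max_comment_lines = 8) {
  head_lines <- readLines(path, n = max_comment_lines)
  is_comment <- grepl("^\\s*#", head_lines)
  # Fall back to skipping nothing when no comment line is found.
  skip <- if (any(is_comment)) max(which(is_comment)) else 0L
  meta <- head_lines[seq_len(skip)]
  meta <- meta[!grepl("^\\s*#", meta)]   # drop the comment rows themselves
  list(meta = meta, data = read.table(path, skip = skip))
}

# Demo on a temporary copy of the example file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Test file", "#", "File information", "1 2 3 4", "#",
             "a 2", "b 4", "c 6", "d 8"), tmp)
res <- read_commented(tmp)

# Same file with the first comment line missing: the result is unchanged.
writeLines(c("Test file", "File information", "1 2 3 4", "#",
             "a 2", "b 4", "c 6", "d 8"), tmp)
res2 <- read_commented(tmp)
```

In both cases `res$meta` holds the three header lines and `res$data` the four data rows, because nothing in the function depends on a hard-coded skip value.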

Upvotes: 2
