Extract values from comments (#) in R

Question

thanks in advance for all this

So, I have a data frame which comes from a .tsv similar to this:

##ID_value=1829
##exportDate=1-18-2019 
ChemID    BasedMaterial  State
MSO11D    Oxygen         Gas
GSX55E    Carbon         Liquid

Right ? So, what I simply want to do is to add a new column named ID filled with the values from the ID_value from the comments in ##, so as to get something like this:

ID    ChemID    BasedMaterial  State
1829  MSO11D    Oxygen         Gas
1829  GSX55E    Carbon         Liquid

The thing is, when I import from .tsv, I loose all the comment values, which is good, I don't actually want them in my output file (an excel table). But by doing so, I also loose that information, which would be useful for the exposed purpose.

So, is there a way I could use the values from the comments in ## so as to create said column and also get rid of these comments for the creation of a table ? Thanks very much in advance

r2evans · Accepted Answer

While StupidWolf's answer works, I think usurping the proven read.table's file operation in lieu of parsing text is generally not a good idea: as the file gets larger, this pays a penalty (anecdotally, 20% increase at 100k lines, more for larger).

If the pattern is known to be at the top, then read the top few lines, find the relevant portion, then call read.table on the original file (with the original arguments).

#' @param file 'character', the name of the file which the data are to be read from
#' @param ... other arguments passed to 'read.table'
#' @param meta_char 'character', the string (or pattern) that indicates a 'key=val' or 'note'
#' @param meta_rows 'integer', maximum number of rows to look for meta
#' @param meta_unnamed 'character', used for column-header of meta when no '=' is found
#' @param meta_skip_more 'integer', number of lines beyond the meta rows to skip for real data
#' @return 'data.frame', with any meta data augmented as columns
read_table_with_meta <- function(file, ...,
                                 meta_char = "#", meta_rows = 10L, meta_unnamed = "meta",
                                 meta_skip_more = 0L) {
  toplines <- readLines(file, n = meta_rows)
  meta_ptn <- paste0("^", meta_char)

  dots <- list(...)
  if ("skip" %in% names(dots)) {
    warning("'skip' is determined by 'read_table_with_meta' and should not be assigned; if you need to skip more rows after meta rows, then use 'meta_skip_more'; 'skip=' is ignored here")
    dots$skip <- NULL
  }

  if (all(grepl(meta_ptn, toplines))) {
    stop("all lines looked like header rows, suggest you increase 'meta_rows'")
  }

  toplines <- toplines[ grepl(meta_ptn, toplines) ]
  skip <- length(toplines) + meta_skip_more
  toplines <- gsub(paste0("^", meta_char, "+\s*"), "", toplines)

  if (length(toplines)) {
    keys <- gsub("\s*=.*", "", toplines)
    vals <- gsub("^[^=]*\s*=\s*", "", toplines)
    unnamed <- (keys == vals)
    keys[unnamed] <- paste0(meta_unnamed, seq_along(keys[unnamed]))
    keyvals <- setNames(as.list(vals), keys)
  } else keyvals <- NULL

  dat <- do.call("read.table", c(list(file, skip = skip), dots))
  if (is.null(keyvals)) dat else cbind(dat, keyvals)
}

Notes:

this only searches the top 10 lines (by default), thinking that you should not try to parse the entire file once you find a non-commented-out line; if you have comments mid-file, then this answer is insufficient;
this function assigns all of these rows to fields; this may not be the most generic way of handling this, but I think it addresses what you were asking for; once read in, you can discard fields you don't want;
the unnamed portion is in case not all commented-out headers have an = in them; just a trick, not certain it's necessary or useful to you.

Demonstration:

### safe with no-meta files
text=c("ChemID    BasedMaterial  State", "MSO11D    Oxygen         Gas", "GSX55E    Carbon         Liquid")
writeLines(text, "test.txt")
read_table_with_meta("test.txt", header=T)
#   ChemID BasedMaterial  State
# 1 MSO11D        Oxygen    Gas
# 2 GSX55E        Carbon Liquid

### simple case
text=c("##ID_value=1829", "##exportDate=1-18-2019 ", "ChemID    BasedMaterial  State", "MSO11D    Oxygen         Gas", "GSX55E    Carbon         Liquid")
writeLines(text, "test.txt")
read_table_with_meta("test.txt", header = TRUE)
#   ChemID BasedMaterial  State ID_value exportDate
# 1 MSO11D        Oxygen    Gas     1829 1-18-2019 
# 2 GSX55E        Carbon Liquid     1829 1-18-2019 

### unnamed meta
text=c("##ID_value=1829", "##exportDate=1-18-2019 ", "##somethingelse", "ChemID    BasedMaterial  State", "MSO11D    Oxygen         Gas", "GSX55E    Carbon         Liquid")
writeLines(text, "test.txt")
read_table_with_meta("test.txt", header = TRUE)
#   ChemID BasedMaterial  State ID_value exportDate         meta1
# 1 MSO11D        Oxygen    Gas     1829 1-18-2019  somethingelse
# 2 GSX55E        Carbon Liquid     1829 1-18-2019  somethingelse

### multiple unnamed meta
text=c("##ID_value=1829", "##exportDate=1-18-2019 ", "##somethingelse", "##key=val", "##more", "ChemID    BasedMaterial  State", "MSO11D    Oxygen         Gas", "GSX55E    Carbon         Liquid")
writeLines(text, "test.txt")
read_table_with_meta("test.txt", header = TRUE)
#   ChemID BasedMaterial  State ID_value exportDate         meta1 key meta2
# 1 MSO11D        Oxygen    Gas     1829 1-18-2019  somethingelse val  more
# 2 GSX55E        Carbon Liquid     1829 1-18-2019  somethingelse val  more

Extract values from comments (#) in R

Answers (2)

Related Questions