Reputation: 471
thanks in advance for all this
So, I have a data frame which comes from a .tsv similar to this:
##ID_value=1829
##exportDate=1-18-2019
ChemID BasedMaterial State
MSO11D Oxygen Gas
GSX55E Carbon Liquid
Right ? So, what I simply want to do is to add a new column named ID filled with the values from the ID_value from the comments in ##, so as to get something like this:
ID ChemID BasedMaterial State
1829 MSO11D Oxygen Gas
1829 GSX55E Carbon Liquid
The thing is, when I import from .tsv, I loose all the comment values, which is good, I don't actually want them in my output file (an excel table). But by doing so, I also loose that information, which would be useful for the exposed purpose.
So, is there a way I could use the values from the comments in ## so as to create said column and also get rid of these comments for the creation of a table ? Thanks very much in advance
Upvotes: 1
Views: 430
Reputation: 160437
While StupidWolf's answer works, I think usurping the proven read.table
's file operation in lieu of parsing text is generally not a good idea: as the file gets larger, this pays a penalty (anecdotally, 20% increase at 100k lines, more for larger).
If the pattern is known to be at the top, then read the top few lines, find the relevant portion, then call read.table
on the original file (with the original arguments).
#' @param file 'character', the name of the file which the data are to be read from
#' @param ... other arguments passed to 'read.table'
#' @param meta_char 'character', the string (or pattern) that indicates a 'key=val' or 'note'
#' @param meta_rows 'integer', maximum number of rows to look for meta
#' @param meta_unnamed 'character', used for column-header of meta when no '=' is found
#' @param meta_skip_more 'integer', number of lines beyond the meta rows to skip for real data
#' @return 'data.frame', with any meta data augmented as columns
read_table_with_meta <- function(file, ...,
meta_char = "#", meta_rows = 10L, meta_unnamed = "meta",
meta_skip_more = 0L) {
toplines <- readLines(file, n = meta_rows)
meta_ptn <- paste0("^", meta_char)
dots <- list(...)
if ("skip" %in% names(dots)) {
warning("'skip' is determined by 'read_table_with_meta' and should not be assigned; if you need to skip more rows after meta rows, then use 'meta_skip_more'; 'skip=' is ignored here")
dots$skip <- NULL
}
if (all(grepl(meta_ptn, toplines))) {
stop("all lines looked like header rows, suggest you increase 'meta_rows'")
}
toplines <- toplines[ grepl(meta_ptn, toplines) ]
skip <- length(toplines) + meta_skip_more
toplines <- gsub(paste0("^", meta_char, "+\\s*"), "", toplines)
if (length(toplines)) {
keys <- gsub("\\s*=.*", "", toplines)
vals <- gsub("^[^=]*\\s*=\\s*", "", toplines)
unnamed <- (keys == vals)
keys[unnamed] <- paste0(meta_unnamed, seq_along(keys[unnamed]))
keyvals <- setNames(as.list(vals), keys)
} else keyvals <- NULL
dat <- do.call("read.table", c(list(file, skip = skip), dots))
if (is.null(keyvals)) dat else cbind(dat, keyvals)
}
Notes:
this only searches the top 10 lines (by default), thinking that you should not try to parse the entire file once you find a non-commented-out line; if you have comments mid-file, then this answer is insufficient;
this function assigns all of these rows to fields; this may not be the most generic way of handling this, but I think it addresses what you were asking for; once read in, you can discard fields you don't want;
the unnamed
portion is in case not all commented-out headers have an =
in them; just a trick, not certain it's necessary or useful to you.
Demonstration:
### safe with no-meta files
text=c("ChemID BasedMaterial State", "MSO11D Oxygen Gas", "GSX55E Carbon Liquid")
writeLines(text, "test.txt")
read_table_with_meta("test.txt", header=T)
# ChemID BasedMaterial State
# 1 MSO11D Oxygen Gas
# 2 GSX55E Carbon Liquid
### simple case
text=c("##ID_value=1829", "##exportDate=1-18-2019 ", "ChemID BasedMaterial State", "MSO11D Oxygen Gas", "GSX55E Carbon Liquid")
writeLines(text, "test.txt")
read_table_with_meta("test.txt", header = TRUE)
# ChemID BasedMaterial State ID_value exportDate
# 1 MSO11D Oxygen Gas 1829 1-18-2019
# 2 GSX55E Carbon Liquid 1829 1-18-2019
### unnamed meta
text=c("##ID_value=1829", "##exportDate=1-18-2019 ", "##somethingelse", "ChemID BasedMaterial State", "MSO11D Oxygen Gas", "GSX55E Carbon Liquid")
writeLines(text, "test.txt")
read_table_with_meta("test.txt", header = TRUE)
# ChemID BasedMaterial State ID_value exportDate meta1
# 1 MSO11D Oxygen Gas 1829 1-18-2019 somethingelse
# 2 GSX55E Carbon Liquid 1829 1-18-2019 somethingelse
### multiple unnamed meta
text=c("##ID_value=1829", "##exportDate=1-18-2019 ", "##somethingelse", "##key=val", "##more", "ChemID BasedMaterial State", "MSO11D Oxygen Gas", "GSX55E Carbon Liquid")
writeLines(text, "test.txt")
read_table_with_meta("test.txt", header = TRUE)
# ChemID BasedMaterial State ID_value exportDate meta1 key meta2
# 1 MSO11D Oxygen Gas 1829 1-18-2019 somethingelse val more
# 2 GSX55E Carbon Liquid 1829 1-18-2019 somethingelse val more
Upvotes: 3
Reputation: 46898
You can try the function below, explanations are in comments.
func = function(FILE,COMMENTCHAR,VALUE){
allLines = readLines(FILE)
#exclude lines with comments
# and make table
tab = read.table(text=allLines[!grepl(COMMENTCHAR,allLines)],header=TRUE)
#find the line which has the value in comments
value = allLines[grepl(VALUE,allLines) & grepl(COMMENTCHAR,allLines)]
# we split to get the name and value
value = unlist(strsplit(gsub("#","",value),"="))
df = data.frame(value[2],tab)
colnames(df)[1] = value[1]
return(df)
}
The main idea is to use readLines to get everything in. We convert lines without comments into table. And from the lines with comment, we search for the value you want and put it as the first column. And we try it on your text file:
text=c("##ID_value=1829", "##exportDate=1-18-2019 ", "ChemID BasedMaterial State",
"MSO11D Oxygen Gas", "GSX55E Carbon Liquid"
)
writeLines(text,"test.txt")
func("test.txt","#","ID")
ID_value ChemID BasedMaterial State
1 1829 MSO11D Oxygen Gas
2 1829 GSX55E Carbon Liquid
Upvotes: 1