Extract data from the file - Ignore leading comment lines, but include later lines starting with #

Question

I have a text file that always has a header line (which does not start with "#"). There may be some lines before the header, each starting with "#". There can also be lines within the data that start with "#".

I need to identify the "#" lines that come before the header and skip them when reading the file.

data

#version 2.4    
##  
## Oncotator v1.0.0.0rc16| Gaf 3.0 | UniProt_AAxform 2011_09    
## OxoG Filter v3   
Hugo_Symbol Entrez_Gene_Id
BAGE1   0
BAGE1   0
#errt      23
RTRRT      23

Here I want to skip the first 4 lines and read the file with its header. I tried:

dum.data<-readLines(filename)
top<-"^#"
if(grepl(top,dum.data[1])){
    ret <- grep(top,dum.data)
}

But this matches every "#" line. I only need to identify the "#" lines (if any) before the header, not the ones in between the data.

Matthew Lundberg · Accepted Answer

Check for leading comment lines by using rle and diff. Remove only the first group, and only if it precedes any non-comment lines:

r <- rle(diff(grep('^#', dum.data)))
dum.data <- if (length(r$values) && r$values[1] == 1) tail(dum.data, -(r$lengths[1]+1)) else dum.data
dum.data
## [1] "Hugo_Symbol Entrez_Gene_Id"
## [2] "BAGE1   0"                 
## [3] "BAGE1   0"                 
## [4] "#errt      23"             
## [5] "RTRRT      23"
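
To see why this works, it may help to look at the intermediate values for the sample data. The sketch below rebuilds the sample lines by hand and prints each step. It also includes a hypothetical helper, strip_leading_comments, which is not part of the answer above; it is one possible alternative that drops everything before the first non-comment line, and it covers cases the rle approach appears to miss (a single leading comment line, or a file whose first comment run does not start at line 1).

```r
# Sample lines from the question, entered by hand
dum.data <- c("#version 2.4",
              "##",
              "## Oncotator v1.0.0.0rc16| Gaf 3.0 | UniProt_AAxform 2011_09",
              "## OxoG Filter v3",
              "Hugo_Symbol Entrez_Gene_Id",
              "BAGE1   0",
              "BAGE1   0",
              "#errt      23",
              "RTRRT      23")

idx  <- grep('^#', dum.data)  # positions of all comment lines: 1 2 3 4 8
gaps <- diff(idx)             # distance between consecutive comment lines: 1 1 1 4
r    <- rle(gaps)             # runs of equal gaps: lengths 3 1, values 1 4
# A first run of gap 1 with length 3 means 4 consecutive comment lines,
# hence the r$lengths[1] + 1 in the answer's tail() call.

# Hypothetical alternative (an assumption, not the answer's method):
# keep everything from the first non-comment line onward.
strip_leading_comments <- function(lines) {
  is_comment <- grepl('^#', lines)
  if (all(is_comment)) return(character(0))  # nothing but comments (or empty input)
  first_data <- which(!is_comment)[1]
  lines[first_data:length(lines)]
}

trimmed <- strip_leading_comments(dum.data)
```

For the sample data both approaches give the same five lines. They differ when, for example, the file has exactly one leading comment line: the gaps then contain no initial run of 1, so the rle version leaves the input untouched.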

Then use the trimmed dum.data to initialize a textConnection and read the table.
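
A minimal sketch of that last step, assuming the columns are whitespace-separated as in the sample. The trimmed lines are typed in by hand here so the example runs on its own. Note comment.char = "": read.table treats "#" as a comment character by default, so without this setting the "#errt" row inside the data would be dropped.

```r
# Trimmed lines, as produced by the step above (entered by hand for this example)
dum.data <- c("Hugo_Symbol Entrez_Gene_Id",
              "BAGE1   0",
              "BAGE1   0",
              "#errt      23",
              "RTRRT      23")

con <- textConnection(dum.data)
dat <- read.table(con,
                  header = TRUE,             # first remaining line is the header
                  comment.char = "",         # keep data rows that start with "#"
                  stringsAsFactors = FALSE)  # keep Hugo_Symbol as character
close(con)

dat
```

If the real file is tab-separated, sep = "\t" would be the natural addition; that is a guess about the file, since the question does not say.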
