ℕʘʘḆḽḘ

Reputation: 19375

How to parse a text file that contains the column names at the beginning of the file?

My text file looks like the following:

"
file1
cols=
col1
col2
# this is a comment
col3

data
a,b,c
d,e,f
"

As you can see, the data starts only after the "data" tag, and the lines before it give the column names. There can also be comment lines, so the number of lines before the "data" tag varies.

How can I parse that in R? Possibly with some tidy tools? Expected output is:

# A tibble: 2 x 3
  col1  col2  col3 
  <chr> <chr> <chr>
1 a     b     c    
2 d     e     f  

Thanks!

Upvotes: 1

Views: 895

Answers (2)

Darren Tsai

Reputation: 35554

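Here is a base R approach with scan(). comment.char = "#" drops everything from a # to the end of the line, and strip.white = TRUE trims leading and trailing whitespace from each line. Blank lines, including lines left empty once a comment is removed, are skipped by default (blank.lines.skip = TRUE).

text <- scan("test.txt", what = "", sep = "\n", strip.white = TRUE, comment.char = "#")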
text
# [1] "file1" "cols=" "col1"  "col2"  "col3"  "data"  "a,b,c" "d,e,f"

ind1 <- which(text == "cols=")
ind2 <- which(text == "data")
df <- read.table(text = paste(text[-seq(ind2)], collapse = "\n"),
                 sep = ",", col.names = text[(ind1 + 1):(ind2 - 1)])

df
#   col1 col2 col3
# 1    a    b    c
# 2    d    e    f
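
Since the question asks about tidy tools, the same idea can be sketched with readr to get a tibble back. This is only a minimal sketch: it assumes readr is installed, that the file contains exactly one "cols=" line and one "data" line, and that every column should be read as character. The function name read_tagged_tidy is hypothetical, not part of any package.

```r
library(readr)

# Hypothetical helper: read a file whose column names are listed
# between a "cols=" line and a "data" line.
read_tagged_tidy <- function(path) {
  lines <- trimws(read_lines(path))

  # Locate the "data" tag; everything before it is header material.
  data_at <- match("data", lines)
  if (is.na(data_at)) stop('no "data" line found in ', path)
  header <- lines[seq_len(data_at - 1)]
  body <- lines[-seq_len(data_at)]

  # Column names are the header lines after "cols=",
  # minus blank lines and comment lines.
  cols_at <- match("cols=", header)
  if (is.na(cols_at)) stop('no "cols=" line found in ', path)
  col_names <- header[-seq_len(cols_at)]
  col_names <- col_names[nzchar(col_names) & !startsWith(col_names, "#")]

  # Pass the data rows to read_csv() as literal text; the trailing
  # newline keeps a single-row body from being mistaken for a path.
  read_csv(
    I(paste0(paste(body, collapse = "\n"), "\n")),
    col_names = col_names,
    col_types = cols(.default = col_character())
  )
}

# Usage, assuming the file from the question is saved as test.txt:
# read_tagged_tidy("test.txt")
```

Splitting the lines at the "data" tag before filtering means a data row that happens to start with # is left alone; only the header part is cleaned.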

Upvotes: 3

bouncyball

Reputation: 10761

I saved your file as ex_text.txt on my machine, removing the opening and closing quotes. Here's a solution. I don't know how well it generalizes, and it might not work for "weirder" data.

# initialize
possible_names <- character(0)
not_data <- TRUE # stop when we find "data"
n <- 20 # number of lines to read from the txt file

while (not_data) {
  # reread the first n lines of the file
  possible_names <- readLines("ex_text.txt", n = n)
  not_data <- all(possible_names != "data") # found "data" yet?
  # stop at the end of the file instead of looping forever
  if (not_data && length(possible_names) < n) stop('no "data" line found')
  n <- n + 20 # read more lines on the next pass if necessary
}
# where does the data start?
data_start <- which(possible_names == "data")[1]
# the column names sit between "cols=" and "data"
cols_start <- which(possible_names == "cols=")[1]
possible_names <- possible_names[(cols_start + 1):(data_start - 1)]
possible_names <- possible_names[nzchar(possible_names)] # remove blank lines
col_names <- possible_names[!grepl("^#", possible_names)] # remove comment lines
# read data
read.delim("ex_text.txt", 
           skip = data_start, 
           sep = ",",
           col.names = col_names,
           header = FALSE)

#   col1 col2 col3
# 1    a    b    c
# 2    d    e    f
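
The loop above rereads the file from the start on every pass. If the file fits comfortably in memory, one readLines() call can replace it. The following is a rough sketch under that assumption, using base R only; read_tagged_base is a hypothetical name, and it assumes a single "cols=" line, a single "data" line, and comma-separated character data.

```r
# Hypothetical helper: read the whole file once, then split it
# into header lines and data lines at the "data" tag.
read_tagged_base <- function(path) {
  lines <- trimws(readLines(path))

  data_start <- match("data", lines)
  if (is.na(data_start)) stop('no "data" line found in ', path)
  header <- lines[seq_len(data_start - 1)]

  cols_start <- match("cols=", header)
  if (is.na(cols_start)) stop('no "cols=" line found in ', path)

  # Keep the lines after "cols=", dropping blanks and comments.
  col_names <- header[-seq_len(cols_start)]
  col_names <- col_names[nzchar(col_names) & !startsWith(col_names, "#")]

  # Parse the remaining lines directly from memory.
  read.table(
    text = lines[-seq_len(data_start)],
    sep = ",",
    header = FALSE,
    col.names = col_names,
    colClasses = "character"
  )
}

# Usage, assuming the file is saved as ex_text.txt:
# read_tagged_base("ex_text.txt")
```

For very large files the chunked loop may still be preferable, since this version holds every line in memory before parsing.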

Upvotes: 2
