Reputation: 19375
My text file looks like the following
"
file1
cols=
col1
col2
# this is a comment
col3
data
a,b,c
d,e,f
"
As you can see, the actual data only starts after the "data" tag, and the rows before that essentially tell me what the column names are. There could be some comments, which means the number of rows before the "data" tag is variable.
How can I parse that in R? Possibly with some tidy tools?
Expected output is:
# A tibble: 2 x 3
  col1  col2  col3
  <chr> <chr> <chr>
1 a     b     c
2 d     e     f
Thanks!
Upvotes: 1
Views: 895
Reputation: 35554
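Here is a base R way with scan(). strip.white = TRUE trims leading and trailing whitespace from each line (blank lines are skipped by default), and comment.char = "#" drops everything from a # onward, so comment-only lines disappear.
text <- scan("test.txt", "", sep = "\n", strip.white = TRUE, comment.char = "#")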
text
# [1] "file1" "cols=" "col1" "col2" "col3" "data" "a,b,c" "d,e,f"
ind1 <- which(text == "cols=")
ind2 <- which(text == "data")
df <- read.table(text = paste(text[-seq(ind2)], collapse = "\n"),
                 sep = ",", col.names = text[(ind1 + 1):(ind2 - 1)])
df
#   col1 col2 col3
# 1    a    b    c
# 2    d    e    f
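The same steps can be wrapped into a small helper so they are easy to reuse. This is only a sketch: the function name `read_tagged` is made up for illustration, it assumes the file contains exactly one `cols=` line followed later by exactly one `data` line, and it writes the question's sample to a temporary file so that the block runs on its own. `colClasses = "character"` is an addition to match the `<chr>` columns in the expected output.

```r
# Hypothetical helper (not part of the answer above): parse a file whose
# column names sit between a "cols=" line and a "data" line.
read_tagged <- function(path) {
  # One string per line; trim whitespace and drop "#" comments
  lines <- scan(path, what = "", sep = "\n", strip.white = TRUE,
                comment.char = "#", quiet = TRUE)
  ind1 <- which(lines == "cols=")
  ind2 <- which(lines == "data")
  # Assumption: exactly one of each tag, with at least one name in between
  stopifnot(length(ind1) == 1, length(ind2) == 1, ind2 - ind1 > 1)
  read.table(text = paste(lines[-seq_len(ind2)], collapse = "\n"),
             sep = ",", col.names = lines[(ind1 + 1):(ind2 - 1)],
             colClasses = "character")
}

# Sample input from the question, written to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("file1", "cols=", "col1", "col2", "# this is a comment",
             "col3", "data", "a,b,c", "d,e,f"), tmp)
df <- read_tagged(tmp)
str(df)
```

The `stopifnot()` check makes the layout assumption explicit, so a file without the two tags fails early instead of producing a confusing indexing error.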
Upvotes: 3
Reputation: 10761
I saved your file as ex_text.txt on my machine, removing the start and end quotes. Here's a solution. I don't know how extendable this is, and it might not work for "weirder" data.
# initialize
possible_names <- c()
not_data <- TRUE # stop when we find "data"
n <- 20 # lines to check the txt file
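while (not_data) {
  # re-read the first n lines of the file
  possible_names <- readLines("ex_text.txt", n = n)
  not_data <- all(possible_names != "data") # found "data" yet?
  # stop instead of looping forever if the whole file was read without finding "data"
  if (not_data && length(possible_names) < n) stop('no "data" line found')
  n <- n + 20 # increment to read more lines if necessary
}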
# where does the data start?
data_start <- which(possible_names == "data")
# keep only the lines between "cols=" and "data"; these hold the column names
cols_start <- which(possible_names == "cols=")
possible_names <- possible_names[(cols_start + 1):(data_start - 1)]
possible_names <- possible_names["" != possible_names] # remove any blank lines
col_names <- possible_names[!grepl("^\\s*#", possible_names)] # remove comment lines
# read data
read.delim("ex_text.txt",
           skip = data_start,
           sep = ",",
           col.names = col_names,
           header = FALSE)
#   col1 col2 col3
# 1    a    b    c
# 2    d    e    f
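If the file is small enough to read in one go, the growing `n` loop is not needed. Below is a minimal sketch of that variant under the same layout assumptions (one `cols=` line, then one `data` line). The name `read_after_data` is hypothetical, and the sample is written to a temporary file so the block is self-contained. The final conversion uses `tibble::as_tibble()` only if the tibble package happens to be installed.

```r
# Hypothetical single-pass variant (not part of the answer above)
read_after_data <- function(path) {
  lines <- trimws(readLines(path))
  cols_start <- match("cols=", lines)  # first "cols=" line
  data_start <- match("data", lines)   # first "data" line
  # Assumption: both tags exist, with at least one line between them
  stopifnot(!is.na(cols_start), !is.na(data_start),
            data_start - cols_start > 1)
  header <- lines[(cols_start + 1):(data_start - 1)]
  # Drop blank lines and comment lines from the header section
  col_names <- header[header != "" & !grepl("^#", header)]
  # Skip everything up to and including the "data" line
  read.csv(path, skip = data_start, header = FALSE,
           col.names = col_names, colClasses = "character")
}

# Sample input from the question, written to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("file1", "cols=", "col1", "col2", "# this is a comment",
             "col3", "data", "a,b,c", "d,e,f"), tmp)
res <- read_after_data(tmp)
# Optional: return a tibble, as the question asks, when tibble is available
if (requireNamespace("tibble", quietly = TRUE)) res <- tibble::as_tibble(res)
print(res)
```

Reading all lines once keeps the logic short, at the cost of loading the header and data into memory together, which may matter for very large files.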
Upvotes: 2