Reputation: 82
I have a data file that follows roughly this format:
HEADER:001,v1,v2,v3...,v10
v1,v2,v3,STATUS,v5...v6
.
.
.
HEADER:006,v1,v2,v3...v10
HEADER:012,v1,v2,v3...v10
v1,v2,v3,STATUS,v5...v6
v1,v2,v3,STATUS,v5...v6
.
.
.
etc
where each block or chunk of data leads off with a comma separated line that includes the header and a unique (not necessarily sequential) number, and then there may be 0 or more lines that are identified by the STATUS keyword in the body of the chunk.
I am reading this block in using readLines
and then splitting it into header lines and status lines to be read in as CSV separately, since they have a different number of variables:
datablocks <- readLines(filename, skipNul = T)
headers <- datablocks[grepl("HEADER", datablocks, useBytes = T)]
headers <- read.csv(text=headers, header= F, stringsAsFactors = F)
statuses <- datablocks[grepl("STATUS", datablocks, useBytes = T)]
statuses <- read.csv(text=statuses, header= F, stringsAsFactors = F)
Eventually, I would like to inner join this data, so that the variables from the header are included in each status line:
all <- headers %>% inner_join(statuses, by = c("ID" = "ID"))
But I need a way to add the unique ID of the header to each status line below it, until the next header. The only way I can think of doing this is with a for loop that runs over the initial full text datablock:
header_id <- NA
for(i in seq(1:length(datablocks))) {
is_header_line <- str_extract(datablocks[i], "HEADER:([^,]*)")
if(!is.na(is_header_line)) {
header_id <- is_header_line
}
datablocks[i] <- paste(datablocks[i], header_id, sep=",")
}
This works fine, but it's ugly, and not very... R-ish. I can't think of a way to vectorize this operation, since it needs to keep an external variable.
Am I missing something obvious here?
Edit
If the input looks literally like this
HEADER:001,a0,b0,c0,d0
e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0
HEADER:006,a1,b1,c1,d1
HEADER:012,a2,b2,c2,d2
e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1
e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2
The output should look like this:
e0,f0,g0,h0,i0,j0,k0,l0,m0,a0,b0,c0,d0,001
e1,f1,g1,h1,i1,j1,k1,l1,m1,a2,b2,c2,d2,012
e2,f2,g2,h2,i2,j2,k2,l2,m2,a2,b2,c2,d2,012
So there needs to be a column propagated from the parent (HEADER) to the children (STATUS) to inner join on.
Upvotes: 0
Views: 135
Reputation: 66945
EDIT: Thanks for the clarification. The specific input and output makes it dramatically easier to avoid misunderstandings.
Here I use tidyr::separate
to separate out the header label from the "a0,b0,c0,d0" part, and tidyr::fill
to propagate header info down into the following status rows.
library(tidyverse)
read_table(col_names = "text",
"HEADER:001,a0,b0,c0,d0
e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0
HEADER:006,a1,b1,c1,d1
HEADER:012,a2,b2,c2,d2
e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1
e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2") %>%
mutate(status_row = str_detect(text, "STATUS"),
header_row = str_detect(text, "HEADER"),
header = if_else(header_row, str_remove(text, "HEADER:"), NA_character_)) %>%
separate(header, c("header", "stub"), sep = ",", extra = "merge") %>%
fill(header, stub) %>%
filter(status_row) %>%
mutate(output = paste(str_remove(text, "STATUS,"), stub, header, sep = ",")) %>%
select(output)
Result
# A tibble: 3 x 1
output
<chr>
1 e0,f0,g0,h0,i0,j0,k0,l0,m0,a0,b0,c0,d0,001
2 e1,f1,g1,h1,i1,j1,k1,l1,m1,a2,b2,c2,d2,012
3 e2,f2,g2,h2,i2,j2,k2,l2,m2,a2,b2,c2,d2,012
Upvotes: 2