grosa

Reputation: 82

Clever way to avoid for loop in R

I have a data file that follows roughly this format:

HEADER:001,v1,v2,v3...,v10
v1,v2,v3,STATUS,v5...v6
.
.
.
HEADER:006,v1,v2,v3...v10
HEADER:012,v1,v2,v3...v10
v1,v2,v3,STATUS,v5...v6
v1,v2,v3,STATUS,v5...v6
.
.
.
etc

where each block of data starts with a comma-separated header line containing a unique (not necessarily sequential) number, followed by zero or more body lines identified by the STATUS keyword.

I read the file in with readLines and then split it into header lines and status lines, which are parsed as CSV separately because they have different numbers of variables:

datablocks <- readLines(filename, skipNul = TRUE)

# Header lines: one per block
headers <- datablocks[grepl("HEADER", datablocks, useBytes = TRUE)]
headers <- read.csv(text = headers, header = FALSE, stringsAsFactors = FALSE)

# Status lines: zero or more per block
statuses <- datablocks[grepl("STATUS", datablocks, useBytes = TRUE)]
statuses <- read.csv(text = statuses, header = FALSE, stringsAsFactors = FALSE)

Eventually, I would like to inner join this data, so that the variables from the header are included in each status line:

# Named "joined" rather than "all" to avoid masking base::all()
joined <- headers %>% inner_join(statuses, by = "ID")

But I need a way to add the unique ID of the header to each status line below it, until the next header. The only way I can think of doing this is with a for loop that runs over the full text in datablocks:

library(stringr)

header_id <- NA
for (i in seq_along(datablocks)) {
  # str_extract returns the whole match (e.g. "HEADER:001"), or NA on non-header lines
  is_header_line <- str_extract(datablocks[i], "HEADER:([^,]*)")
  if (!is.na(is_header_line)) {
    header_id <- is_header_line
  }
  # Append the most recent header ID to every line
  datablocks[i] <- paste(datablocks[i], header_id, sep = ",")
}

This works fine, but it's ugly and not very idiomatic R. I can't think of a way to vectorize this operation, since it needs to carry state from one line to the next.

Am I missing something obvious here?

Edit

If the input looks literally like this:

HEADER:001,a0,b0,c0,d0
e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0
HEADER:006,a1,b1,c1,d1
HEADER:012,a2,b2,c2,d2
e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1
e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2

The output should look like this:

e0,f0,g0,h0,i0,j0,k0,l0,m0,a0,b0,c0,d0,001
e1,f1,g1,h1,i1,j1,k1,l1,m1,a2,b2,c2,d2,012
e2,f2,g2,h2,i2,j2,k2,l2,m2,a2,b2,c2,d2,012

So there needs to be a column propagated from the parent (HEADER) to the children (STATUS) to inner join on.

Upvotes: 0

Views: 135

Answers (1)

Jon Spring

Reputation: 66945

EDIT: Thanks for the clarification. The specific input and output make it dramatically easier to avoid misunderstandings.

Here I use tidyr::separate to separate out the header label from the "a0,b0,c0,d0" part, and tidyr::fill to propagate header info down into the following status rows.

library(tidyverse)
read_table(col_names = "text",
         "HEADER:001,a0,b0,c0,d0
         e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0
         HEADER:006,a1,b1,c1,d1
         HEADER:012,a2,b2,c2,d2
         e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1
         e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2") %>%
  mutate(status_row = str_detect(text, "STATUS"),
         header_row = str_detect(text, "HEADER"),
         header = if_else(header_row, str_remove(text, "HEADER:"), NA_character_)) %>%
  separate(header, c("header", "stub"), sep = ",", extra = "merge") %>%
  fill(header, stub) %>%
  filter(status_row) %>%
  mutate(output = paste(str_remove(text, "STATUS,"), stub, header, sep = ",")) %>%
  select(output)

Result

# A tibble: 3 x 1
  output                                    
  <chr>                                     
1 e0,f0,g0,h0,i0,j0,k0,l0,m0,a0,b0,c0,d0,001
2 e1,f1,g1,h1,i1,j1,k1,l1,m1,a2,b2,c2,d2,012
3 e2,f2,g2,h2,i2,j2,k2,l2,m2,a2,b2,c2,d2,012
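
For comparison, here is a minimal base R sketch of the same "carry the last header forward" idea, which may be useful if you'd rather not depend on the tidyverse. It assumes the lines are already in a character vector (as readLines would return) and that header lines start with "HEADER:". The variable names (block, parent, header_stub, and so on) are just illustrative. The trick is that cumsum() over the header flags gives each line the index of the most recent header, so no loop or external variable is needed.

```r
# Sample input from the question, one element per line
# (in practice this would come from readLines()).
datablocks <- c(
  "HEADER:001,a0,b0,c0,d0",
  "e0,f0,g0,STATUS,h0,i0,j0,k0,l0,m0",
  "HEADER:006,a1,b1,c1,d1",
  "HEADER:012,a2,b2,c2,d2",
  "e1,f1,g1,STATUS,h1,i1,j1,k1,l1,m1",
  "e2,f2,g2,STATUS,h2,i2,j2,k2,l2,m2"
)

is_header <- startsWith(datablocks, "HEADER:")
is_status <- grepl("STATUS", datablocks, fixed = TRUE)

# cumsum() numbers the blocks: every line gets the count of headers seen
# so far, which is the index of its parent header (0 if none yet).
block <- cumsum(is_header)

# Look up the parent header line for each status line. The leading NA
# covers status lines that appear before any header (block == 0).
header_lines <- datablocks[is_header]
parent <- c(NA_character_, header_lines)[block[is_status] + 1]

# Split the parent header into its ID and the remaining fields.
header_id   <- sub("^HEADER:([^,]*),.*$", "\\1", parent)
header_stub <- sub("^HEADER:[^,]*,", "", parent)

# Drop the STATUS keyword and append the header fields and ID.
output <- paste(
  sub("STATUS,", "", datablocks[is_status], fixed = TRUE),
  header_stub, header_id,
  sep = ","
)
```

On the sample input, output should hold the same three strings as the tibble above. Headers with no status lines (006 here) simply never get looked up, which gives the inner-join behaviour for free.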

Upvotes: 2
