read rectangular data blocks with separate tags as new columns

Question

I feel my situation is a typical use case in experiments where the data are logged as text files for human understanding, but not for machine consumption. Tags are interspersed with the actual data to describe the data that follows. For data analysis, the tags need to be integrated with the data rows to be useful. Below is a made-up example.

TAG1, t1_1

DATA_A, 5, 3, 4, 8
DATA_A, 3, 4, 5, 7

TAG1, t1_2
TAG2, t2_1

DATA_B, 1, 2, 3, 4, 5

DATA_A, 1, 2, 3, 4

The desired parse result is two data frames, one for DATA_A,

X1, X2, X3, X4, TAG1, TAG2
5, 3, 4, 8, t1_1, NA
3, 4, 5, 7, t1_1, NA
1, 2, 3, 4, t1_2, t2_1

and one for DATA_B

X1, X2, X3, X4, X5, TAG1, TAG2
1, 2, 3, 4, 5, t1_2, t2_1

The current method (implemented in Python) checks the file line by line. If a line starts with "T", the corresponding tag variable is updated; if it starts with "DATA", the tag values are appended to the end of the "DATA" line, and the completed line is appended to the corresponding CSV file. In the end, the CSV files are read into data frames for data analysis.
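For reference, the line-by-line logic can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function name `parse_tagged_lines` and the `tag_names` parameter are hypothetical, it assumes the tag names are known up front, and it collects the rows in memory instead of appending them to CSV files.

```python
from collections import defaultdict


def parse_tagged_lines(lines, tag_names=("TAG1", "TAG2")):
    """Group DATA rows by their ID and append the tag values in effect at that point.

    Hypothetical helper: assumes comma-separated fields and a known set of tag names.
    """
    tags = {name: None for name in tag_names}  # None plays the role of NA
    blocks = defaultdict(list)
    for raw in lines:
        fields = [f.strip() for f in raw.split(",")]
        key = fields[0]
        if not key:
            continue  # skip blank lines
        if key in tags:
            # A tag line updates the running value for that tag.
            tags[key] = fields[1]
        elif key.startswith("DATA"):
            # A data line gets the current tag values appended as extra columns.
            blocks[key].append(fields[1:] + [tags[name] for name in tag_names])
    return dict(blocks)


sample = """TAG1, t1_1

DATA_A, 5, 3, 4, 8
DATA_A, 3, 4, 5, 7

TAG1, t1_2
TAG2, t2_1

DATA_B, 1, 2, 3, 4, 5

DATA_A, 1, 2, 3, 4
""".splitlines()

result = parse_tagged_lines(sample)
for data_id, rows in result.items():
    print(data_id, rows)
```

Each key of `result` corresponds to one of the CSV files in the real workflow. The values are kept as strings here, so type conversion would still happen when the rows are turned into data frames.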

I wonder if this data import can be done faster in one step. What I have in mind is


library(tidyverse)

text_frame <- read_lines(clipboard(), skip_empty_rows = TRUE) %>% 
  enframe(name = NULL, value = "line") 

text_frame %>% 
  separate(line, into = c("ID", "value"), extra = "merge", sep = ", ")

which produces

# A tibble: 7 x 2
  ID     value        
  <chr>  <chr>        
1 TAG1   t1_1         
2 DATA_A 5, 3, 4, 8   
3 DATA_A 3, 4, 5, 7   
4 TAG1   t1_2         
5 TAG2   t2_1         
6 DATA_B 1, 2, 3, 4, 5
7 DATA_A 1, 2, 3, 4

The next step is to create new columns "TAG1" and "TAG2" holding the tag values that apply to each data row. This is where I got stuck. It is like gather for individual rows. How could I do it? Is the general approach reasonable? Any suggestions?

Fast/memory-efficient solutions are welcome since I need to deal with hundreds of ~10 MB text files (they all have the same structure).
