Dong
Dong

Reputation: 491

read rectangular data blocks with separate tags as new columns

I feel my situation is a typical use case in experiments where the data are logged as text file for human understanding, but not for machine consumption. Tags are interspersed with the actual data to describe the data that follows. For data analysis, the tags need to integrated with the data rows to be useful. Below is a made-up example.

TAG1, t1_1

DATA_A, 5, 3, 4, 8
DATA_A, 3, 4, 5, 7

TAG1, t1_2
TAG2, t2_1

DATA_B, 1, 2, 3, 4, 5

DATA_A, 1, 2, 3, 4

The desired parse results should be two data frames. One for DATA_A,

X1, X2, X3, X4, TAG1, TAG2
5, 3, 4, 8, t1_1, NA
3, 4, 5, 7, t1_1, NA
1, 2, 3, 4, t1_2, t2_1

and one for DATA_B

X1, X2, X3, X4, X5, TAG1, TAG2
1, 2, 3, 4, 5, t1_2, t2_1

The current method (implemented in Python) check the file line by line. If it starts with "T", then the corresponding tag variable is updated; if it starts with "DATA", then the tag values are appended to the end of the "DATA" line, and the now completed line is appended to the corresponding CSV file. In the end, the CSV files are read into data frames for data analysis.

I wonder if this data import can be done faster in one step. What I have in mind is


library(tidyverse)

text_frame <- read_lines(clipboard(), skip_empty_rows = TRUE) %>% 
  enframe(name = NULL, value = "line") 

text_frame %>% 
  separate(line, into = c("ID", "value"), extra = "merge", sep = ", ") 

which produces

# A tibble: 7 x 2
  ID     value        
  <chr>  <chr>        
1 TAG1   t1_1         
2 DATA_A 5, 3, 4, 8   
3 DATA_A 3, 4, 5, 7   
4 TAG1   t1_2         
5 TAG2   t2_1         
6 DATA_B 1, 2, 3, 4, 5
7 DATA_A 1, 2, 3, 4  

The next step is to create new column "TAG1" and "TAG2" with the value added to the row. This is where I got stuck. It is like gather for individual rows. How could I do it? Is the general approach reasonable? Any suggestions?

Fast/memory efficient solutions are welcome since the I need to deal with hundreds of ~10MB text files (they do have the same structure).

Upvotes: 0

Views: 38

Answers (1)

IceCreamToucan
IceCreamToucan

Reputation: 28675

Using the input data

text <- '
TAG1, t1_1

DATA_A, 5, 3, 4, 8
DATA_A, 3, 4, 5, 7

TAG1, t1_2
TAG2, t2_1

DATA_B, 1, 2, 3, 4, 5

DATA_A, 1, 2, 3, 4
'

You can get the tags from the second column of the imported data V2 by selecting the elements of V2 where the first column V1 is TAG[1|2], and do this for each group. Groups are identified by a variable starting at 0 and incrementing by 1 after each occurrence of [V1 contains TAG then V1 doesn't contain TAG].

Then with tags as their own columns you can remove the TAG rows, and split the data according to whether the first column contains 'B'

library(data.table)

df <- fread(text, fill = T, blank.lines.skip = T)

df[, `:=`(TAG1 = V2[V1 == 'TAG1'],
          TAG2 = V2[V1 == 'TAG2']),
   by = .(g = (rleid(grepl('TAG', V1)) - 1) %/% 2)]

df <- df[-grep('TAG', V1)] 

split(df, df[, grepl('B', V1)])

# $`FALSE`
#        V1 V2 V3 V4 V5 V6 TAG1 TAG2
# 1: DATA_A  5  3  4  8 NA t1_1 <NA>
# 2: DATA_A  3  4  5  7 NA t1_1 <NA>
# 3: DATA_A  1  2  3  4 NA t1_2 t2_1
# 
# $`TRUE`
#        V1 V2 V3 V4 V5 V6 TAG1 TAG2
# 1: DATA_B  1  2  3  4  5 t1_2 t2_1

If you don't always 2 tags and may have more or less, you can replace the step after fread above with

n_tags <- df[, as.numeric(gsub('[^0-9]', '', max(grep('TAG', V1, value = T))))]
df[, g := (rleid(grepl('TAG', V1)) - 1) %/% 2]
for(i in seq_len(n_tags))
  df[, paste0('TAG', i) := V2[V1 == paste0('TAG', i)], g]

Upvotes: 1

Related Questions