Reputation: 491
I feel my situation is a typical use case in experiments where the data are logged as text file for human understanding, but not for machine consumption. Tags are interspersed with the actual data to describe the data that follows. For data analysis, the tags need to integrated with the data rows to be useful. Below is a made-up example.
TAG1, t1_1 DATA_A, 5, 3, 4, 8 DATA_A, 3, 4, 5, 7 TAG1, t1_2 TAG2, t2_1 DATA_B, 1, 2, 3, 4, 5 DATA_A, 1, 2, 3, 4
The desired parse results should be two data frames. One for DATA_A,
X1, X2, X3, X4, TAG1, TAG2 5, 3, 4, 8, t1_1, NA 3, 4, 5, 7, t1_1, NA 1, 2, 3, 4, t1_2, t2_1
and one for DATA_B
X1, X2, X3, X4, X5, TAG1, TAG2 1, 2, 3, 4, 5, t1_2, t2_1
The current method (implemented in Python) check the file line by line. If it starts with "T", then the corresponding tag variable is updated; if it starts with "DATA", then the tag values are appended to the end of the "DATA" line, and the now completed line is appended to the corresponding CSV file. In the end, the CSV files are read into data frames for data analysis.
I wonder if this data import can be done faster in one step. What I have in mind is
library(tidyverse)
text_frame <- read_lines(clipboard(), skip_empty_rows = TRUE) %>%
enframe(name = NULL, value = "line")
text_frame %>%
separate(line, into = c("ID", "value"), extra = "merge", sep = ", ")
which produces
# A tibble: 7 x 2
ID value
<chr> <chr>
1 TAG1 t1_1
2 DATA_A 5, 3, 4, 8
3 DATA_A 3, 4, 5, 7
4 TAG1 t1_2
5 TAG2 t2_1
6 DATA_B 1, 2, 3, 4, 5
7 DATA_A 1, 2, 3, 4
The next step is to create new column "TAG1" and "TAG2" with the value added to the row. This is where I got stuck. It is like gather
for individual rows. How could I do it? Is the general approach reasonable? Any suggestions?
Fast/memory efficient solutions are welcome since the I need to deal with hundreds of ~10MB text files (they do have the same structure).
Upvotes: 0
Views: 38
Reputation: 28675
Using the input data
text <- '
TAG1, t1_1
DATA_A, 5, 3, 4, 8
DATA_A, 3, 4, 5, 7
TAG1, t1_2
TAG2, t2_1
DATA_B, 1, 2, 3, 4, 5
DATA_A, 1, 2, 3, 4
'
You can get the tags from the second column of the imported data V2
by selecting the elements of V2
where the first column V1
is TAG[1|2], and do this for each group. Groups are identified by a variable starting at 0 and incrementing by 1 after each occurrence of [V1 contains TAG then V1 doesn't contain TAG].
Then with tags as their own columns you can remove the TAG rows, and split the data according to whether the first column contains 'B'
library(data.table)
df <- fread(text, fill = T, blank.lines.skip = T)
df[, `:=`(TAG1 = V2[V1 == 'TAG1'],
TAG2 = V2[V1 == 'TAG2']),
by = .(g = (rleid(grepl('TAG', V1)) - 1) %/% 2)]
df <- df[-grep('TAG', V1)]
split(df, df[, grepl('B', V1)])
# $`FALSE`
# V1 V2 V3 V4 V5 V6 TAG1 TAG2
# 1: DATA_A 5 3 4 8 NA t1_1 <NA>
# 2: DATA_A 3 4 5 7 NA t1_1 <NA>
# 3: DATA_A 1 2 3 4 NA t1_2 t2_1
#
# $`TRUE`
# V1 V2 V3 V4 V5 V6 TAG1 TAG2
# 1: DATA_B 1 2 3 4 5 t1_2 t2_1
If you don't always 2 tags and may have more or less, you can replace the step after fread
above with
n_tags <- df[, as.numeric(gsub('[^0-9]', '', max(grep('TAG', V1, value = T))))]
df[, g := (rleid(grepl('TAG', V1)) - 1) %/% 2]
for(i in seq_len(n_tags))
df[, paste0('TAG', i) := V2[V1 == paste0('TAG', i)], g]
Upvotes: 1