Separating column into single column and formating the entries

Question

I have a huge data table with two columns (id and info) and I want to separate/split the column info into single columns and format them. Here is an example, how my data table's entries look like:

id <- c(8750, 3048, 3593, 8475, 9921)
info <- c("[J/B][10,00/10,00][1,500/1,500][1,00]", "[J*/POP][0,00/0,00/0,00/0,00][2,210/2,210/2,210/2,210][1,50]", "S", "KEINE",
          "[Q*/B][5,00/5,00][1,500/1,500][0,70]")
dt.test <- data.table(id, info)
dt.test
#   id                                                         info
# 8750                        [J/B][10,00/10,00][1,500/1,500][1,00]
# 3048 [J*/POP][0,00/0,00/0,00/0,00][2,210/2,210/2,210/2,210][1,50]
# 3593                                                            S
# 8475                                                        KEINE
# 9921                         [Q*/B][5,00/5,00][1,500/1,500][0,70]

At the end I need a table, where the info column is separated into six single columns (id, type, pBu, pBo, zTBu, zTBo and zV) and formatting them as follows:

#   id   type pBu pBo zTBu zTBo  zV
# 8750    J/B  10  10 1.50 1.50 1.0
# 3048 J*/POP   0   0 2.21 2.21 1.5
# 3593      S  NA  NA   NA   NA  NA
# 8475  KEINE  NA  NA   NA   NA  NA
# 9921   Q*/B   5   5 1.50 1.50 0.7

Just a little explanation of my info column at the beginning and how the end data table should look:

ALL commas , in the info column have to be replaced by an dot . (see my end data table).
The inputs of the first brackets [] are defining my new type column, and furthermore, if the info column is filled with "S" or "KEINE", this also has to be an entry of type column.
The inputs of the second brackets [] are defining the new columns pBu (first element) and pBo (last element).
The inputs of the third brackets [] are defining new columns zTBu (first element) and zTBo (last element).
The input of the fourth brackets [] is defining the zV column.
When type column has entry "S" or "KEINE", then the remaining columns have to be filled with NAs.

B. Christian Kamgang · Accepted Answer

You code could combine the built-in function gsub and the function tstrsplit from data.table. If you want to set missing values to 0 as you asked in the comment HERE, you could just set the argument fill in the function tstrsplit to 0. Given that you retain that your data.table is huge, note that this answer is significantly faster than the one proposed by @dy_by (about 23 times faster on my machine).

This solution requires R version 4.1.0

cols <- c("type", "pBu", "pBo", "zTBu", "zTBo", "zV")

dt.test[, (cols) := gsub(",", ".", info, fixed=TRUE) |>
                    gsub(pattern="^$$|$$$", replacement="") |> 
                    gsub(pattern="/[0-9./]+/", replacement="/") |> 
                    tstrsplit("\]$$|/(?=\d)", perl=TRUE, type.convert=TRUE)]


dt.test[, !"info"]
#       id   type   pBu   pBo  zTBu  zTBo    zV
# 1:  8750    J/B    10    10  1.50  1.50   1.0
# 2:  3048 J*/POP     0     0  2.21  2.21   1.5
# 3:  3593      S    NA    NA    NA    NA    NA
# 4:  8475  KEINE    NA    NA    NA    NA    NA
# 5:  9921   Q*/B     5     5  1.50  1.50   0.7

Update

This is an update to take into the comment. It is meant to be used in R version older than 4.1.0

dt.test[, (cols) := {x <- gsub(",", ".", info, fixed=TRUE)
                     x <- gsub("^\[|$$$", "", x)
                     x <- gsub("/[0-9./]+/", "/", x)
                     tstrsplit(x, "\]\[|/(?=\d)", perl=TRUE, type.convert=TRUE)}]

Separating column into single column and formating the entries

Answers (2)

Update

Related Questions