Miko
Miko

Reputation: 506

Separating column into single column and formating the entries

I have a huge data table with two columns (id and info) and I want to separate/split the column info into single columns and format them. Here is an example, how my data table's entries look like:

id <- c(8750, 3048, 3593, 8475, 9921)
info <- c("[J/B][10,00/10,00][1,500/1,500][1,00]", "[J*/POP][0,00/0,00/0,00/0,00][2,210/2,210/2,210/2,210][1,50]", "S", "KEINE",
          "[Q*/B][5,00/5,00][1,500/1,500][0,70]")
dt.test <- data.table(id, info)
dt.test
#   id                                                         info
# 8750                        [J/B][10,00/10,00][1,500/1,500][1,00]
# 3048 [J*/POP][0,00/0,00/0,00/0,00][2,210/2,210/2,210/2,210][1,50]
# 3593                                                            S
# 8475                                                        KEINE
# 9921                         [Q*/B][5,00/5,00][1,500/1,500][0,70]

At the end I need a table, where the info column is separated into six single columns (id, type, pBu, pBo, zTBu, zTBo and zV) and formatting them as follows:

#   id   type pBu pBo zTBu zTBo  zV
# 8750    J/B  10  10 1.50 1.50 1.0
# 3048 J*/POP   0   0 2.21 2.21 1.5
# 3593      S  NA  NA   NA   NA  NA
# 8475  KEINE  NA  NA   NA   NA  NA
# 9921   Q*/B   5   5 1.50 1.50 0.7

Just a little explanation of my info column at the beginning and how the end data table should look:

Upvotes: 1

Views: 79

Answers (2)

B. Christian Kamgang
B. Christian Kamgang

Reputation: 6519

You code could combine the built-in function gsub and the function tstrsplit from data.table. If you want to set missing values to 0 as you asked in the comment HERE, you could just set the argument fill in the function tstrsplit to 0. Given that you retain that your data.table is huge, note that this answer is significantly faster than the one proposed by @dy_by (about 23 times faster on my machine).

This solution requires R version 4.1.0

cols <- c("type", "pBu", "pBo", "zTBu", "zTBo", "zV")

dt.test[, (cols) := gsub(",", ".", info, fixed=TRUE) |>
                    gsub(pattern="^\\[|\\]$", replacement="") |> 
                    gsub(pattern="/[0-9./]+/", replacement="/") |> 
                    tstrsplit("\\]\\[|/(?=\\d)", perl=TRUE, type.convert=TRUE)]


dt.test[, !"info"]
#       id   type   pBu   pBo  zTBu  zTBo    zV
# 1:  8750    J/B    10    10  1.50  1.50   1.0
# 2:  3048 J*/POP     0     0  2.21  2.21   1.5
# 3:  3593      S    NA    NA    NA    NA    NA
# 4:  8475  KEINE    NA    NA    NA    NA    NA
# 5:  9921   Q*/B     5     5  1.50  1.50   0.7

Update

This is an update to take into the comment. It is meant to be used in R version older than 4.1.0

dt.test[, (cols) := {x <- gsub(",", ".", info, fixed=TRUE)
                     x <- gsub("^\\[|\\]$", "", x)
                     x <- gsub("/[0-9./]+/", "/", x)
                     tstrsplit(x, "\\]\\[|/(?=\\d)", perl=TRUE, type.convert=TRUE)}]

Upvotes: 1

dy_by
dy_by

Reputation: 1241

  1. separate on brackets
  2. take first and last part of it (mutate and word)
  3. sub , to . (map_df)
  4. clean up left bracket (second map_df)
  5. select to keep proper order of the columns
library(tidyr)
library(stringr)
library(purrr)
dt.test %>% 
  separate(info, into = c('type', 'pBu - pBo', 'zTBu - zTBo',  'zV'), 
           sep = "\\]|\\/ " , 
           remove = T) %>% 
  mutate('pBu' = word(`pBu - pBo`,1, sep = "\\/"),
         'pBo' = word(`pBu - pBo`,-1, sep = "\\/"),
         'pBu - pBo' = NULL,
         
         'zTBu' = word(`zTBu - zTBo`,1, sep = "\\/"),
         'zTBo' = word(`zTBu - zTBo`,-1, sep = "\\/"),
         'zTBu - zTBo' = NULL) %>% 
  map_df(~ gsub(",", ".", .x)) %>%
  map_df(~ gsub("\\[", "", .x)) %>% 
  select(id ,  type, pBu, pBo, zTBu, zTBo,  zV) ## %>% as.data.table


output
# A tibble: 5 x 7
  id    type   pBu   pBo   zTBu  zTBo  zV   
  <chr> <chr>  <chr> <chr> <chr> <chr> <chr>
1 8750  J/B    10.00 10.00 1.500 1.500 1.00 
2 3048  J*/POP 0.00  0.00  2.210 2.210 1.50 
3 3593  S      NA    NA    NA    NA    NA   
4 8475  KEINE  NA    NA    NA    NA    NA   
5 9921  Q*/B   5.00  5.00  1.500 1.500 0.70 

%>% as.data.table at the end, allow you keep data.table form

Upvotes: 2

Related Questions