Reputation: 506
I have a huge data table with two columns (id
and info
) and I want to separate/split the column info
into single columns and format them.
Here is an example, how my data table's entries look like:
id <- c(8750, 3048, 3593, 8475, 9921)
info <- c("[J/B][10,00/10,00][1,500/1,500][1,00]", "[J*/POP][0,00/0,00/0,00/0,00][2,210/2,210/2,210/2,210][1,50]", "S", "KEINE",
"[Q*/B][5,00/5,00][1,500/1,500][0,70]")
dt.test <- data.table(id, info)
dt.test
# id info
# 8750 [J/B][10,00/10,00][1,500/1,500][1,00]
# 3048 [J*/POP][0,00/0,00/0,00/0,00][2,210/2,210/2,210/2,210][1,50]
# 3593 S
# 8475 KEINE
# 9921 [Q*/B][5,00/5,00][1,500/1,500][0,70]
At the end I need a table, where the info
column is separated into six single columns (id
, type
, pBu
, pBo
, zTBu
, zTBo
and zV
) and formatting them as follows:
# id type pBu pBo zTBu zTBo zV
# 8750 J/B 10 10 1.50 1.50 1.0
# 3048 J*/POP 0 0 2.21 2.21 1.5
# 3593 S NA NA NA NA NA
# 8475 KEINE NA NA NA NA NA
# 9921 Q*/B 5 5 1.50 1.50 0.7
Just a little explanation of my info
column at the beginning and how the end data table should look:
,
in the info
column have to be replaced by an dot .
(see my end data table).[]
are defining my new type
column, and furthermore, if the info
column is filled with "S" or "KEINE", this also has to be an entry of type
column.[]
are defining the new columns pBu
(first element) and pBo
(last element).[]
are defining new columns zTBu
(first element) and zTBo
(last element).[]
is defining the zV
column.type
column has entry "S" or "KEINE", then the remaining columns have to be filled with NA
s.Upvotes: 1
Views: 79
Reputation: 6519
You code could combine the built-in function gsub
and the function tstrsplit
from data.table
. If you want to set missing values to 0 as you asked in the comment HERE, you could just set the argument fill
in the function tstrsplit
to 0. Given that you retain that your data.table is huge, note that this answer is significantly faster than the one proposed by @dy_by (about 23 times faster on my machine).
This solution requires R version 4.1.0
cols <- c("type", "pBu", "pBo", "zTBu", "zTBo", "zV")
dt.test[, (cols) := gsub(",", ".", info, fixed=TRUE) |>
gsub(pattern="^\\[|\\]$", replacement="") |>
gsub(pattern="/[0-9./]+/", replacement="/") |>
tstrsplit("\\]\\[|/(?=\\d)", perl=TRUE, type.convert=TRUE)]
dt.test[, !"info"]
# id type pBu pBo zTBu zTBo zV
# 1: 8750 J/B 10 10 1.50 1.50 1.0
# 2: 3048 J*/POP 0 0 2.21 2.21 1.5
# 3: 3593 S NA NA NA NA NA
# 4: 8475 KEINE NA NA NA NA NA
# 5: 9921 Q*/B 5 5 1.50 1.50 0.7
This is an update to take into the comment. It is meant to be used in R version older than 4.1.0
dt.test[, (cols) := {x <- gsub(",", ".", info, fixed=TRUE)
x <- gsub("^\\[|\\]$", "", x)
x <- gsub("/[0-9./]+/", "/", x)
tstrsplit(x, "\\]\\[|/(?=\\d)", perl=TRUE, type.convert=TRUE)}]
Upvotes: 1
Reputation: 1241
separate
on bracketsmutate
and word
),
to .
(map_df
)map_df
)select
to keep proper order of the
columnslibrary(tidyr)
library(stringr)
library(purrr)
dt.test %>%
separate(info, into = c('type', 'pBu - pBo', 'zTBu - zTBo', 'zV'),
sep = "\\]|\\/ " ,
remove = T) %>%
mutate('pBu' = word(`pBu - pBo`,1, sep = "\\/"),
'pBo' = word(`pBu - pBo`,-1, sep = "\\/"),
'pBu - pBo' = NULL,
'zTBu' = word(`zTBu - zTBo`,1, sep = "\\/"),
'zTBo' = word(`zTBu - zTBo`,-1, sep = "\\/"),
'zTBu - zTBo' = NULL) %>%
map_df(~ gsub(",", ".", .x)) %>%
map_df(~ gsub("\\[", "", .x)) %>%
select(id , type, pBu, pBo, zTBu, zTBo, zV) ## %>% as.data.table
output
# A tibble: 5 x 7
id type pBu pBo zTBu zTBo zV
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 8750 J/B 10.00 10.00 1.500 1.500 1.00
2 3048 J*/POP 0.00 0.00 2.210 2.210 1.50
3 3593 S NA NA NA NA NA
4 8475 KEINE NA NA NA NA NA
5 9921 Q*/B 5.00 5.00 1.500 1.500 0.70
%>% as.data.table
at the end, allow you keep data.table form
Upvotes: 2