Chenglin HAN
Chenglin HAN

Reputation: 41

Split character strings in a column and insert as new rows

Although I saw similar questions and solutions, but I still couldn't solve my problem. I want to split my elements in my data.table into single values and insert into new rows.

The data is like this:

dt <- data.table("0100"=c("9103,9048,9903,7837","8738,2942,2857,4053"),
             "0101"=c("9103,9048,9903","2537,1983"))

I want it to be like this:

dt2 <- data.table("0010" = c(9103,9048,9903,7837,8738,2942,2857,4053),
              "0101" = c(9103,9048,9903,2537,1983,NA,NA,NA))

Since I just start learning R, Please help me solve this problem

Upvotes: 4

Views: 102

Answers (4)

LMc
LMc

Reputation: 18732

You could use splitstackshape::cSplit:

library(splitstackshape)
cSplit(dt, 1:ncol(dt), direction = "long", makeEqual = T, type.convert = as.numeric)

Output

   0100 0101
1: 9103 9103
2: 9048 9048
3: 9903 9903
4: 7837 <NA>
5: 8738 2537
6: 2942 1983
7: 2857 <NA>
8: 4053 <NA>

A tidyverse solution could be :

library(tidyr)
library(dplyr)
library(stringr)

dt %>% 
  summarize(across(.fns = ~ list(unlist(str_split(., ","))))) %>%
  pivot_longer(everything()) %>% 
  mutate(value = lapply(value, `length<-`, max(lengths(value)))) %>%
  pivot_wider(names_from = name,
              values_from = value) %>% 
  unnest(cols = everything())

Upvotes: 2

Chris Ruehlemann
Chris Ruehlemann

Reputation: 21440

A solution with dyplr and tidyr:

library(dplyr)
library(tidyr)
dt1 <- dt[,1] %>%
  separate_rows(`0100`, sep = ",", convert = TRUE) 

dt2 <- dt[,2] %>%
  separate_rows(`0101`, sep = ",", convert = TRUE) %>%
  summarise(`0101` = c(`0101`, rep(NA, nrow(dt1) - nrow(dt2))))

dt <- bind_cols(dt1, dt2)
# A tibble: 8 x 2
  `0100` `0101`
   <int>  <int>
1   9103   9103
2   9048   9048
3   9903   9903
4   7837   2537
5   8738   1983
6   2942     NA
7   2857     NA
8   4053     NA

Upvotes: 1

Till
Till

Reputation: 6663

library(data.table)

Using lapply() we can process each column with a custom function. This function first applies strsplit(), turning each column into a list with two elements. By applying unlist() we turn the lists into vectors (of unequal length).

dt_l <- lapply(dt, \(x) strsplit(x, ",") |> unlist())
dt_l
#> $`0100`
#> [1] "9103" "9048" "9903" "7837" "8738" "2942" "2857" "4053"
#> 
#> $`0101`
#> [1] "9103" "9048" "9903" "2537" "1983"

By setting the length of the second column to the maximum length of dt_l, the second column is “filled in” with NAs.

length(dt_l$`0101`) <- max(lengths(dt_l))
dt_l
#> $`0100`
#> [1] "9103" "9048" "9903" "7837" "8738" "2942" "2857" "4053"
#> 
#> $`0101`
#> [1] "9103" "9048" "9903" "2537" "1983" NA     NA     NA

Now we can turn it all into a data.table again.

as.data.table(dt_l)
#>    0100 0101
#> 1: 9103 9103
#> 2: 9048 9048
#> 3: 9903 9903
#> 4: 7837 2537
#> 5: 8738 1983
#> 6: 2942 <NA>
#> 7: 2857 <NA>
#> 8: 4053 <NA>

Upvotes: 4

PavoDive
PavoDive

Reputation: 6496

This is far from a clean solution, but it produces the result you want:

# our solution will rep the column with less elements, so we need to get the row
cutoff = dt[, sum(stringi::stri_count(`0101`, fixed = ","))+2]

# We split each string at the ",", unlist the result and convert to numeric
dt2 = dt[, lapply(.SD, function(x) as.numeric(unlist(strsplit(x, ","))))]

# we replace the the rep'd values with NAs:
dt2[(cutoff + 1) : .N, `0101` := NA]

The big assumption here is that the 0101 vector will always be the shorter one.

Upvotes: 1

Related Questions