DSGym
DSGym

Reputation: 2867

Count appearence of elements in String

I have got the following dataset:

structure(list(ID = c(5L, 6L, 7L, 8L, 10L), chain = c("x49", 
                                                      "x43", "x32 > x42 > x49 > x45 > x20 > x50 > x38", "x54 > x44", 
                                                      "x38 > x38")), row.names = c(NA, -5L), class = c("data.table", 
                                                                                                       "data.frame"))

   ID                                   chain
1:  5                                     x49
2:  6                                     x43
3:  7 x32 > x42 > x49 > x45 > x20 > x50 > x38
4:  8                               x54 > x44
5: 10                               x38 > x38

The chain columns represents the buying process of a product and also lacks some information (start and buy). The goal is to count each value in the chain twice (origin e.g. from and destination e.g. to) To be able to do this, I need to restructure the dataset. For example the restructured chain x54 > x44 should look this this:

   from  to
1 start x54
2   x54 x44
3   x44 buy

The whole result should look like this:

    from  to
1  start x49
2    x49 buy
3  start x43
4    x43 buy
5  start x32
6    x32 x42
7    x42 x49
8    x49 x45
9    x45 x20
10   x20 x50
11   x38 buy
12 start x54
13   x54 x44
14   x44 buy
15 start x54
16   x54 x44
17   x44 buy
18 start x38
19   x38 x38
20   x38 buy

I already tried this, but I´m not sure if this is a good idea (also have no real idea how to go on here).

df <- strsplit(df$chain, ">")
lapply(df, trimws)

Performance might be important, since chains can become quite long (30 Items) and the whole dataset has 100k rows.

Upvotes: 1

Views: 53

Answers (2)

akrun
akrun

Reputation: 886938

We can paste the strings at the beginning and end with str_c, use separate_rows to expand the dataset with tidyverse

library(tidyverse)
dt %>%
   mutate(chain = str_c("start > ", chain, " > buy")) %>%
   separate_rows(chain) %>% group_by(ID) %>% 
   transmute(from = chain, to = lead(chain)) %>% 
   na.omit %>% 
   ungroup %>% 
   select(-ID)
# A tibble: 18 x 2
#   from  to   
#   <chr> <chr>
# 1 start x49  
# 2 x49   buy  
# 3 start x43  
# 4 x43   buy  
# 5 start x32  
# 6 x32   x42  
# 7 x42   x49  
# 8 x49   x45  
# 9 x45   x20  
#10 x20   x50  
#11 x50   x38  
#12 x38   buy  
#13 start x54  
#14 x54   x44  
#15 x44   buy  
#16 start x38  
#17 x38   x38  
#18 x38   buy  

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388797

A base R way is to split the strings on " > " and create a dataframe combining all the values.

do.call(rbind, lapply(strsplit(df$chain, " > "), function(x) 
               data.frame(from = c("start",x), to = c(x, "buy"))))

#    from  to
#1  start x49
#2    x49 buy
#3  start x43
#4    x43 buy
#5  start x32
#6    x32 x42
#7    x42 x49
#8    x49 x45
#9    x45 x20
#10   x20 x50
#11   x50 x38
#12   x38 buy
#13 start x54
#14   x54 x44
#15   x44 buy
#16 start x38
#17   x38 x38
#18   x38 buy

Using similar approach a tidyverse way would be

library(tidyverse)
map_dfr(str_split(df$chain, " > "), ~tibble(from = c("start",.), to = c(., "buy")))

Upvotes: 1

Related Questions