James
James

Reputation: 43

Split a String based on a specific punctuation

Consider the below sample dataframe;

Sample DataFrame

|                   outcome                              
| 
----------------------------------------------------------
| Stewart Young, CEO, ABC Corp; Mark Hill, CFO, DCB Corp |
| Hill Man, Executive, FC Bank                           
|

I would like to perform a string split based on ; and then split at ,. For example I would need the outcome provided below:

|          Name           |   Role   |        Company     |
-----------------------------------------------------------
| Stewart Young, Mark Hill| CEO, CFO | ABC Corp, DCB Corp |
|          Hill Man       | Executive|  FC Bank           |

I can perform a string split at ;, but I am unable to obtain exact names, roles and company.

Upvotes: 1

Views: 235

Answers (2)

Tim Biegeleisen
Tim Biegeleisen

Reputation: 522007

Here is a base R solution which seems to be working:

output <- apply(df, 1, function(r) {
    parts <- unlist(strsplit(r, ";\\s*"))
    Name <- sapply(parts, function(x) { strsplit(x, ",\\s*")[[1]][1] })
    Name <- paste(Name, collapse=", ")
    Role <- sapply(parts, function(x) { strsplit(x, ",\\s*")[[1]][2] })
    Role <- paste(Role, collapse=", ")
    Company <- sapply(parts, function(x) { strsplit(x, ",\\s*")[[1]][3] })
    Company <- paste(Company, collapse=",")
    c(Name, Role, Company)
})
output <- data.frame(t(output))
output

                    X1        X2                X3
1 Stewart Young, Mark Hill  CEO, CFO ABC Corp,DCB Corp
2                 Hill Man Executive           FC Bank

Data:

df <- data.frame(outcome=c("Stewart Young, CEO, ABC Corp; Mark Hill, CFO, DCB Corp",
    "Hill Man, Executive, FC Bank"),
    stringsAsFactors=FALSE)

Upvotes: 3

Ronak Shah
Ronak Shah

Reputation: 389135

I would suggest to keep every entry in separate row instead of merging them together. Using dplyr and tidyr one way would be

library(dplyr)
library(tidyr)

df %>%
  separate_rows(outcome, sep = ";") %>%
  separate(outcome, c("Name", "Role", "Company"), sep = ", ")

#           Name       Role   Company
#1 Stewart Young        CEO  ABC Corp
#2     Mark Hill        CFO  DCB Corp
#3      Hill Man  Executive   FC Bank

However, if you need output as shown we can do

df %>%
  mutate(row = row_number()) %>%
  separate_rows(outcome, sep = ";") %>%
  separate(outcome, c("Name", "Role", "Company"), sep = ", ") %>%
  group_by(row) %>%
  summarise_all(toString) %>%
  select(-row)

#  Name                      Role      Company           
#  <chr>                     <chr>     <chr>             
#1 Stewart Young,  Mark Hill CEO, CFO  ABC Corp, DCB Corp
#2 Hill Man                  Executive FC Bank         

data

df <- structure(list(outcome = structure(2:1, .Label = c("Hill Man,
Executive, FC Bank", "Stewart Young, CEO, ABC Corp; Mark Hill, CFO, DCB Corp"), 
class = "factor")), class = "data.frame", row.names = c(NA, -2L))       

Upvotes: 2

Related Questions