Ricky
Ricky

Reputation: 143

How to parse or split string variable into multiple new variables in R

Saw an answer in Python similar for this question but not R so for the sake of being redundant since python is not in my wheel house. Data variable "PublicFilings" contains multiple values that I would like to split into 4 new variables. There are three basic outputs listed below but there will be different combinations of counts for Judgments, Liens and Suits, needless to say bankruptcy is yes no but I'd like that binary. Any thoughts on a simply approach for a data frame? Id can be used as a primary key the combination of No data as an initial output, inability to use comma to separate and desire to convert yes no to binary is throwing me off.

Existing Data 
Id   PublicFilings 
1    Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0 
2    Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0 
3    No Data

No Data implies there was no match to an entity and no Public Filing data was returned

Converted Data 
Id Bankruptcies Judgments Liens Suits 
1  0             0         0     0 
2  1             0         0     0 
3 Null           Null      Null  Null



   df1 <- 
  structure(list(TranId = 1:3, 
                 Name = c("ACME Five,","ACME","WALMART"),
                 Check = c("1234","1234","1235"), 
                 Entity = c("55555","55551","55556"),
                 Match =c("0","0","0"),
                 Score = c("50","60","NA"),
                 Date = c("2019-01-01", "2019-01-02","2019-01-02"),
                 PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 10, Suits: 0", 
                                   "Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0", 
                                   "No Data"),
                 Controls =c("2015","2015","1998"),
                 NumEmpoyees = c("5","8","6"),
                 LOB = c("Retail, Food","Retail, Food","Retail, All"),
                 PayScore = c("40","42","NA"),
                 Primary = c("CEO","CEO","CFO"),
                 STARTYear = c("1982","1982","1965"),
                 SpecEvent = c("0","0","0"),
                 Filings =c("0","0","1"),
                 PayExp =c("","","1"
                 )), class = "data.frame", row.names = c(NA, -3L))

View(df1)


library(dplyr)
library(tidyr)
df1 %>%
  separate_rows(PublicFilings, sep = ",\\s+") %>%
  separate(PublicFilings, into = c("key", "value"), sep=":\\s+") %>%
  mutate(key = na_if(key, "No Data"),
         value = as.integer(value %in%  c("Yes", "1"))) %>%
  pivot_wider(names_from = key, values_from = value) %>%
  select(-`NA`)
View(df1)

    # A tibble: 3 x 20
  TranId Name  Check Entity Match Score Date  Controls NumEmpoyees LOB   PayScore Primary STARTYear SpecEvent Filings
   <int> <chr> <chr> <chr>  <chr> <chr> <chr> <chr>    <chr>       <chr> <chr>    <chr>   <chr>     <chr>     <chr>  
1      1 ACME~ 1234  55555  0     50    2019~ 2015     5           Reta~ 40       CEO     1982      0         0      
2      2 ACME  1234  55551  0     60    2019~ 2015     8           Reta~ 42       CEO     1982      0         0      
3      3 WALM~ 1235  55556  0     NA    2019~ 1998     6           Reta~ NA       CFO     1965      0         1      
# ... with 5 more variables: PayExp <chr>, Bankruptcies <int>, Judgments <int>, Liens <int>, Suits <int>
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [9].

Upvotes: 3

Views: 1371

Answers (3)

Matt L.
Matt L.

Reputation: 2974

These answers were helpful in working out my problem- I'm saving my slightly different version which essentially does the same thing, captured into a function which could be polished further.

├ Function to extract ----

extract_keyval <- function(string, sep=","){
  string = gsub("No Data", "", string, ignore.case = T) # handle missing data
  key_value =  unlist(strsplit(string, split = sep)) # vector of key-value pairs  
  vars = gsub("(.*):\\s+.*", "\\1", key_value)    # extract variable name
  values = gsub(".*:\\s+(.*)", "\\1", key_value) # extract values
  values = gsub("yes", "1", values, ignore.case = T)
  values = gsub("no", "0", values, ignore.case = T)
  names(values) = vars # attach names to value
  return(values) # named vector 
  }

I typically use the purrr package to do the transformation back to a dataframe:

├ Tidyverse option ----

df1 %>% mutate(map_df(PublicFilings, ~extract_keyval(.x)))

Upvotes: 0

akrun
akrun

Reputation: 887901

One option is to split the 'PublicFilings' at the , into 'long' format, then create two columns with separate, and reshape into 'wide' format with pivot_wider

library(dplyr)
library(tidyr)
df1 %>%
     separate_rows(PublicFilings, sep = ",\\s+") %>%
     separate(PublicFilings, into = c("key", "value"), sep=":\\s+") %>%
     mutate(key = na_if(key, "No Data"),
           value = as.integer(value %in%  c("Yes", "1"))) %>%
     pivot_wider(names_from = key, values_from = value) %>%
     select(-`NA`)
#    Id Bankruptcies Judgments Liens Suits
#1  1            0         0     0     0
#2  2            1         0     0     0
#3  3           NA        NA    NA    NA

data

df1 <- structure(list(Id = 1:3, PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0", 
"Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0", "No Data"
)), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 1

ThomasIsCoding
ThomasIsCoding

Reputation: 102710

Here is a base R solution, where strsplit() + gsub() are used

u <- strsplit(df$PublicFilings,split = ", ")
u[[which(lengths(u)==1)]] <- rep(NA, max(lengths(u)))
M <- do.call(rbind, Map(function(x) gsub(".*:\\s","",x), u))
M[M=="No"] <- 0
M[M=="Yes"] <- 1
dfout <- cbind(df[1],setNames(data.frame(M),gsub(":.*","",u[[which.max(lengths(u))]])))

such that

> dfout
  Id Bankruptcies Judgments Liens Suits
1  1            0         0     0     0
2  2            1         0     0     0
3  3         <NA>      <NA>  <NA>  <NA>

DATA

df <- structure(list(Id = 1:3, PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0", 
"Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0", "No Data"
)), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 0

Related Questions