Reputation: 143
Saw an answer in Python similar for this question but not R so for the sake of being redundant since python is not in my wheel house. Data variable "PublicFilings" contains multiple values that I would like to split into 4 new variables. There are three basic outputs listed below but there will be different combinations of counts for Judgments, Liens and Suits, needless to say bankruptcy is yes no but I'd like that binary. Any thoughts on a simply approach for a data frame? Id can be used as a primary key the combination of No data as an initial output, inability to use comma to separate and desire to convert yes no to binary is throwing me off.
Existing Data
Id PublicFilings
1 Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0
2 Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0
3 No Data
No Data implies there was no match to an entity and no Public Filing data was returned
Converted Data
Id Bankruptcies Judgments Liens Suits
1 0 0 0 0
2 1 0 0 0
3 Null Null Null Null
df1 <-
structure(list(TranId = 1:3,
Name = c("ACME Five,","ACME","WALMART"),
Check = c("1234","1234","1235"),
Entity = c("55555","55551","55556"),
Match =c("0","0","0"),
Score = c("50","60","NA"),
Date = c("2019-01-01", "2019-01-02","2019-01-02"),
PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 10, Suits: 0",
"Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0",
"No Data"),
Controls =c("2015","2015","1998"),
NumEmpoyees = c("5","8","6"),
LOB = c("Retail, Food","Retail, Food","Retail, All"),
PayScore = c("40","42","NA"),
Primary = c("CEO","CEO","CFO"),
STARTYear = c("1982","1982","1965"),
SpecEvent = c("0","0","0"),
Filings =c("0","0","1"),
PayExp =c("","","1"
)), class = "data.frame", row.names = c(NA, -3L))
View(df1)
library(dplyr)
library(tidyr)
df1 %>%
separate_rows(PublicFilings, sep = ",\\s+") %>%
separate(PublicFilings, into = c("key", "value"), sep=":\\s+") %>%
mutate(key = na_if(key, "No Data"),
value = as.integer(value %in% c("Yes", "1"))) %>%
pivot_wider(names_from = key, values_from = value) %>%
select(-`NA`)
View(df1)
# A tibble: 3 x 20
TranId Name Check Entity Match Score Date Controls NumEmpoyees LOB PayScore Primary STARTYear SpecEvent Filings
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 ACME~ 1234 55555 0 50 2019~ 2015 5 Reta~ 40 CEO 1982 0 0
2 2 ACME 1234 55551 0 60 2019~ 2015 8 Reta~ 42 CEO 1982 0 0
3 3 WALM~ 1235 55556 0 NA 2019~ 1998 6 Reta~ NA CFO 1965 0 1
# ... with 5 more variables: PayExp <chr>, Bankruptcies <int>, Judgments <int>, Liens <int>, Suits <int>
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [9].
Upvotes: 3
Views: 1371
Reputation: 2974
These answers were helpful in working out my problem- I'm saving my slightly different version which essentially does the same thing, captured into a function which could be polished further.
extract_keyval <- function(string, sep=","){
string = gsub("No Data", "", string, ignore.case = T) # handle missing data
key_value = unlist(strsplit(string, split = sep)) # vector of key-value pairs
vars = gsub("(.*):\\s+.*", "\\1", key_value) # extract variable name
values = gsub(".*:\\s+(.*)", "\\1", key_value) # extract values
values = gsub("yes", "1", values, ignore.case = T)
values = gsub("no", "0", values, ignore.case = T)
names(values) = vars # attach names to value
return(values) # named vector
}
I typically use the purrr
package to do the transformation back to a dataframe:
df1 %>% mutate(map_df(PublicFilings, ~extract_keyval(.x)))
Upvotes: 0
Reputation: 887901
One option is to split the 'PublicFilings' at the ,
into 'long' format, then create two columns with separate
, and reshape into 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df1 %>%
separate_rows(PublicFilings, sep = ",\\s+") %>%
separate(PublicFilings, into = c("key", "value"), sep=":\\s+") %>%
mutate(key = na_if(key, "No Data"),
value = as.integer(value %in% c("Yes", "1"))) %>%
pivot_wider(names_from = key, values_from = value) %>%
select(-`NA`)
# Id Bankruptcies Judgments Liens Suits
#1 1 0 0 0 0
#2 2 1 0 0 0
#3 3 NA NA NA NA
df1 <- structure(list(Id = 1:3, PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0",
"Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0", "No Data"
)), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 1
Reputation: 102710
Here is a base R solution, where strsplit()
+ gsub()
are used
u <- strsplit(df$PublicFilings,split = ", ")
u[[which(lengths(u)==1)]] <- rep(NA, max(lengths(u)))
M <- do.call(rbind, Map(function(x) gsub(".*:\\s","",x), u))
M[M=="No"] <- 0
M[M=="Yes"] <- 1
dfout <- cbind(df[1],setNames(data.frame(M),gsub(":.*","",u[[which.max(lengths(u))]])))
such that
> dfout
Id Bankruptcies Judgments Liens Suits
1 1 0 0 0 0
2 2 1 0 0 0
3 3 <NA> <NA> <NA> <NA>
DATA
df <- structure(list(Id = 1:3, PublicFilings = c("Bankruptcies: No, Judgments: 0, Liens: 0, Suits: 0",
"Bankruptcies: Yes, Judgments: 0, Liens: 0, Suits: 0", "No Data"
)), class = "data.frame", row.names = c(NA, -3L))
Upvotes: 0