William

Reputation: 402

R: any perfect alternative to case_when() when detecting strings with multiple conditions and replacing them?

I applied case_when() to text data with thousands of rows to detect strings under multiple conditions and replace them, but got wrong results because case_when() does not evaluate the remaining conditions once one condition is met. I have seen a solution in How to detect more than one regex in a case_when statement, but that solution does not cover several groups of multiple patterns, as in my data.

Any alternative to case_when() would be appreciated.

This is the dummy data:

statement <- structure(list(stmt = c("diabetes is common", "police not my friend",
  "transport is cheap", "english is my language", "education is my right")), 
  class = "data.frame", row.names = c(NA, -5L))

I tried to adapt the 1st solution in How to detect more than one regex in a case_when statement but could not really figure it out.

I want to detect strings in texts in column stmt and recode the column into these five domains: APC, PDP, APGA, APP and SDP. Below are strings to be detected:

APC <- c("addiction|mental|Diabetes|health|healthy|Oranga|unwell|AOD| well| surgery|dental|recovery|oranga|Mirimiri|asthma|anger|checks|alcohol|pregnant|clinical|clinic")

PDP <- c("whanau direct|whānau direct|money|transport|home|repairs|social|budget|job|housing|house|financial|finance|Ohanga|furniture|accommodation|welfare|living|work|babies arrival|AT hop card|Entitlements|ohunga|bills|electricity|water|employment")

APGA <- c("Kaupapa|Te reo|language|Tikanga|Iwi|relationship|Tikinga|Reunite")

APP <- c("Studying|training|NCEA|ECE|Counseling|counsel|Knowledge|School|Education|matauranga|parenting|skills")

rangatiratanga <- c("self-management|Rangitiratanga|custody|police|court|CYFS|advocacy|Oranga Tamariki|rangatiratanga|section 101|EPOA|Familly issues")

Upvotes: 1

Views: 754

Answers (2)

William

Reputation: 402

Thanks to @Tim Biegeleisen. Note, however, that grepl() is case sensitive by default, so combining case_when() with plain grepl() silently misses matches when the case differs (for example, the pattern "Diabetes" does not match "diabetes is common"). Passing ignore.case = TRUE to grepl() makes the matching case insensitive, as in the code below:

library(dplyr)

# ignore.case = TRUE makes each pattern match regardless of letter case
statement$col <- case_when(
      grepl("(addiction|mental|Diabetes|health|healthy|Oranga|unwell|AOD| well| surgery|dental|recovery|oranga|Mirimiri|asthma|anger|checks|alcohol|pregnant|clinical|clinic)", statement$stmt, ignore.case = TRUE) ~ "APC",
      grepl("(whanau direct|whānau direct|money|transport|home|repairs|social|budget|job|housing|house|financial|finance|Ohanga|furniture|accommodation|welfare|living|work|babies arrival|AT hop card|Entitlements|ohunga|bills|electricity|water|employment)", statement$stmt, ignore.case = TRUE) ~ "PDP",
      grepl("(Kaupapa|Te reo|language|Tikanga|Iwi|relationship|Tikinga|Reunite)", statement$stmt, ignore.case = TRUE) ~ "APGA",
      grepl("(Studying|training|NCEA|ECE|Counseling|counsel|Knowledge|School|Education|matauranga|parenting|skills)", statement$stmt, ignore.case = TRUE) ~ "APP",
      grepl("(self-management|Rangitiratanga|custody|police|court|CYFS|advocacy|Oranga Tamariki|rangatiratanga|section 101|EPOA|Familly issues)", statement$stmt, ignore.case = TRUE) ~ "rangatiratanga",
      TRUE ~ NA_character_
    )
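If a statement can belong to more than one domain, case_when() will still only report the first one. As a rough sketch of one possible alternative (not part of the original answers), the patterns can be kept in a named vector and every matching domain collected per row using base R only. The names `patterns`, `stmt`, `hits` and `domains`, the shortened pattern lists, and the ";" separator are all assumptions made for illustration:

```r
# Hypothetical, shortened patterns; the full alternations from the question
# would go here instead.
patterns <- c(
  APC            = "diabetes|health|clinic",
  PDP            = "transport|money|housing",
  APGA           = "language|te reo",
  APP            = "education|school|training",
  rangatiratanga = "police|court|custody"
)

# Illustrative statements; the third one mentions two domains.
stmt <- c("diabetes is common", "police not my friend",
          "transport to the clinic", "english is my language",
          "nothing relevant here")

# Logical matrix: one row per statement, one column per domain.
# (This assumes more than one statement; with a single statement
# vapply() returns a plain vector instead of a matrix.)
hits <- vapply(
  patterns,
  function(p) grepl(p, stmt, ignore.case = TRUE),
  logical(length(stmt))
)

# Collapse all matching domain names per row; NA when nothing matches.
domains <- apply(hits, 1, function(row) {
  matched <- names(patterns)[row]
  if (length(matched) == 0) NA_character_ else paste(matched, collapse = ";")
})

domains
```

Keeping the logical matrix `hits` around may also be useful on its own, for example to count how many statements fall into each domain with colSums(hits).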

Upvotes: 0

Tim Biegeleisen

Reputation: 520978

You may use case_when with grepl and a regex alternation:

library(dplyr)

statement$col <- case_when(
    grepl("(addiction|mental|Diabetes|health|healthy|Oranga|unwell|AOD| well| surgery|dental|recovery|oranga|Mirimiri|asthma|anger|checks|alcohol|pregnant|clinical|clinic)", statement$stmt) ~ "APC",
    grepl("(whanau direct|whānau direct|money|transport|home|repairs|social|budget|job|housing|house|financial|finance|Ohanga|furniture|accommodation|welfare|living|work|babies arrival|AT hop card|Entitlements|ohunga|bills|electricity|water|employment)", statement$stmt) ~ "PDP",
    grepl("(Kaupapa|Te reo|language|Tikanga|Iwi|relationship|Tikinga|Reunite)", statement$stmt) ~ "APGA",
    grepl("(Studying|training|NCEA|ECE|Counseling|counsel|Knowledge|School|Education|matauranga|parenting|skills)", statement$stmt) ~ "APP",
    grepl("(self-management|Rangitiratanga|custody|police|court|CYFS|advocacy|Oranga Tamariki|rangatiratanga|section 101|EPOA|Familly issues)", statement$stmt) ~ "rangatiratanga",
    TRUE ~ NA_character_
)
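Two properties of this approach are worth keeping in mind: case_when() returns the label of the first condition that is TRUE for each row, and grepl() is case sensitive unless told otherwise. The small sketch below illustrates both, using shortened, hypothetical patterns and made-up statements (`stmt` and `res` are names chosen for this example only):

```r
library(dplyr)

# Made-up statements: the first matches two branches,
# the last two differ only in letter case.
stmt <- c("transport to the clinic",
          "Education is my right",
          "education is my right")

# Shortened, hypothetical patterns; the order of the branches decides ties.
res <- case_when(
  grepl("clinic|health", stmt)    ~ "APC",
  grepl("transport|money", stmt)  ~ "PDP",
  grepl("Education|School", stmt) ~ "APP",
  TRUE ~ NA_character_
)

res
# The first statement is labelled "APC" only, although it also matches the
# "PDP" branch; the lower-case "education" statement is left as NA.
```

If the order of the branches reflects a priority among the domains, this first-match behaviour may be exactly what is wanted; otherwise the branches need reordering or a different approach.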

Upvotes: 1
