Sebastian Zeki

Reputation: 6874

Using case_when with mutate equivalent in python

I've just started with Python and I'm trying to migrate some R functions I created over to Python. I'm stumped on how to create a mutate-style column based on a regex condition.

Aim

I have a dataframe that contains text. I would like to extract the part of the text that refers to a score, which consists of the letter m followed by a digit, e.g. 'm3' or 'M5'. The idea is that if the regex matches, the digit should be extracted into a column called MStage.

The regex is a little complicated because of various edge cases, so there are necessarily multiple regexes, which have to be evaluated in order using an ifelse (or case_when) type clause.

The R code looks like this:

 dataframe <- dataframe %>%
    mutate(
      MStage = map(
        mytext, ~ case_when(
          grepl("(?<=\\d)\\s*[Mm](?:\\s|=)*\\d+", .x,perl = TRUE) ~ stringr::str_replace(stringr::str_extract(.x, "(?<=\\d)\\s*[Mm](?:\\s|=)*\\d+"), "M", ""),
          grepl("(?=[^\\.]*[Bb]arr)[^\\.]*\\s+\\d{2}\\s*[cm]*\\s*(to |-| and)\\s*\\d{2}\\s*[cm]*\\s*", .x, ignore.case = TRUE, perl = TRUE) ~ as.character(as.numeric(sapply(stringr::str_extract_all(stringr::str_extract(.x, "\\d{2}\\s*[cm]*\\s*(to|-|and)\\s*\\d{2}\\s*[cm]*\\s*"), "\\d{2}"), function(y) abs(diff(as.numeric(y)))))),
          grepl("(?=[^\\.]*cm)(?=[^\\.]*[Bb]arr)(?=[^\\.]*(of |length))[^\\.]*", .x, perl = TRUE) ~ stringr::str_extract(paste0(stringr::str_match(.x, "(?=[^\\.]*cm)(?=[^\\.]*[Bb]arr)(?=[^\\.]*(of |length))[^\\.]*"), collapse = ""), "\\d+"),
          grepl("(\\.|^|\n)(?=[^\\.]*(small|tiny|tongue|finger))(?=[^\\.]*[Bb]arr)[^\\.]*(\\.|\n|$)", .x, perl = TRUE) ~ stringr::str_replace(.x, ".*", "1"),
          TRUE ~ "Insufficient"
        )
      )
    )

My attempt

I have started to try to convert this into python with the following code:

df = df.assign(col = ['pos' if df['text'].str.contains('(?<=\\d)\\s*[Mm](?:\\s|=)*\\d+') else 'Insuf' ])

although the error I get is:

The problem

 ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I'd like to be able to extract the digit if the regex matches (and to handle the other regexes as per the R code).

Upvotes: 1

Views: 1410

Answers (1)

semblable

Reputation: 783

The reason you get that error is that if expects a single boolean, but you're feeding it an entire Series of booleans. Typically you can fix this by applying a lambda function to each value in the column.

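import re
pat = r'(?<=\d)\s*[Mm](?:\s|=)*\d+'
# apply() calls the lambda once per string in the column, so re.search sees a single value
df = df.assign(col = df['text'].apply(lambda s: 'pos' if re.search(pat, s) else 'Insuf'))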
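As an alternative sketch, the same result can be had without a per-row lambda: str.contains already returns one boolean per row, and numpy.where picks a value for each of them. The sample DataFrame below is made up for illustration; the column name 'text' is taken from the question.

```python
import numpy as np
import pandas as pd

pat = r"(?<=\d)\s*[Mm](?:\s|=)*\d+"

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"text": ["C2 M3 reported", "no score here"]})

# str.contains yields a boolean Series; np.where chooses a label per row
has_score = df["text"].str.contains(pat, regex=True)
df = df.assign(col=np.where(has_score, "pos", "Insuf"))
```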

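To extract the digit, return the matched text from re.search() in place of 'pos'. Note that re.match() won't work here: it only matches at the start of the string, and this pattern begins with a lookbehind for a digit, so it can never match at position 0. Alternatively, drop the if/else, return the match (or None) from the lambda, and call fillna() with 'Insuf' on the resulting column.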

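# The lambda returns the matched text, or None when nothing matches; fillna() then supplies the default (needs Python 3.8+ for :=)
df = df.assign(col = df['text'].apply(lambda s: (m := re.search(pat, s)) and m.group()).fillna('Insuf'))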
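For the ordered, case_when-style logic in the question, one possible approach is a plain function that tries each regex in priority order and returns on the first hit. This is only a sketch: it implements the first rule from the R code (with a capture group added so that only the digits are returned), m_stage and M_PATTERN are hypothetical names, and the sample data is invented.

```python
import re
import pandas as pd

# First pattern from the question, with a capture group around the digits
M_PATTERN = re.compile(r"(?<=\d)\s*[Mm](?:\s|=)*(\d+)")

def m_stage(text: str) -> str:
    """Return the result of the first rule that matches, like R's case_when."""
    m = M_PATTERN.search(text)
    if m:
        return m.group(1)
    # Further `if` branches for the other regexes would go here, in priority order
    return "Insufficient"

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"text": ["C2 M3 reported", "C1M 5", "no score here"]})
df = df.assign(MStage=df["text"].apply(m_stage))
```

Because the function returns as soon as a rule matches, later rules are only evaluated when the earlier ones fail, which mirrors how case_when resolves its conditions.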

Upvotes: 1
