Reputation: 6874
I've just started with Python and I'm trying to migrate some R functions I wrote over to it. I am stumped by how to create a new column (the equivalent of mutate in R) whose value depends on a regex condition.
Aim
I have a dataframe that contains text. I would like to extract the part of the text that refers to a score, which consists of the letter m followed by a digit, e.g. 'm3' or 'M5'. The idea is that if the regex matches, the digit should be extracted into a column called MStage.
Because of various edge cases a single regex is not enough, so there are several regexes that have to be tried in order using an ifelse (or case_when) type clause.
The R code looks like this:
dataframe <- dataframe %>%
  mutate(
    MStage = map(
      mytext, ~ case_when(
        grepl("(?<=\\d)\\s*[Mm](?:\\s|=)*\\d+", .x,perl = TRUE) ~ stringr::str_replace(stringr::str_extract(.x, "(?<=\\d)\\s*[Mm](?:\\s|=)*\\d+"), "M", ""),
        grepl("(?=[^\\.]*[Bb]arr)[^\\.]*\\s+\\d{2}\\s*[cm]*\\s*(to |-| and)\\s*\\d{2}\\s*[cm]*\\s*", .x, ignore.case = TRUE, perl = TRUE) ~ as.character(as.numeric(sapply(stringr::str_extract_all(stringr::str_extract(.x, "\\d{2}\\s*[cm]*\\s*(to|-|and)\\s*\\d{2}\\s*[cm]*\\s*"), "\\d{2}"), function(y) abs(diff(as.numeric(y)))))),
        grepl("(?=[^\\.]*cm)(?=[^\\.]*[Bb]arr)(?=[^\\.]*(of |length))[^\\.]*", .x, perl = TRUE) ~ stringr::str_extract(paste0(stringr::str_match(.x, "(?=[^\\.]*cm)(?=[^\\.]*[Bb]arr)(?=[^\\.]*(of |length))[^\\.]*"), collapse = ""), "\\d+"),
        grepl("(\\.|^|\n)(?=[^\\.]*(small|tiny|tongue|finger))(?=[^\\.]*[Bb]arr)[^\\.]*(\\.|\n|$)", .x, perl = TRUE) ~ stringr::str_replace(.x, ".*", "1"),
        TRUE ~ "Insufficient"
      )
    )
  )
My attempt
I have started converting this to Python with the following code:
df = df.assign(col = ['pos' if df['text'].str.contains('(?<=\\d)\\s*[Mm](?:\\s|=)*\\d+') else 'Insuf' ])
although the error I get is:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The problem
I'd like to be able to extract the digit when the regex matches, and to apply the other regexes in order as in the R code.
Upvotes: 1
Views: 1410
Reputation: 783
You get that error because if expects a single boolean, but str.contains() gives it an entire Series of booleans. Typically you can fix this by applying a lambda function to each element of the column.
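import re
pat = '(?<=\\d)\\s*[Mm](?:\\s|=)*\\d+'
# A callable given to assign() receives the whole DataFrame, so apply the lambda to each string in the column instead
df = df.assign(col = df['text'].apply(lambda x: 'pos' if re.search(pat, x) else 'Insuf'))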
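A vectorised alternative avoids the per-element lambda by turning the boolean Series into labels directly. The following is only a sketch: the sample rows are made up for illustration, and the column name text is taken from the question's attempt.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column name 'text' follows the question's attempt.
df = pd.DataFrame({"text": ["Barrett's C2 M3", "C1M5 noted", "normal mucosa"]})

pat = r"(?<=\d)\s*[Mm](?:\s|=)*\d+"

# str.contains returns one boolean per row, which is why a bare `if` cannot use it.
mask = df["text"].str.contains(pat, regex=True)

# np.where picks 'pos' or 'Insuf' row by row from the boolean mask.
df = df.assign(col=np.where(mask, "pos", "Insuf"))
```

If the column can contain missing values, passing na=False to str.contains keeps the mask strictly boolean.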
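To extract the digit, put a capture group around the digits and use re.search() in place of the 'pos' (re.match() only matches at the start of the string, so it can never satisfy the lookbehind in this pattern). Alternatively, drop the if else, let pandas' str.extract() pull out the captured digits, and call fillna() with 'Insuf' on the result inside the assign().
digit_pat = '(?<=\\d)\\s*[Mm](?:\\s|=)*(\\d+)'
df = df.assign(col = df['text'].str.extract(digit_pat, expand=False).fillna('Insuf'))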
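For the ordered fallbacks in the question's case_when, one option is a plain function that tries each rule in turn and returns on the first hit. The sketch below is a rough port of the R patterns, not a verified equivalent. The names m_stage, M_DIRECT, BARR_RANGE, BARR_LENGTH and BARR_SMALL are hypothetical, and it deviates from the R code in three places. The first rule returns only the captured digits rather than stripping "M" from the match. The second reads both two-digit numbers from a single pattern. The third falls through to the next rule when the matched sentence contains no digit.

```python
import re

# Rule 1: a digit, then M/m, then the stage digits (captured).
M_DIRECT = re.compile(r"(?<=\d)\s*[Mm](?:\s|=)*(\d+)")
# Rule 2: a Barrett's sentence with a two-digit range, e.g. "35cm to 38cm".
BARR_RANGE = re.compile(
    r"(?=[^.]*barr)[^.]*\s+(\d{2})\s*[cm]*\s*(?:to |-| and)\s*(\d{2})\s*[cm]*\s*",
    re.IGNORECASE,
)
# Rule 3: a Barrett's sentence mentioning cm and "of " or "length".
BARR_LENGTH = re.compile(r"(?=[^.]*cm)(?=[^.]*[Bb]arr)(?=[^.]*(?:of |length))[^.]*")
# Rule 4: a Barrett's sentence described as small, tiny, tongue or finger.
BARR_SMALL = re.compile(
    r"(?:\.|^|\n)(?=[^.]*(?:small|tiny|tongue|finger))(?=[^.]*[Bb]arr)[^.]*(?:\.|\n|$)"
)


def m_stage(text):
    """Return the M stage for one report string, trying each rule in order."""
    m = M_DIRECT.search(text)
    if m:
        return m.group(1)
    m = BARR_RANGE.search(text)
    if m:
        # Length is the absolute difference between the two measurements.
        return str(abs(int(m.group(1)) - int(m.group(2))))
    m = BARR_LENGTH.search(text)
    if m:
        digits = re.search(r"\d+", m.group(0))
        if digits:
            return digits.group(0)
    if BARR_SMALL.search(text):
        return "1"
    return "Insufficient"
```

Assuming the text is in a column named text, the function can be applied per row with df.assign(MStage=df['text'].apply(m_stage)).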
Upvotes: 1