Extract regex matches, and not groups, in data frame rows in Python

Question

I am a novice at coding and I generally use R for this (stringr), but I have also started to learn Python's syntax.

I have a data frame with one column generated from an imported Excel file. The values in this column contain both uppercase and lowercase characters, symbols and numbers.

I would like to generate a second column in the data frame containing only those words from the first column that match a regex pattern.

import pandas as pd
df = pd.DataFrame(["THIS IS A TEST 123123. s.m.", "THIS IS A Test test 123 .s.c.e", "TESTING T'TEST 123 da."], columns=['Test'])

df

Now, to extract what I want (words in uppercase), in R I would generally use:

df <- str_extract_all(df$Test, "\\b[A-Z]{1,}\\b", simplify = FALSE)

to extract the matches of the regular expression for each data frame row, which are:

* THIS IS A TEST
* THIS IS A
* TESTING T TEST

I couldn't find a similar solution for Python, and the closest I've got is the following:

df["Name"] = df["Test"].str.extract(r"(\b[A-Z]{1,}\b)", expand = True)

Unfortunately this does not work, as it returns only the first match in each row rather than all matches of the regex. I've tried multiple strategies, but str.extractall does not seem to work either ("TypeError: incompatible index of inserted column with frame index").

How can I extract the information I want with Python?

Thanks!

godot · Accepted Answer

If I understand correctly, you can try:

df["Name"] = (df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)")
                        .unstack().fillna('').apply(' '.join, axis=1))
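
To see why the chain above is needed, here is a rough step-by-step sketch using the data frame from the question. The variable names matches, wide and joined are only for illustration. str.extractall returns one row per match with a two-level index (original row, match number), which is most likely why assigning it directly to a column raises the TypeError. The final .str.strip() is my own addition to drop the trailing spaces left by the empty fill values.

```python
import pandas as pd

df = pd.DataFrame(
    ["THIS IS A TEST 123123. s.m.",
     "THIS IS A Test test 123 .s.c.e",
     "TESTING T'TEST 123 da."],
    columns=["Test"],
)

# One row per match; the index has two levels: (original row, match number)
matches = df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)")

# Pivot the match number into columns, so there is one row per original row again;
# rows with fewer matches are padded with NaN
wide = matches[0].unstack()

# Replace NaN with '', join the words of each row, and trim the padding spaces
joined = wide.fillna("").apply(" ".join, axis=1).str.strip()

df["Name"] = joined
print(df)
```

Because joined is indexed by the original row labels again, the assignment to df["Name"] lines up with the frame.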

[EDIT]: Here is a shorter version I discovered by looking at the doc:

df["Name"] = df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)").unstack(fill_value='').apply(' '.join, axis=1)
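
As a possible alternative (not part of the approach above), str.findall returns all matches per row as a list, much like str_extract_all in R, and str.join then glues each list into one string. A minimal sketch, assuming the same data frame as in the question:

```python
import pandas as pd

df = pd.DataFrame(
    ["THIS IS A TEST 123123. s.m.",
     "THIS IS A Test test 123 .s.c.e",
     "TESTING T'TEST 123 da."],
    columns=["Test"],
)

# findall gives a list of all uppercase-only words for each row;
# .str.join(" ") turns each list into a single space-separated string
df["Name"] = df["Test"].str.findall(r"\b[A-Z]+\b").str.join(" ")
print(df)
```

This keeps the original index throughout, so no unstacking is needed and no padding spaces are introduced.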
