pekoz1
pekoz1

Reputation: 23

pandas str extract use regex to select all letters from a string only returning 1st match - str extract or regex issue?

In Pandas I have a dataframe column called TermNew containing the following lowercase strings (please ignore bullet points - I was having trouble formatting)

TermNew

Im trying to extract all letter characters (a-z only, no digits, no whitespace, no () or /) from TermNew into a new column Termtext with these expected outcomes

Termtext

Ive tried the following but its only returning letters up to the first white space i.e.

leaseterm1['Termtext'] = leaseterm1['TermNew'].str.extract(r"([a-z]+)")

Outputs

In regex101 I can use the global flag to match all letters correctly See example

Questions

1/ Is this a problem with str extract only finding the first match or

2/ Is this a regex problem - I havent included any form of global search past the 1st whitespace?

Any suggestions gratefully received. Thanks

Upvotes: 0

Views: 6514

Answers (2)

SeaBean
SeaBean

Reputation: 23217

You can use str.extractall() instead and aggregate the results of multiple matches, as follows:

leaseterm1['Termtext'] = leaseterm1['TermNew'].str.extractall(r"([a-z]+)").groupby(level=0).agg(''.join)

or use GroupBy.sum for aggregation:

leaseterm1['Termtext'] = leaseterm1['TermNew'].str.extractall(r"([a-z]+)").groupby(level=0).sum(numeric_only=False)

Result:

print(leaseterm1)

                                    TermNew           Termtext
0                  999 years from 1/01/2001          yearsfrom
1  999 years (less 20 days) from 20/11/2000  yearslessdaysfrom
2                   99 years from 1/10/1979          yearsfrom
3                  999 years from 1/01/1992          yearsfrom

Regarding to your questions:

As you can see from the official doc of str.extract()

For each subject string in the Series, extract groups from the first match of regular expression pat.

str.extract() extracts the first match only.

If you want to extract for multiple matches, you should use str.extractall() instead.

For str.extractall():

For each subject string in the Series, extract groups from all matches of regular expression pat.

Upvotes: 2

The fourth bird
The fourth bird

Reputation: 163467

It is easier to replace all characters other than a-z

leaseterm1['Termtext'] = leaseterm1['TermNew'].str.replace(r"[^a-z]+", "")

Output

                                    TermNew           Termtext
0                  999 years from 1/01/2001          yearsfrom
1  999 years (less 20 days) from 20/11/2000  yearslessdaysfrom
2                   99 years from 1/10/1979          yearsfrom
3                  999 years from 1/01/1992          yearsfrom

Upvotes: 2

Related Questions