Pattern matching using regex in Python

Question

Here are the text examples:

American Psychological Association. (2016). Center for epidemiological studies depression (CESD). Retrieved December 7, 2016, from American Psychological Association, http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx
Beattie, G.S. (2005, November). Social Causes of Depression. Retrieved May 31, 2017, from http:// www.personalityresearch.org/papers/beattie.html

I want the bold part of each entry (the title, e.g. "Social Causes of Depression"), for which I tried:

/\)\.|\s[a-zA-Z]+\./

Here I look for ')', then '.', then a space, and then text up to the next '.'.

Basically, I want the text between two dots, since this is the title of the paper. It starts after either the author or the publication, followed by the year in brackets, as shown in the examples. But the pattern above doesn't give me what I want.

Can anyone help me understand why it is not working, and what other way there is to extract text like this from my dataframe column?

Wiktor Stribiżew · Accepted Answer

You may use the following regex with Series.str.extract:

\)\.\s+([^.]+)


Details

\)\. - a literal ). substring
\s+ - one or more whitespace characters
([^.]+) - Group 1: one or more characters other than a dot
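As a rough illustration of why the pattern from the question does not behave as described, the sketch below runs both patterns with Python's re module on a shortened copy of the second sample. The likely culprit is the |, which is alternation: the attempt matches either ). or "whitespace, letters, dot" as separate matches, and [a-zA-Z]+ cannot span the spaces inside a title. The variable names are only illustrative.

```python
import re

# Shortened version of the second sample from the question.
text = "Beattie, G.S. (2005, November). Social Causes of Depression. Retrieved May 31, 2017"

# Pattern from the question (without the /.../ delimiters). The | means
# "either ). or whitespace + letters + dot", so each piece matches on its own.
attempt = re.findall(r"\)\.|\s[a-zA-Z]+\.", text)
print(attempt)  # [' G.', ').', ' Depression.']

# Pattern from this answer: ). then whitespace, then capture up to the next dot.
match = re.search(r"\)\.\s+([^.]+)", text)
title = match.group(1) if match else None
print(title)  # Social Causes of Depression
```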

In pandas, you may use it like this:

df['res_col'] = df['orig_col'].str.extract(r'\)\.\s+([^.]+)', expand=False)
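A minimal, self-contained sketch of that call, assuming a dataframe built from shortened versions of the two samples in the question (the URLs are cut off here only to keep the lines short; the column names follow the snippet above):

```python
import pandas as pd

# Shortened versions of the two samples from the question.
df = pd.DataFrame({
    "orig_col": [
        "American Psychological Association. (2016). Center for epidemiological "
        "studies depression (CESD). Retrieved December 7, 2016, from American "
        "Psychological Association",
        "Beattie, G.S. (2005, November). Social Causes of Depression. "
        "Retrieved May 31, 2017",
    ]
})

# With a single capture group and expand=False, str.extract returns a Series.
df["res_col"] = df["orig_col"].str.extract(r"\)\.\s+([^.]+)", expand=False)

titles = df["res_col"].tolist()
print(titles)
```

Rows where the pattern finds no match come back as NaN in res_col, so they can be filtered with dropna() or isna() if needed.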

Update as per comments

A more specific regex that allows known abbreviations (here, a.k.a.) inside the title is:

[\d)]\.\s*((?:\ba\.k\.a\.|[^.])+)

Details:

[\d)] - either a digit or )
\. - a dot
\s* - zero or more whitespace characters
((?:\ba\.k\.a\.|[^.])+) - Group 1: one or more occurrences of either the a.k.a. substring (starting at a word boundary) or any character other than a dot.
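To see the difference between the two patterns, here is a small sketch using the re module. The citation string is hypothetical, made up only to contain a.k.a. in the title; it is not from the question's data.

```python
import re

simple = re.compile(r"\)\.\s+([^.]+)")
extended = re.compile(r"[\d)]\.\s*((?:\ba\.k\.a\.|[^.])+)")

# Hypothetical citation, invented for illustration only.
text = "Doe, J. (2010). Major depression a.k.a. clinical depression. Retrieved May 1, 2011"

# The simple pattern stops at the first dot, which is inside the abbreviation.
simple_title = simple.search(text).group(1)
print(simple_title)    # Major depression a

# The extended pattern consumes a.k.a. as a unit and stops at the next dot.
extended_title = extended.search(text).group(1)
print(extended_title)  # Major depression a.k.a. clinical depression
```

Other abbreviations would presumably need to be added as further alternatives inside the non-capturing group, in the same way as a.k.a. is listed here.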
