Reputation: 383
Here are the text examples:
I want the bold part of the text, for which I tried:
/\)\.|\s[a-zA-Z]+\./
Here I look for ')', then '.', then a space, and then text up to the next '.'.
Basically, I want the text between two dots, since this is the title of the paper. It starts after either the author or the publication, followed by the year in brackets, as shown in the examples. But the above pattern doesn't give what I want.
Can anyone explain why it is not working, and what another way might be to find text like this in my dataframe column?
Upvotes: 1
Views: 89
Reputation: 626929
You may use the following regex with Series.str.extract:
\)\.\s+([^.]+)
See the regex demo.
Details
\)\. - a ). substring
\s+ - 1+ whitespaces
([^.]+) - Group 1: one or more chars other than a dot

In Pandas, you may use it like
df['res_col'] = df['orig_col'].str.extract(r'\)\.\s+([^.]+)', expand=False)
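To try this end to end, here is a minimal sketch. The sample rows are made up for illustration (the original examples are not shown here), and only the column names orig_col and res_col come from the line above.

```python
import pandas as pd

# Hypothetical sample data: an author or publication, a year in brackets,
# then the title terminated by a dot.
df = pd.DataFrame({
    "orig_col": [
        "Smith, J. (2015). Deep learning for text. Journal of Examples, 3, 1-10.",
        "Example Press (2018). A study of regex patterns. Example Press.",
        "No bracketed year here",
    ]
})

# With a single capture group and expand=False, str.extract returns a Series
# holding Group 1 of the first match in each row (NaN when nothing matches).
df["res_col"] = df["orig_col"].str.extract(r"\)\.\s+([^.]+)", expand=False)

print(df["res_col"].tolist())
```

Rows with no "). " sequence come back as NaN, so you may want to check for missing values before using res_col further.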
Update as per comments
A more specific regex that allows any known abbreviations is
[\d)]\.\s*((?:\ba\.k\.a\.|[^.])+)
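A quick way to check this pattern outside Pandas is with Python's re module. This is only a sketch: the helper name extract_title and the sample strings are hypothetical, chosen to show a title that contains a.k.a. and a title that follows a bare year.

```python
import re

# The more specific pattern: a digit or ")" followed by a dot, optional
# whitespace, then Group 1 with "a.k.a." allowed inside the captured text.
TITLE_RE = re.compile(r"[\d)]\.\s*((?:\ba\.k\.a\.|[^.])+)")

def extract_title(text):
    """Return Group 1 of the first match, or None if nothing matches."""
    match = TITLE_RE.search(text)
    return match.group(1) if match else None

# Hypothetical inputs, not taken from the original question.
samples = [
    "Smith, J. (2015). Deep learning a.k.a. DL for text. Journal of Examples.",
    "Example Press 2018. A study of regex patterns. Example Press.",
]

for sample in samples:
    print(extract_title(sample))
```

The same pattern string can be passed to Series.str.extract with expand=False in place of the simpler one.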
See another regex demo. Details:
[\d)] - either a digit or )
\. - a dot
\s* - 0 or more whitespaces
((?:\ba\.k\.a\.|[^.])+) - Group 1: one or more occurrences of the a.k.a. substring as a whole word or any char but a dot.
Upvotes: 2