loving_guy

Reputation: 383

Pattern matching using regex python

Here are the text examples:

I want the bold part of the text, for which I tried:

/\)\.|\s[a-zA-Z]+\./

Here I look for ')', then '.', then a space, and then text up to the next '.'.

Basically, I want the text between two dots, since that is the title of the paper. The title starts after either the author or the publication, with the year in brackets, as shown in the examples. But the pattern above doesn't give me what I want.

Can anyone explain why it is not working, and what other way there might be to find text like this in my dataframe column?

Upvotes: 1

Views: 89

Answers (2)

Wiktor Stribiżew

Reputation: 626929

You may use the following regex with Series.str.extract:

\)\.\s+([^.]+)

See the regex demo.

Details

  • \)\. - ). substring
  • \s+ - 1+ whitespaces
  • ([^.]+) - Group 1: one or more chars other than a dot

In pandas, you may use it like this:

df['res_col'] = df['orig_col'].str.extract(r'\)\.\s+([^.]+)', expand=False)
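
As a rough check outside pandas, the same pattern can be tried with Python's `re` module (`Series.str.extract` likewise returns the capture groups of the first match). The citation string below is made up for illustration and is not taken from the question's data. The sketch also runs the pattern from the question, to show that `|` splits it into two independent alternatives rather than a sequence, which is likely why it never returns the title.

```python
import re

# Hypothetical citation string, invented for this illustration.
sample = "Smith, J. (2019). Deep learning for text mining. Journal of Data, 12(3)."

# Pattern from the question: "\)\." OR "\s[a-zA-Z]+\." (two separate alternatives).
question_matches = re.findall(r"\)\.|\s[a-zA-Z]+\.", sample)

# Pattern from this answer: group 1 holds the text that follows ").".
m = re.search(r"\)\.\s+([^.]+)", sample)
title = m.group(1) if m else None

print(question_matches)  # short fragments, none of them the full title
print(title)
```

On this sample the question's pattern only yields fragments such as `).` and ` mining.`, while the capture group returns the text between `). ` and the next dot.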

Update as per comments

A more specific regex that allows any known abbreviations is

[\d)]\.\s*((?:\ba\.k\.a\.|[^.])+)

See another regex demo. Details:

  • [\d)] - either a digit or )
  • \. - a dot
  • \s* - 0 or more whitespaces
  • ((?:\ba\.k\.a\.|[^.])+) - Group 1: one or more occurrences of a.k.a. substring as a whole word or any char but a dot.
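
A minimal sketch of how this second pattern behaves, again using `re` and two invented citation strings (both are assumptions, not data from the question). The first contains `a.k.a.` inside the title; the second has the title preceded by a year without brackets. The helper name `extract_title` is hypothetical.

```python
import re

# Second pattern from this answer: a digit or ")" followed by a dot, optional
# whitespace, then a.k.a. (as a whole word) or any non-dot characters.
PATTERN = re.compile(r"[\d)]\.\s*((?:\ba\.k\.a\.|[^.])+)")

def extract_title(text):
    """Return group 1 of the first match, or None when nothing matches."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

# Hypothetical samples, invented for this illustration.
samples = [
    "Doe, A. (2020). Gradient descent a.k.a. steepest descent in practice. Proc. ML.",
    "Journal of Testing 2018. Regex patterns for citations. 5(2).",
]
titles = [extract_title(s) for s in samples]

for t in titles:
    print(t)
```

Other abbreviations would need to be added as further alternatives next to `\ba\.k\.a\.`, since any dot not covered by an alternative still ends the capture.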

Upvotes: 2

Krishna

Reputation: 481

Try this:

(?<=\)\.)[\w\s\(\)]*(?=\.)
  • (?<=\)\.) is a lookbehind that checks that the match is immediately preceded by ").".
  • [\w\s\(\)]* allows any word characters and whitespace, as well as the ( and ) characters.
  • (?=\.) is a lookahead that checks that the match is followed by a "." character.
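
A small sketch of this pattern with `re`, on two made-up citation strings (assumptions, not the question's data). Two things seem worth noting: the match starts right after `).`, so it includes the leading space and may need `strip()`, and the character class does not cover punctuation such as `:` or `-`, so titles containing those are not matched.

```python
import re

# Lookbehind for ").", then word/space/paren characters, then lookahead for ".".
PATTERN = re.compile(r"(?<=\)\.)[\w\s\(\)]*(?=\.)")

# Hypothetical samples, invented for this illustration.
plain = "Smith, J. (2019). Deep learning for text mining. Journal of Data, 12(3)."
with_colon = "Lee, K. (2021). Graphs: a survey. Net. Sci."

m_plain = PATTERN.search(plain)
raw = m_plain.group(0) if m_plain else None   # keeps the space after ")."
clean = raw.strip() if raw is not None else None

m_colon = PATTERN.search(with_colon)          # ":" is outside the character class

print(repr(raw))
print(clean)
print(m_colon)
```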

You can test it here


Upvotes: 0
