qwerty

Reputation: 105

Extracting multiple words from a string using regex

I am trying to extract all the references from part of a paper as a list. For now I've just got a paragraph and set it as a string.

Is it possible to do this with regex in Python? I want to extract multiple words from the string, but so far I have only managed to extract the years, single words, or characters, not an entire reference at once. There are also quite a lot of conditions, as the references vary in format, for example:

text="As shown by Macelroy et al. (1967), bla bla. Podar & Reysenbach (2006) also researched ... Another example is ... (Valdes et al. 2008). Most notably .... Edwards, Bartlett & Stirling (2003)."

So some have only the year in brackets, some are entirely enclosed in brackets, some have multiple capitalised words, some have "et al", and so on. Is it possible to define all of these requirements in one search and then print the results together?

I know there are websites or programs I can put the paper into to extract all the references for me, but I would like to know how to do it myself.

Thanks

NB: Edited to clarify how the references would be embedded in the string

Upvotes: 0

Views: 694

Answers (1)

KC.

Reputation: 3107

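import re

text = (
    "As shown by Macelroy et al. (1967), bla bla. Podar & Reysenbach (2006) "
    "also researched ... Another example is ... (Valdes et al. 2008). "
    "Most notably .... Edwards, Bartlett & Stirling (2003)."
)
# An uppercase letter, then the body of the reference, then a four-digit year.
pattern = r"([A-Z])([^A-Z)]+|[^.,]+)([0-9]{4})"
# findall returns one tuple of groups per match; join them and drop any "(".
refs = ["".join(groups).replace("(", "") for groups in re.findall(pattern, text)]
print(refs)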
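  1. ([A-Z]) matches a single uppercase letter, the start of a surname.
  2. [^A-Z)]+|[^.,]+ covers two situations:
    • a run of characters with no uppercase letter and no ")", which allows the "." in "et al."
    • a run of characters with no "," or ".", which allows further capitalised names; "," and "." are excluded because allowing them could match a whole sentence
  3. [0-9]{4} matches the four-digit year that ends the reference.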
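One limitation worth noting: neither alternative in step 2 allows both an uppercase letter and a comma, so for "Edwards, Bartlett & Stirling (2003)" the pattern above returns only "Bartlett & Stirling 2003". A possible way around this is to describe the author list explicitly. The following is only a sketch under some assumptions that are not part of the original answer: every surname is a single capitalised ASCII word, and the names `CITATION` and `extract_citations` are made up for illustration.

```python
import re

# Hypothetical pattern, not the one from the answer above.
# Assumption: each surname is one capitalised ASCII word.
CITATION = re.compile(
    r"""
    (?P<authors>
        [A-Z][A-Za-z'-]+                      # first surname
        (?:,\s+[A-Z][A-Za-z'-]+)*             # further surnames after commas
        (?:\s+(?:&|and)\s+[A-Z][A-Za-z'-]+)?  # final surname after & or and
        (?:\s+et\s+al\.?)?                    # optional et al.
    )
    ,?\s+\(?                                  # separator and optional opening bracket
    (?P<year>\d{4})                           # four-digit year
    """,
    re.VERBOSE,
)


def extract_citations(text):
    """Return each citation as 'authors year', without brackets."""
    return [f"{m.group('authors')} {m.group('year')}" for m in CITATION.finditer(text)]


sample = (
    "As shown by Macelroy et al. (1967), bla bla. Podar & Reysenbach (2006) "
    "also researched ... Another example is ... (Valdes et al. 2008). "
    "Most notably .... Edwards, Bartlett & Stirling (2003)."
)
print(extract_citations(sample))
```

This will still produce false positives for any capitalised word directly followed by a year (for example "In 2008"), and it does not handle years with suffixes such as "2008a", so treat it as a starting point rather than a complete solution.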

Upvotes: 1
