Reputation: 105
I am trying to extract all the references from part of a paper as a list. For now I've just got a paragraph and set it as a string.
I was wondering if it is possible to do this using regex in Python. I want to be able to extract multiple words from the string, but so far all I've been able to extract is years, single words, or characters, not an entire reference at once. There are also quite a lot of conditions, as the references can vary in format, for example:
text="As shown by Macelroy et al. (1967), bla bla. Podar & Reysenbach (2006) also researched ... Another example is ... (Valdes et al. 2008). Most notably .... Edwards, Bartlett & Stirling (2003)."
So some have the number within a bracket, some are entirely encompassed by brackets, some have multiple capitalised words, some have "et al" and so on. Is it possible to define all of these requirements within one search, and then print these all out together?
I know there are websites or programs I can put the paper into to extract all the references for me, but I would like to know how to do it myself.
Thanks
NB: Edited to clarify how the references would be embedded in the string
Upvotes: 0
Views: 694
Reputation: 3107
import re
t = """
As shown by Macelroy et al. (1967), bla bla. Podar
& Reysenbach (2006) also researched ... Another example is ... (Valdes et al. 2008). Most notably .... Edwards, Bartlett & Stirling (2003).
"""
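# One capital letter, then either text with no capitals or ")" or text with no "." or ",", then a 4-digit year.
pattern = r"([A-Z])([^A-Z)]+|[^.,]+)([0-9]{4})"
# findall returns one tuple of groups per match; join the groups and drop the "(".
f = ["".join(result).replace("(", "") for result in re.findall(pattern, t, re.S)]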
print(f)
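The middle group, [^A-Z)]+|[^.,]+, covers two situations: text that contains no capital letter and no ")", or text that contains no "," and no ".". The "," and "." are excluded because otherwise the pattern may match a whole sentence. [0-9]{4} makes the match end with 4 digits (the year).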
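As far as I can tell, the pattern above starts the last match at "Bartlett", so "Edwards" is dropped from the three-author reference. A more explicit variant could describe the author list piece by piece. The following is only a minimal sketch under some assumptions: surnames are single capitalised words, authors are joined by ", " and " & " or followed by "et al.", and the year is always 4 digits. The names NAME, AUTHORS, CITATION and extract_references are made up for this example.

```python
import re

text = (
    "As shown by Macelroy et al. (1967), bla bla. Podar & Reysenbach (2006) "
    "also researched ... Another example is ... (Valdes et al. 2008). "
    "Most notably .... Edwards, Bartlett & Stirling (2003)."
)

# Assumption: a surname is one capitalised word (letters, apostrophes, hyphens).
NAME = r"\b[A-Z][A-Za-z'-]+"
# One surname, optional ", Surname" repeats, then optionally "& Surname" or "et al."
AUTHORS = rf"{NAME}(?:, {NAME})*(?:\s+&\s+{NAME}|\s+et al\.)?"
# Authors followed by a 4-digit year, with or without a bracket around the year.
CITATION = re.compile(rf"({AUTHORS})\s*\(?(\d{{4}})\)?")


def extract_references(s):
    """Return each citation found in s, normalised to 'Authors (Year)'."""
    return [f"{authors} ({year})" for authors, year in CITATION.findall(s)]


for ref in extract_references(text):
    print(ref)
```

This will still give false positives for any capitalised word directly followed by a year (for example "In 2008"), and it does not handle multi-word surnames such as "van der Berg", so the results need checking against the real paper.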
Upvotes: 1