sahase

Reputation: 3

How to extract text between two headings with regex, requires complicated non-capture groups

I want to pull abstracts out of a large corpus of scientific papers using a Python script. The papers are all saved as strings in a large CSV. I want to do something like this: extracting text between two headers. I can write a regex to find the 'Abstract' heading. However, finding the next section heading is proving difficult. Headers vary wildly from paper to paper. They can be ALL CAPS or Just Capitalized. They can be one word or a long phrase, and they can span two lines. They are usually followed by one or two newlines. This is what I came up with:

abst = re.findall(r'(?:ABSTRACT\s*\n+|Abstract\s*\n+)(.*?)((?:[A-Z]+|(?:\n(?:[A-Z]+|(?:[A-Z][a-z]+\s*)+)\n+)',row[0],re.DOTALL)

Here is an example of an abstract:

'...\nAbstract\nFactorial Hidden Markov Models (FHMMs) are powerful models for sequential\ndata but they do not scale well with long sequences. We propose a scalable inference and learning algorithm for FHMMs that draws on ideas from the stochastic\nvariational inference, neural network and copula literatures. Unlike existing approaches, the proposed algorithm requires no message passing procedure among\nlatent variables and can be distributed to a network of computers to speed up learning. Our experiments corroborate that the proposed algorithm does not introduce\nfurther approximation bias compared to the proven structured mean-field algorithm,\nand achieves better performance with long sequences and large FHMMs.\n\n1\n\nIntroduction\n\n...'

So I'm trying to find 'Abstract' and 'Introduction' and pull out the text that is between them. However it could be 'ABSTRACT' and 'INTRODUCTION', or ABSTRACT and 'A SINGLE LAYER NETWORK AND THE MEAN FIELD\nAPPROXIMATION\n'

Help?

Upvotes: 0

Views: 1710

Answers (1)

MBaas

Reputation: 7530

Recognizing the next section is a bit vague. Perhaps we can rely on the Abstract section ending with two newlines?

ABSTRACT\n(.*)\n\n
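In Python, this first idea could be tried roughly as follows. This is only a sketch: the `sample` string is a shortened, made-up stand-in for a paper, and the lazy `.*?`, `re.DOTALL` and `re.IGNORECASE` are my assumptions about how the pattern would need to be applied (without `re.DOTALL` the dot does not cross the line breaks inside the abstract, and a greedy `.*` would run to the last blank line in the paper).

```python
import re

# Hypothetical, shortened sample in the shape of the question's example.
sample = (
    "...\nAbstract\nFirst sentence of the abstract.\n"
    "Second sentence.\n\n1\n\nIntroduction\n\n..."
)

# re.DOTALL lets '.' match newlines; the lazy '.*?' stops at the first
# blank line after the heading instead of the last one in the text.
pattern = re.compile(r"ABSTRACT\n(.*?)\n\n", re.IGNORECASE | re.DOTALL)

match = pattern.search(sample)
abstract = match.group(1) if match else None
print(abstract)
```

This stops at the first blank line, so it would cut an abstract short if the abstract itself contains a paragraph break.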

Or maybe we'll just assume that the next section title starts with an uppercase letter and is followed by any number of word characters. (That's rather vague too, and it assumes there'll be no \n\n within the Abstract.)

ABSTRACT\n(.*)\n\n\U[\w\s]*\n\n
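A possible Python rendering of this pattern is sketched below. As far as I know, Python's `re` module does not accept `\U` as an "uppercase character" class, so `[A-Z]` is swapped in here as an assumption. The scoped flag `(?i:...)` (Python 3.6+) makes only the word ABSTRACT case-insensitive, so that `[A-Z]` still means "uppercase". The `sample` text is made up.

```python
import re

# Hypothetical sample: an all-caps heading directly after the abstract.
sample = (
    "...\nABSTRACT\nWe propose a method.\nIt scales well.\n\n"
    "INTRODUCTION\n\nBody text..."
)

# (?i:ABSTRACT)  heading in any letter case (scoped flag, Python 3.6+)
# (.*?)          the abstract, as short as possible
# \n\n[A-Z]      blank line, then a title starting with an uppercase letter
# [\w\s]*\n\n    rest of the title (word characters and whitespace only)
pattern = re.compile(r"(?i:ABSTRACT)\n(.*?)\n\n[A-Z][\w\s]*\n\n", re.DOTALL)

match = pattern.search(sample)
abstract = match.group(1) if match else None
print(abstract)
```

Because the title part only allows word characters and whitespace, a following paragraph that contains punctuation such as a full stop is not mistaken for a heading.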

Maybe that stimulates further fiddling on your end... Feel free to post examples where this did not match; maybe we can refine it step by step. N.B.: as Wiktor pointed out, I could not use the case-insensitive modifiers here, so the whole regex should be used with the switches for case-insensitive matching.

Update 1: the challenge here is really how to identify that a new section has begun, and not to confuse that with paragraph breaks within the Abstract. Perhaps that can also be dealt with by replacing the rather tolerant [\w\s]* with [\w\s]{1,100}, which would only recognize text in a new paragraph as the title of the "abstract successor" if it had between 2 and 101 characters. (Note: the minimum is 2 characters although the lower limit is set to 1, because the \U (uppercase character) accounts for the first one.)

ABSTRACT\n(.*)\n\n\U[\w\s]{1,100}\n\n
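To see what the length limit changes, here is a small comparison sketch in Python. The assumptions are the same as for any Python use of these patterns: `[A-Z]` stands in for `\U`, `re.DOTALL` is set, and `(?i:...)` handles the heading's letter case. The test text is invented: its abstract has a second paragraph that is long and happens to contain no punctuation.

```python
import re

UNBOUNDED = re.compile(r"(?i:ABSTRACT)\n(.*?)\n\n[A-Z][\w\s]*\n\n", re.DOTALL)
BOUNDED = re.compile(r"(?i:ABSTRACT)\n(.*?)\n\n[A-Z][\w\s]{1,100}\n\n", re.DOTALL)

# Hypothetical abstract with two paragraphs; the second one is about 170
# characters of words and spaces only, so it looks like a very long "title".
para2 = "Second paragraph " + "word " * 30 + "end"
text = "Abstract\nFirst paragraph.\n\n" + para2 + "\n\nIntroduction\n\nBody text."

def first_group(pattern, s):
    """Return the captured abstract, or None if the pattern does not match."""
    m = pattern.search(s)
    return m.group(1) if m else None

short = first_group(UNBOUNDED, text)  # stops at the paragraph break
full = first_group(BOUNDED, text)     # keeps both paragraphs

print(repr(short))
print(len(full))
```

The bounded pattern also still accepts a two-line heading such as the one from the question, as long as the heading stays within the length limit and is followed by a blank line.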

Upvotes: 1
