Kevin

Reputation: 3239

Regex - Get text between two strings

I have a large text file which contains many abstracts (7k of them). I want to separate them. They have the following properties:

It begins with a number followed immediately by a period:

123.

and it always ends with:

[PubMed - indexed for MEDLINE]

It would be even better if I could also extract the title and abstract from each separated string. I am fine with splitting the articles first and then splitting the text of each one.

In the example the title is the third line:

Effects of propofol and isoflurane on haemodynamics and the inflammatory response in cardiopulmonary bypass surgery.

The abstract is on the 8th line:

Cardiopulmonary bypass (CPB) causes reperfusion injury...

I have tried the following regex on this text:

Regex:

[0-9\.]*\s*(((?![0-9\.]*|MEDLINE).)+)\s*MEDLINE

Text:

1. Br J Biomed Sci. 2015;72(3):93-101.

Effects of propofol and isoflurane on haemodynamics and the inflammatory response
in cardiopulmonary bypass surgery.

Sayed S, Idriss NK, Sayyedf HG, Ashry AA, Rafatt DM, Mohamed AO, Blann AD.

Cardiopulmonary bypass (CPB) causes reperfusion injury that when most severe is
clinically manifested as a systemic inflammatory response syndrome. The
anaesthetic propofol may have anti-inflammatory properties that may reduce such a
response. We hypothesised differing effects of propofol and isoflurane on
inflammatory markers in patients having CBR Forty patients undergoing elective
CPB were randomised to receive either propofol or isoflurane for maintenance of
anaesthesia. CRP, IL-6, IL-8, HIF-1α (ELISA), CD11 and CD18 expression (flow
cytometry), and haemoxygenase (HO-1) promoter polymorphisms (PCR/electrophoresis)
were measured before anaesthetic induction, 4 hours post-CPB, and 24 hours later.
There were no differences in the 4 hours changes in CRP, IL-6, IL-8 or CD18
between the two groups, but those in the propofol group had higher HIF-1α (P =
0.016) and lower CD11 expression (P = 0.026). After 24 hours, compared to the
isoflurane group, the propofol group had significantly lower levels of CRP (P <
0.001), IL-6 (P < 0.001) and IL-8 (P < 0.001), with higher levels CD11 (P =
0.009) and CD18 (P = 0.002) expression. After 24 hours, patients on propofol had 
increased expression of shorter HO-1 GT(n) repeats than patients on isoflurane (P
= 0.001). Use of propofol in CPB is associated with a less adverse inflammatory
profile than is isofluorane, and an increased up-regulation of HO-1. This
supports the hypothesis that propofol has anti-inflammatory activity.

PMID: 26510263  [PubMed - indexed for MEDLINE]

Upvotes: 0

Views: 180

Answers (2)

Stephan

Reputation: 43013

Two useful solutions have been proposed by Mariano and stribizhev:

Mariano's solution: use the split method, with the typical record ending as the delimiter:

(?m)\[PubMed - indexed for MEDLINE\]$

DEMO: http://ideone.com/Qw5ss2

Works with Java 4+.
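A minimal Java sketch of the split approach above, assuming the whole file has already been read into a single string. The class name AbstractSplitter, the method splitRecords, and the two sample records are hypothetical placeholders, not part of the original answer. The sketch uses generics and the diamond operator, so as written it needs Java 7+ even though the regex itself works on older versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class AbstractSplitter {
    // Delimiter from the answer: the closing tag, anchored at the end of a line.
    private static final Pattern END_OF_RECORD =
            Pattern.compile("(?m)\\[PubMed - indexed for MEDLINE\\]$");

    // Splits the text into records and drops whitespace-only chunks,
    // such as the newline left over after the last delimiter.
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        for (String chunk : END_OF_RECORD.split(text)) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                records.add(trimmed);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample data in the same layout as the question's text.
        String text =
                "1. Journal A. 2015.\n\nFirst title.\n\nAuthor A.\n\nFirst abstract.\n\n"
              + "PMID: 1  [PubMed - indexed for MEDLINE]\n\n"
              + "2. Journal B. 2015.\n\nSecond title.\n\nAuthor B.\n\nSecond abstract.\n\n"
              + "PMID: 2  [PubMed - indexed for MEDLINE]\n";
        for (String record : splitRecords(text)) {
            System.out.println("---");
            System.out.println(record);
        }
    }
}
```

Each returned record still contains its PMID line but no longer contains the delimiter itself, since split discards the matched text.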

stribizhev's solution: extract the title and abstract directly from the text

(?m)^\s*\d+\..*\R{2}                 # Get to the title
(?<title>[^\n]*(?:\n(?!\n)[^\n]*)*)  # Get title
\R{2}                                # Get to the authors
[^\n]*(?:\n(?!\n)[^\n]*)*            # Consume authors
(?<abstract>[^\[]*(?:\[(?!PubMed[ ]-[ ]indexed[ ]for[ ]MEDLINE\])[^\[]*)*) #Grab abstract

DEMO: https://regex101.com/r/sG2yQ2/2

Requires Java 8+ (the \R linebreak matcher was added in Java 8).
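Here is a rough Java sketch of how the extraction pattern above might be used. It is an adaptation rather than a verbatim copy. The pattern is assembled from string pieces with Java comments instead of inline # comments, because Java's COMMENTS flag, as far as I know, also ignores whitespace inside character classes, which would affect the [ ] tokens. The title and author parts use \n throughout, so the sketch assumes Unix line endings. The class name AbstractExtractor, the method titleAndAbstract, and the final step that strips the trailing PMID line are my own assumptions, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AbstractExtractor {
    private static final Pattern RECORD = Pattern.compile(
            "(?m)^\\s*\\d+\\..*\\R{2}"                 // citation line, then a blank line
          + "(?<title>[^\\n]*(?:\\n(?!\\n)[^\\n]*)*)"  // title, possibly wrapped over lines
          + "\\R{2}"                                   // blank line before the authors
          + "[^\\n]*(?:\\n(?!\\n)[^\\n]*)*"            // authors, consumed but not captured
          + "(?<abstract>[^\\[]*(?:\\[(?!PubMed - indexed for MEDLINE\\])[^\\[]*)*)");

    // Returns {title, abstract} for the first record found, or null if none matches.
    static String[] titleAndAbstract(String text) {
        Matcher m = RECORD.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Join a wrapped title into a single line.
        String title = m.group("title").trim().replaceAll("\\s+", " ");
        // The abstract group runs up to the closing tag, so it still ends with
        // the "PMID: ..." line; stripping it here is an assumption of this sketch.
        String abstractText = m.group("abstract").trim()
                .replaceFirst("\\s*PMID:\\s*\\d+\\s*$", "");
        return new String[] { title, abstractText };
    }

    public static void main(String[] args) {
        // Shortened version of the record from the question.
        String record =
                "1. Br J Biomed Sci. 2015;72(3):93-101.\n\n"
              + "Effects of propofol and isoflurane on haemodynamics and the inflammatory response\n"
              + "in cardiopulmonary bypass surgery.\n\n"
              + "Sayed S, Idriss NK, Sayyedf HG, Ashry AA, Rafatt DM, Mohamed AO, Blann AD.\n\n"
              + "Cardiopulmonary bypass (CPB) causes reperfusion injury...\n\n"
              + "PMID: 26510263  [PubMed - indexed for MEDLINE]";
        String[] result = titleAndAbstract(record);
        if (result != null) {
            System.out.println("Title:    " + result[0]);
            System.out.println("Abstract: " + result[1]);
        }
    }
}
```

To process the whole file, the same matcher could be driven with a while (m.find()) loop instead of a single find() call.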

Upvotes: 1

Dmitry

Reputation: 1283

Try this:

"^[0-9]+\..*\s+(.*)\s+.*\s+((?:\s|.)*?)\[PubMed - indexed for MEDLINE\]"

The first group is the title; the second is the abstract.
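A small Java sketch of how this pattern might be applied, under some assumptions of mine. The MULTILINE flag is added so that ^ can match at the start of every record in a larger text. The title and the author list are each assumed to fit on one line, since the pattern allots a single (.*) or .* to each of them; a title wrapped over two lines, as in the question's sample, would shift the groups. The class GroupExtractor, the method extract, and the sample records are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupExtractor {
    // The pattern from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern RECORD = Pattern.compile(
            "^[0-9]+\\..*\\s+(.*)\\s+.*\\s+((?:\\s|.)*?)\\[PubMed - indexed for MEDLINE\\]",
            Pattern.MULTILINE);

    // Returns one {title, abstract} pair per record found in the text.
    static List<String[]> extract(String text) {
        List<String[]> results = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            // Group 2 runs up to the closing tag, so it also contains the PMID line.
            results.add(new String[] { m.group(1).trim(), m.group(2).trim() });
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up records with single-line titles and author lists.
        String text =
                "1. Journal A. 2015.\n\nFirst title.\n\nAuthor A, Author B.\n\n"
              + "First abstract text.\n\nPMID: 1  [PubMed - indexed for MEDLINE]\n\n"
              + "2. Journal B. 2015.\n\nSecond title.\n\nAuthor C.\n\n"
              + "Second abstract text.\n\nPMID: 2  [PubMed - indexed for MEDLINE]\n";
        for (String[] pair : extract(text)) {
            System.out.println("Title:    " + pair[0]);
            System.out.println("Abstract: " + pair[1]);
        }
    }
}
```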

Upvotes: 1
