Reputation: 3239
I have a large text file which contains many abstracts (7k of them). I want to separate them. They have the following properties:
a number at the begining with a period right after
123.
and it always ends in:
[PubMed - indexed for MEDLINE]
It would be even better if I can get the title and abstract out of the separated string. I am fine if I have to split the articles first then split the texts.
In the example the title is the third line:
Effects of propofol and isoflurane on haemodynamics and the inflammatory response in cardiopulmonary bypass surgery.
The abstract is on the 8th line:
Cardiopulmonary bypass (CPB) causes reperfusion injury...
I have tried to use the following code for this text
Regex:
[0-9\.]*\s*(((?![0-9\.]*|MEDLINE).)+)\s*MEDLINE
Text:
1. Br J Biomed Sci. 2015;72(3):93-101.
Effects of propofol and isoflurane on haemodynamics and the inflammatory response
in cardiopulmonary bypass surgery.
Sayed S, Idriss NK, Sayyedf HG, Ashry AA, Rafatt DM, Mohamed AO, Blann AD.
Cardiopulmonary bypass (CPB) causes reperfusion injury that when most severe is
clinically manifested as a systemic inflammatory response syndrome. The
anaesthetic propofol may have anti-inflammatory properties that may reduce such a
response. We hypothesised differing effects of propofol and isoflurane on
inflammatory markers in patients having CBR Forty patients undergoing elective
CPB were randomised to receive either propofol or isoflurane for maintenance of
anaesthesia. CRP, IL-6, IL-8, HIF-1α (ELISA), CD11 and CD18 expression (flow
cytometry), and haemoxygenase (HO-1) promoter polymorphisms (PCR/electrophoresis)
were measured before anaesthetic induction, 4 hours post-CPB, and 24 hours later.
There were no differences in the 4 hours changes in CRP, IL-6, IL-8 or CD18
between the two groups, but those in the propofol group had higher HIF-1α (P =
0.016) and lower CD11 expression (P = 0.026). After 24 hours, compared to the
isoflurane group, the propofol group had significantly lower levels of CRP (P <
0.001), IL-6 (P < 0.001) and IL-8 (P < 0.001), with higher levels CD11 (P =
0.009) and CD18 (P = 0.002) expression. After 24 hours, patients on propofol had
increased expression of shorter HO-1 GT(n) repeats than patients on isoflurane (P
= 0.001). Use of propofol in CPB is associated with a less adverse inflammatory
profile than is isofluorane, and an increased up-regulation of HO-1. This
supports the hypothesis that propofol has anti-inflammatory activity.
PMID: 26510263 [PubMed - indexed for MEDLINE]
Upvotes: 0
Views: 180
Reputation: 43013
Two useful solutions have been proposed by Mariano and stribizhev:
split
method with the typical end(?m)\[PubMed - indexed for MEDLINE\]$
DEMO : http://ideone.com/Qw5ss2
Java 4+
(?m)^\s*\d+\..*\R{2} # Get to the title
(?<title>[^\n]*(?:\n(?!\n)[^\n]*)*) # Get title
\R{2} # Get to the authors
[^\n]*(?:\n(?!\R)[^\R]*)* # Consume authors
(?<abstract>[^\[]*(?:\[(?!PubMed[ ]-[ ]indexed[ ]for[ ]MEDLINE\])[^\[]*)*) #Grab abstract
DEMO: https://regex101.com/r/sG2yQ2/2
Java 8+
Upvotes: 1
Reputation: 1283
Try this:
"^[0-9]+\..*\s+(.*)\s+.*\s+((?:\s|.)*?)\[PubMed - indexed for MEDLINE\]"
First group would be title. Second would be abstract.
Upvotes: 1