Reputation: 33
I would appreciate your guidance with the following problem. I need to bulk extract only the article titles from a series of publications. I receive the files as PDFs, extract only the first page (done), bulk convert to TXT (done), and I am stuck on the last phase.
The structure of the TXTs is as follows:
--- JOURNAL of MEDICINE and LIFE
JML | REVIEW
The role of novel poly (ADP-ribose) inhibitors in the treatment of locally advanced and metastatic Her-2/neu negative breast cancer with inherited germline BRCA1/2 mutations. A review of the literature
Authors list, etc, etc ---
I need only the title (the line between "JML | REVIEW" and the authors list) from each file. I can do the iteration; that is not a problem.
With the code below I tried to identify paragraph 1:
    data = file.read()
    array1 = []
    sp = data.split("\n\n")
    for number, paragraph in enumerate(sp, 1):
        if number == 1:
            array1 += [paragraph]
    print(array1)
No results whatsoever...
The idea is that I need to save only the titles in a file (could be TXT) as I need this list for another purpose.
Many thanks!
Upvotes: 0
Views: 1756
Reputation: 163362
You might read the whole file using .read() and use a pattern with a capture group to match from JML to Authors.
^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b
The pattern matches:
- ^ : Start of the string (or of any line, since re.MULTILINE is used below)
- JML\s*\| : Match JML, optional whitespace chars and |
- .*\s*\r?\n : Match the rest of the line, optional whitespace chars and a newline
- ( : Capture group 1
- (?:.*\r?\n)*? : Match as few lines as possible
- ) : Close group 1
- Authors\b : Match Authors followed by a word boundary

For example:
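    import os
    import re

    pattern = r"^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b"
    array1 = []

    for file in os.listdir():
        # Skip directories and anything that is not a converted TXT file.
        if not file.endswith(".txt"):
            continue
        with open(file, "r") as data:
            # Group 1 holds the title line(s), including the trailing newline.
            array1.extend(re.findall(pattern, data.read(), re.MULTILINE))

    print(array1)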
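Since the titles also need to end up in a file, here is a minimal sketch of that last step. It reuses the pattern above; the helper names extract_titles and save_titles, the UTF-8 encoding, and the "*.txt" glob are assumptions for illustration, not something given in the question. The sample text is the one from the question, with the title shortened.

```python
import re
from pathlib import Path

# Same pattern as above: capture the lines between the "JML | ..." line and "Authors".
PATTERN = re.compile(r"^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b", re.MULTILINE)


def extract_titles(text):
    """Return every captured title, with line breaks collapsed to single spaces."""
    return [" ".join(match.split()) for match in PATTERN.findall(text)]


def save_titles(folder, out_file):
    """Collect titles from every .txt file in `folder` and write one per line."""
    titles = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: the converted files are UTF-8; adjust if yours are not.
        titles.extend(extract_titles(path.read_text(encoding="utf-8")))
    Path(out_file).write_text("\n".join(titles) + "\n", encoding="utf-8")
    return titles


# Sample text modelled on the question (title shortened here).
sample = (
    "JOURNAL of MEDICINE and LIFE\n"
    "JML | REVIEW\n"
    "The role of novel poly (ADP-ribose) inhibitors. A review of the literature\n"
    "Authors list, etc, etc\n"
)
print(extract_titles(sample))
```

Calling save_titles("pages", "titles.out") would then write one title per line, where "pages" stands for whatever folder holds the converted files. Collapsing whitespace with split() and join() also puts a title that wraps over several lines back on one line.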
Upvotes: 1