Extract only the title of an article from a TXT file in Python

Question

I would appreciate your guidance with the following problem. I need to bulk extract only the article titles from a series of publications. I receive the files as PDFs, extract only the first page (done), and bulk convert to TXT (done), but I am stuck on the last phase.

The structure of the TXTs is as follows:

--- JOURNAL of MEDICINE and LIFE

JML | REVIEW

The role of novel poly (ADP-ribose) inhibitors in the treatment of locally advanced and metastatic Her-2/neu negative breast cancer with inherited germline BRCA1/2 mutations. A review of the literature

Authors list, etc, etc ---

I need only the title (the paragraph after "JML | REVIEW") from each file. I can do the iteration, that is not a problem.

With the code below I tried to identify paragraph 1:

    data = file.read()
    array1 = []
    sp = data.split("\n\n")
    for number, paragraph in enumerate(sp, 1):
        if number == 1:
            array1 += [paragraph]
            print(array1)

No results whatsoever...

The idea is that I need to save only the titles in a file (could be TXT) as I need this list for another purpose.

Many thanks!

The fourth bird · Accepted Answer

You might read the whole file using .read() and use a pattern with a capture group to match from JML to Authors.

^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b

The pattern matches:

^ Start of a line (re.MULTILINE is used below)
JML\s*\| Match JML, optional whitespace chars and |
.*\s*\r?\n Match the rest of the line, optional whitespace chars and a newline
( Capture group 1
(?:.*\r?\n)*? Match whole lines, as few as possible
) Close group 1
Authors\b Match Authors followed by a word boundary
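To see what group 1 captures, here is a minimal sketch that runs the pattern against a sample modelled on the structure in the question. The sample string and the shortened title are assumptions for illustration only. Note that the capture keeps the blank line before Authors, so it is stripped.

```python
import re

# Pattern from the answer, with the \r?\n line breaks written out.
PATTERN = re.compile(r"^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b", re.MULTILINE)

# Hypothetical sample following the layout in the question (title shortened).
sample = (
    "JOURNAL of MEDICINE and LIFE\n"
    "\n"
    "JML | REVIEW\n"
    "\n"
    "The role of novel poly (ADP-ribose) inhibitors. A review of the literature\n"
    "\n"
    "Authors list, etc, etc\n"
)

match = PATTERN.search(sample)
# Group 1 includes the blank line before "Authors", so strip the whitespace.
title = match.group(1).strip() if match else None
print(title)
```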

For example:

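import os
import re

# Capture everything between the "JML | ..." line and the "Authors" line.
pattern = r"^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b"
array1 = []

for file in os.listdir():
    # Skip directories and anything that is not a converted TXT file.
    if not file.endswith(".txt"):
        continue
    with open(file, "r") as data:
        # Strip the blank line captured before "Authors".
        array1 += [title.strip() for title in re.findall(pattern, data.read(), re.MULTILINE)]
print(array1)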
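Since the question asks for the titles to end up in a file, the same pattern could be wrapped as below. This is only a sketch under some assumptions: the helper names extract_title and save_titles are made up for this example, the files are assumed to be UTF-8, and a title that wraps over several lines is joined into one.

```python
import re
from pathlib import Path

PATTERN = re.compile(r"^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b", re.MULTILINE)


def extract_title(text):
    """Return the text between the 'JML |' line and 'Authors', or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Collapse line breaks so a wrapped title becomes a single line.
    return " ".join(match.group(1).split())


def save_titles(folder, out_file):
    """Write one title per line to out_file and return the list of titles."""
    titles = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: UTF-8 input; undecodable bytes are replaced, not fatal.
        title = extract_title(path.read_text(encoding="utf-8", errors="replace"))
        if title:
            titles.append(title)
    Path(out_file).write_text("\n".join(titles) + "\n", encoding="utf-8")
    return titles
```

Calling save_titles("first_pages", "titles.txt") would then read every TXT file in a (hypothetical) first_pages folder and write the list to titles.txt. Files where the pattern does not match are skipped silently, so it may be worth comparing the number of titles with the number of input files.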
