OviSele

Reputation: 33

Extract only the title of an article from a TXT file in Python

I would appreciate your guidance with the following problem. I need to bulk-extract only the article titles from a series of publications. I receive the files as PDFs, extract only the first page (done), bulk-convert to TXT (done), and I am stuck on the last step.

The structure of the TXTs is as follows:

--- JOURNAL of MEDICINE and LIFE

JML | REVIEW

The role of novel poly (ADP-ribose) inhibitors in the treatment of locally advanced and metastatic Her-2/neu negative breast cancer with inherited germline BRCA1/2 mutations. A review of the literature

Authors list, etc, etc ---

I need only the title (in bold) from each file. I can do the iteration; that is not a problem.

With the code below I tried to identify paragraph 1:

    data = file.read()
    array1 = []
    sp = data.split("\n\n")
    for number, paragraph in enumerate(sp, 1):
        if number == 1:
            array1 += [paragraph]
            print (array1)

No results whatsoever...

I need to save only the titles to a file (TXT would do), because I need this list for another purpose.

Many thanks!

Upvotes: 0

Views: 1756

Answers (1)

The fourth bird

Reputation: 163362

You might read the whole file using .read() and use a pattern with a capture group to match everything between the JML line and the line that starts with Authors.

    ^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b

The pattern matches:

  • ^ Start of a line (with re.MULTILINE; without that flag, start of the string)
  • JML\s*\| Match JML, optional whitespace chars and |
  • .*\s*\r?\n Match the rest of the line, optional whitespace chars and a newline
  • ( Capture group 1
    • (?:.*\r?\n)*? Match whole lines, as few as possible
  • ) Close group 1
  • Authors\b Match Authors followed by a word boundary
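
To see what the capture group actually holds, here is a minimal sketch that runs the pattern on an in-memory string. The sample text is an assumption that mirrors the layout from the question, with a shortened title. Note that group 1 also contains the blank line before Authors, so the result is stripped.

```python
import re

PATTERN = r"^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b"

# Hypothetical sample mirroring the layout shown in the question.
sample = (
    "JOURNAL of MEDICINE and LIFE\n"
    "\n"
    "JML | REVIEW\n"
    "\n"
    "The role of novel inhibitors. A review of the literature\n"
    "\n"
    "Authors list, etc, etc\n"
)

# re.MULTILINE lets ^ match at the start of the "JML | REVIEW" line.
match = re.search(PATTERN, sample, re.MULTILINE)

# Group 1 still carries the blank line before "Authors", so strip it.
title = match.group(1).strip() if match else None
print(title)
```

If a file does not follow this layout (no JML line, or no line starting with Authors), match is None and the sketch yields None instead of raising.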

For example:

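    import os
    import re

    # With re.MULTILINE, ^ anchors at the start of each line.
    pattern = r"^JML\s*\|.*\s*\r?\n((?:.*\r?\n)*?)Authors\b"
    array1 = []

    for file in os.listdir():
        # Skip directories and anything that is not a converted TXT file.
        if not file.endswith(".txt"):
            continue
        with open(file, "r") as data:
            matches = re.findall(pattern, data.read(), re.MULTILINE)
        # The capture group keeps the trailing blank line, so strip it.
        array1.extend(m.strip() for m in matches)
    print(array1)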
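
Since the question asks for the titles to end up in a file, here is a minimal sketch of that last step. The helper names normalize_title and save_titles, the output name titles.txt and the demo list are all assumptions for illustration; in practice you would pass in the list built above (array1). Titles that were wrapped over several lines during the PDF conversion are joined into one line each.

```python
import os
import tempfile


def normalize_title(raw):
    """Collapse line breaks and runs of whitespace into single spaces."""
    return " ".join(raw.split())


def save_titles(titles, out_path):
    """Write one normalized title per line, skipping empty entries."""
    cleaned = [normalize_title(t) for t in titles]
    cleaned = [t for t in cleaned if t]
    with open(out_path, "w", encoding="utf-8") as out:
        for title in cleaned:
            out.write(title + "\n")
    return cleaned


# Hypothetical input: raw capture groups, one of them wrapped over two lines.
demo_titles = ["The role of novel\ninhibitors\n\n", "  ", "Second title\n"]

# Written to a temporary directory here; use any path you like instead.
out_path = os.path.join(tempfile.mkdtemp(), "titles.txt")
saved = save_titles(demo_titles, out_path)
print(saved)
```

Writing one title per line keeps the file easy to read back later with a plain loop over its lines.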

Upvotes: 1
