nekovolta

Reputation: 516

Extract data with a regular expression in Python

I am trying to extract data from a txt file using Python (see the sample text below). Note that the title can be on a single line, split across two lines, or even split by a blank line in the middle (as in TITLE1).

What I would like to achieve is to extract the information to store in a table like this:

Code | Title | Opening date | Deadline | Budget
TITLE-SDFSD-DFDS-SFDS-01-01 | This is the title 1 that is split in two lines with a blank line in the middle | 15-Apr-21 | 26-Aug-21 | EUR 20.00 million
TITLE-SDFSD-DFDS-SFDS-01-02 | This is the title2 in one single line | 15-Mar-21 | 17-Aug-21 | EUR 15.00 million
TITLE-SDFSD-DFDS-SFDS-01-03 | This is the title3 that is too long and takes two lines | 15-May-21 | 26-Sep-21 | EUR 5.00 million

I managed to identify the codes with this piece of code:

import re

with open('doubt2.txt','r', encoding="utf-8") as f:
    f_contents = f.read() 
    
pattern = re.compile(r'TITLE-.+-[0-9]{2}-[0-9]{2}(?!,)\S{1}')
matches = pattern.finditer(f_contents)

for match in matches:
    print(match)

And I get this result:

<re.Match object; span=(160, 188), match='TITLE-SDFSD-DFDS-SFDS-01-01:'>
<re.Match object; span=(669, 697), match='TITLE-SDFSD-DFDS-SFDS-01-02;'>
<re.Match object; span=(1066, 1094), match='TITLE-SDFSD-DFDS-SFDS-01-03:'>

My question is how to use what I matched with the regular expression to extract the rest of the data. Can you help me, please?

Sample text:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam id diam posuere, eleifend diam at, condimentum justo. Pellentesque mollis a diam id consequat.

TITLE-SDFSD-DFDS-SFDS-01-01: This is the title 1 that

is split into two lines with a blank line in the middle

Conditions Pellentesque blandit scelerisque pellentesque. Sed nec quam purus. Quisque nec tellus sed neque accumsan lacinia sit amet sit amet tellus. Etiam venenatis nibh vel pellentesque elementum. Nullam eget tortor quam. Morbi sed leo et arcu aliquet luctus.

Opening date 15 Apr 2021

Deadline 26 Aug 2021

Indicative budget: The total indicative budget for the topic is EUR 20.00 million.

TITLE-SDFSD-DFDS-SFDS-01-02; This is the title2 in one single line

Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.

Opening date 15 March 2021

Deadline 17 Aug 2021

Indicative budget: The total indicative budget for the topic is EUR 15.00 million.

TITLE-SDFSD-DFDS-SFDS-01-03: This is the title3 that is too long and takes two lines

Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.

Opening date 15 May 2021

Deadline 26 Sep 2021

Indicative budget: The total indicative budget for the topic is EUR 5.00 million.

Upvotes: 2

Views: 726

Answers (2)

Barmar

Reputation: 782693

Use a regular expression with capturing groups. Use the re.DOTALL flag so that .* can match across line breaks, which lets you capture multi-line titles. And use lazy quantifiers so that each match stops at the first possible point instead of running on into later records.

import csv
import re

# f_contents holds the text of the file, read as in the question
pattern = re.compile(r'^(TITLE-.+?-\d{2}-\d{2})\S*\s*(.*?)^Conditions.*?^Opening date (\d{1,2} \w+ \d{4})\s*?^Deadline (\d{1,2} \w+ \d{4})\s*^Indicative budget:.*?(EUR [\d.]+ \w+)', re.MULTILINE | re.DOTALL)
matches = pattern.finditer(f_contents)

# newline="" lets the csv module control the line endings itself
with open("result.csv", "w", newline="", encoding="utf-8") as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Code', 'Title', 'Opening date', 'Deadline', 'Budget'])
    for match in matches:
        code, title, opening, deadline, budget = match.groups()
        # collapse the line breaks inside a multi-line title into single spaces
        csvfile.writerow([code, ' '.join(title.split()), opening, deadline, budget])
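To see why both re.DOTALL and the lazy quantifier matter here, the following is a minimal sketch on a made-up two-record string (the TITLE-AB codes and the text are invented for illustration and are not from the question's file):

```python
import re

# Two tiny records; the first title is split by a blank line.
text = (
    "TITLE-AB-01-01: first part\n\nsecond part\nConditions x\n"
    "TITLE-AB-01-02: other\nConditions y\n"
)

# Without DOTALL, "." stops at a line break, so nothing reaches "Conditions".
no_dotall = re.findall(r'^TITLE-\S+ (.*?)^Conditions', text, re.MULTILINE)

# With DOTALL and a lazy ".*?", each match stops at the first "Conditions".
lazy = re.findall(r'^TITLE-\S+ (.*?)^Conditions', text, re.MULTILINE | re.DOTALL)

# With DOTALL and a greedy ".*", the first match runs to the last "Conditions".
greedy = re.findall(r'^TITLE-\S+ (.*)^Conditions', text, re.MULTILINE | re.DOTALL)

print(no_dotall)
print(lazy)
print(greedy)
```

In this sketch the lazy version yields one title per record, while the greedy version swallows the second record into the first title.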


Upvotes: 2

The fourth bird

Reputation: 163632

You could get the matches using capture groups.

Note that you can write (?!,)\S as [^\s,]
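As a quick check of that equivalence, here is a small sketch that runs both spellings against a few invented codes (the TITLE-X strings are illustrative only):

```python
import re

# Codes ending in ":", ";", "," and a space respectively.
samples = ["TITLE-X-01-01:", "TITLE-X-01-02;", "TITLE-X-01-03,", "TITLE-X-01-04 "]

# Original form: a non-whitespace character that is not a comma.
lookahead = re.compile(r'TITLE-.+-[0-9]{2}-[0-9]{2}(?!,)\S')
# Equivalent form: a negated character class excluding whitespace and comma.
char_class = re.compile(r'TITLE-.+-[0-9]{2}-[0-9]{2}[^\s,]')

results = [
    (bool(lookahead.fullmatch(s)), bool(char_class.fullmatch(s)))
    for s in samples
]
print(results)
```

Both patterns accept the codes ending in ":" and ";" and reject the ones ending in a comma or a space.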

Based on the lines in the example:

^(TITLE-.+?-[0-9]{2}-[0-9]{2})[^\s,] (.*(?:\r?\n(?![A-Z]).*)*)(?:\r?\n(?!Opening).*)*\r?\nOpening date (\d+ .*)(?:\r?\n(?!Deadline).*)*\r?\nDeadline (\d+ .*)(?:\r?\n(?!Indicative budget:).*)*\r?\nIndicative budget: .*?(EUR \d+(?:\.\d+)? \w+)

Explanation

  • ^ Start of a line (the pattern is used with re.MULTILINE)
  • (TITLE-.+?-[0-9]{2}-[0-9]{2}) Capture group 1, match the code
  • [^\s,] Match any non whitespace char except a comma
  • (.*(?:\r?\n(?![A-Z]).*)*) Capture group 2, match the rest of the line and all following lines that do not start with an uppercase char A-Z
  • (?:\r?\n(?!Opening).*)*\r?\nOpening date Match all lines till Opening date
  • (\d+ .*) Capture group 3, match 1+ digits, a space and the rest of the line
  • (?:\r?\n(?!Deadline).*)*\r?\nDeadline Match all lines until Deadline
  • (\d+ .*) Capture group 4, match 1+ digits, a space and the rest of the line
  • (?:\r?\n(?!Indicative budget:).*)*\r?\nIndicative budget: .*? Match all lines until Indicative budget:
  • (EUR \d+(?:\.\d+)? \w+) Capture group 5, match EUR, the number and 1+ word characters
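The repeated (?:\r?\n(?!X).*)* construct from the list above can be tried in isolation. This is a minimal sketch on a made-up snippet, not the full pattern, and the snippet text is invented for illustration:

```python
import re

text = (
    "Conditions lorem ipsum\n"
    "\n"
    "more text\n"
    "Opening date 15 Apr 2021\n"
    "Deadline 26 Aug 2021\n"
)

# (?:\r?\n(?!Opening).*)* consumes whole lines for as long as the next line
# does not start with "Opening"; the following \r?\n then steps onto the
# "Opening date" line. No re.DOTALL is needed, since "." never has to cross
# a line break here.
pattern = re.compile(
    r'^Conditions.*(?:\r?\n(?!Opening).*)*\r?\nOpening date (\d+ .*)',
    re.MULTILINE,
)

m = pattern.search(text)
opening = m.group(1)
print(opening)
```

Because every skipped line is checked with a negative lookahead, the pattern cannot run past the first "Opening date" line, which is what keeps each record separate.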


Then you could, for example, load the matches into a pandas DataFrame:

import re
import pandas as pd

with open('doubt2.txt', 'r', encoding="utf-8") as f:
    f_contents = f.read()

pattern = re.compile(r"^(TITLE-.+?-[0-9]{2}-[0-9]{2})[^\s,] (.*(?:\r?\n(?![A-Z]).*)*)(?:\r?\n(?!Opening).*)*\r?\nOpening date (\d+ .*)(?:\r?\n(?!Deadline).*)*\r?\nDeadline (\d+ .*)(?:\r?\n(?!Indicative budget:).*)*\r?\nIndicative budget: .*?(EUR \d+(?:\.\d+)? \w+)", re.MULTILINE)
matches = pattern.findall(f_contents)
df = pd.DataFrame(matches, columns=['Code', 'Title', 'Opening date', 'Deadline', 'Budget'])
# regex=True is needed in pandas 2.0+, where str.replace treats the pattern literally by default
df['Title'] = df['Title'].str.replace(r'[\r\n]+', ' ', regex=True).str.strip()
print(df)

Output

            Code          Title   Opening date     Deadline         Budget
0  TITLE-SDFS...  This is th...    15 Apr 2021  26 Aug 2021  EUR 20.00 ...
1  TITLE-SDFS...  This is th...  15 March 2021  17 Aug 2021  EUR 15.00 ...
2  TITLE-SDFS...  This is th...    15 May 2021  26 Sep 2021  EUR 5.00 m...
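The dates above are kept exactly as they appear in the file, whereas the table in the question uses the form 15-Apr-21. One possible way to convert them is sketched below. The helper name normalize_date is hypothetical, and the sketch assumes English month names (the default C locale), written either abbreviated or in full:

```python
from datetime import datetime

def normalize_date(text):
    """Convert '15 Apr 2021' or '15 March 2021' to the '15-Apr-21' style."""
    # Try the abbreviated month name first, then the full month name.
    for fmt in ("%d %b %Y", "%d %B %Y"):
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%d-%b-%y")
        except ValueError:
            continue
    # Leave values in an unrecognized format untouched.
    return text

print(normalize_date("15 Apr 2021"))
print(normalize_date("15 March 2021"))
```

With the DataFrame from the answer, this could be applied per column, for example with df['Opening date'].map(normalize_date).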

Upvotes: 1
