wuddadid

Reputation: 90

Regular expression to extract chunks of text from a text file?

I need to extract headings and the chunk of text beneath each of them from a text file in Python using a regular expression, but I'm finding it difficult.

I converted this PDF to text so that it now looks like this:

img

So far I have been able to get all the numerical headers (12.4.5.4, 12.4.6, 13, 13.1, 13.1.1, 13.1.2) using the following regex:

import re

with open('data/single.txt', encoding='UTF-8') as file:
    for line in file:
        # Match a leading section number such as "13" or "12.4.5.4".
        headings = re.findall(r'^\d+(?:\.\d+)*\.?', line)
        print(headings)

I just don't know how to get the worded part of those headings or the paragraph of text beneath them.

EDIT - Here is the text:

I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014

60601-1 © IEC:2005 60601-1 © IEC:2005

– 337 – – 169 –

12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).

Compliance is checked by inspection of the RISK MANAGEMENT FILE.

12.4.6 Diagnostic or therapeutic acoustic pressure When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with diagnostic or therapeutic acoustic pressure.

Compliance is checked by inspection of the RISK MANAGEMENT FILE.

13 * HAZARDOUS SITUATIONS and fault conditions

13.1 Specific HAZARDOUS SITUATIONS

13.1.1 When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the ME EQUIPMENT.

The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is described in 4.7.

13.1.2 The following HAZARDOUS SITUATIONS shall not occur: – emission of flames, molten metal, poisonous or ignitable substance in hazardous

quantities;

– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired; –

temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when measured as described in 11.1.3; temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be touched, exceeding the allowable values in Table 23 when measured and adjusted as described in 11.1.3;

– exceeding the allowable values for “other components and materials” identified in Table 22 times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31. In all other cases, the allowable values of Table 22 apply.

Temperatures shall be measured using the method described in 11.1.3.

The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of flames, molten metal or ignitable substances, shall not be applied to parts and components where: – The construction or the supply circuit limits the power dissipation in SINGLE FAULT

CONDITION to less than 15 W or the energy dissipation to less than 900 J.

Upvotes: 3

Views: 2192

Answers (4)

Shahad

Reputation: 25

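import re

import pdfplumber

pdf_to_string = ""

# Concatenate the text of every page into one string.
with pdfplumber.open(r"sample.pdf") as pdf:
    for page in pdf.pages:
        page_text = page.extract_text() or ""  # guard against a page with no text
        print(page_text)
        # The newline keeps a heading at the top of a page at the start of a line.
        pdf_to_string += page_text + "\n"

# Each match is a heading line plus every following line up to the next heading.
matches = re.findall(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', pdf_to_string, re.M)
for match in matches:
    # Keep only the sections whose first 50 characters contain the target word.
    if "word_to_extract" in match[:50]:
        print(match)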

This solution extracts every section whose heading has the same format as the headings in the question, then prints the required heading together with the paragraphs that follow it.
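The `i[:50]` check above can also hit a word that sits early in the body rather than in the heading. A minimal sketch of a stricter filter, assuming the text is already in a string; `find_sections` and the sample text are made up for illustration and are not part of the answer:

```python
import re

# Same pattern as above: a heading line plus all lines up to the next heading.
SECTION = re.compile(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', re.M)


def find_sections(text, keyword):
    """Return the blocks whose heading line (first line) contains `keyword`."""
    hits = []
    for block in SECTION.findall(text):
        heading_line = block.split('\n', 1)[0]
        if keyword.lower() in heading_line.lower():
            hits.append(block.strip())
    return hits


# Shortened, invented sample in the same shape as the question's text.
text = (
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "RISKS associated with acoustic pressure.\n"
    "\n"
    "13.1 Specific HAZARDOUS SITUATIONS\n"
    "The word acoustic appears here only in the body.\n"
)

for section in find_sections(text, "acoustic"):
    print(section)
```

Comparing against the first line only means a section is selected by its heading, whatever the heading's length.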

Upvotes: 0

wuddadid

Reputation: 90

Thanks to their detailed answers and helpful explanations I ended up combining parts of both @The-fourth-bird's code and @Emma's code into this regex which seems to work nicely for what I need.

(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)(?=^\d+(?:\.\d+)*\s+(?![a-z]))

Here is the REGEX DEMO.

It does what I want: it splits the (numerical heading), (worded heading) and (body of text) into groups separated by commas, which lets me separate them into columns in Excel by using the custom delimiter ), ( and some other post-processing.

The nice thing about this new regex is that it skips numbered headings that are just references and not actually headings as seen here:

img
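As a rough sketch of how the three groups might be written out without the `), (` delimiter step, the standard `csv` module can take them directly. The sample text is invented, and the in-memory buffer stands in for a real output file:

```python
import csv
import io
import re

# The regex from this answer; re.M lets ^ match at the start of every line.
PATTERN = re.compile(
    r'(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)'
    r'(?=^\d+(?:\.\d+)*\s+(?![a-z]))',
    re.M,
)

# Invented sample; "12.4.5.3 for details." is a reference that happens to
# start a line, not a heading.
sample = (
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "When applicable, see 12.4.5.2 and\n"
    "12.4.5.3 for details.\n"
    "\n"
    "13 * HAZARDOUS SITUATIONS and fault conditions\n"
    "\n"
    "13.1 Specific HAZARDOUS SITUATIONS\n"
)

# One row per match: (numerical heading, worded heading, body of text).
rows = [tuple(part.strip() for part in m.groups()) for m in PATTERN.finditer(sample)]

buffer = io.StringIO()  # stand-in for open('sections.csv', 'w', newline='')
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

In this sketch the line starting with `12.4.5.3 for` stays inside the body of 12.4.6 because a lowercase letter follows the number. The final section (13.1 here) is not captured, since the trailing lookahead needs another heading after it.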

Upvotes: 0

Emma

Reputation: 27723

Maybe

^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)

comes close to extracting the text you're after, as far as I can guess.


Here we simply look for lines that start with

^(\d+(?:\.\d+)*)\s+

then collect anything afterwards using

([\s\S]*?)

up to the next line that starts with

(?=^\d+(?:\.\d+)*)

Depending on the input, that may leave one last element with no heading after it, which we collect using

^(\d+(?:\.\d+)*)\s+([\s\S]*)

and join to the prior expression with alternation (|).

Even though this method is simple to code, it is fairly slow, because the lazy [\s\S]*? re-tests the lookahead at every character. The other answer here is much better if time complexity is a concern, which it likely is.
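If the cost of the lookahead matters, one alternative that is not taken from either answer is to walk the text line by line and start a new section whenever a line begins with a section number. A minimal sketch, with `parse_sections` and the sample as illustrative assumptions:

```python
import re

# A heading line: a dotted number, optionally followed by whitespace and text.
HEADING = re.compile(r'^(\d+(?:\.\d+)*)(?:\s+(.*))?$')


def parse_sections(lines):
    """Group lines into (number, heading text, body) tuples."""
    sections = []  # each entry: [number, heading_text, body_lines]
    for raw in lines:
        line = raw.rstrip('\r\n')
        match = HEADING.match(line)
        if match:
            sections.append([match.group(1), (match.group(2) or '').strip(), []])
        elif sections:
            # Lines before the first heading are ignored.
            sections[-1][2].append(line)
    return [(num, head, '\n'.join(body).strip()) for num, head, body in sections]


# Invented sample in the same shape as the text in the question.
sample = (
    "60601-1 header\n"
    "\n"
    "12.4.6  Diagnostic or therapeutic acoustic pressure\n"
    "When applicable, see 12.4.5.2.\n"
    "\n"
    "Compliance is checked.\n"
    "\n"
    "13.1.1\n"
    "When applying the SINGLE FAULT CONDITIONS.\n"
)

result = parse_sections(sample.splitlines())
for number, heading, body in result:
    print(repr(number), repr(heading), repr(body))
```

Each line is tested once, so the work grows linearly with the input. Like the regex above, this sketch treats any line that starts with a number followed by whitespace as a heading.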

Demo 1

Test

import re

regex = r"^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)"
string = """

I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014

60601-1 © IEC:2005 
60601-1 © IEC:2005

– 337 – 
– 169 –

12.4.5.4  Other ME EQUIPMENT producing diagnostic or therapeutic radiation 
When  applicable,  the  MANUFACTURER  shall  address  in  the  RISK  MANAGEMENT PROCESS  the 
RISKS associated  with  ME EQUIPMENT  producing  diagnostic or therapeutic radiation  other  than 
for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3). 

Compliance is checked by inspection of the RISK MANAGEMENT FILE.

12.4.6  Diagnostic or therapeutic acoustic pressure 
When  applicable,  the  MANUFACTURER  shall  address  in  the  RISK  MANAGEMENT PROCESS  the 
RISKS associated with diagnostic or therapeutic acoustic pressure. 

Compliance is checked by inspection of the RISK MANAGEMENT FILE.

13  *  HAZARDOUS SITUATIONS and fault conditions

13.1  Specific HAZARDOUS SITUATIONS

*  General 

13.1.1 
When  applying  the  SINGLE  FAULT  CONDITIONS  as  described  in  4.7  and listed  in  13.2,  one  at  a 
time,  none  of  the  HAZARDOUS  SITUATIONS  in  13.1.2  to  13.1.4  (inclusive)  shall  occur  in  the 
ME EQUIPMENT.

The failure of any one component at a time, which could result in a  HAZARDOUS  SITUATION, is 
described in 4.7. 

*  Emissions, deformation of ENCLOSURE or exceeding maximum temperature 

13.1.2 
The following HAZARDOUS SITUATIONS shall not occur: 
–  emission  of  flames,  molten  metal,  poisonous  or  ignitable  substance  in  hazardous 

quantities; 

–  deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired; 
– 

temperatures  of  APPLIED  PARTS exceeding  the  allowed  values  identified  in  Table  24  when 
measured as described in 11.1.3; 
temperatures  of  ME EQUIPMENT  parts  that  are  not  APPLIED  PARTS but  are  likely  to  be 
touched,  exceeding  the  allowable  values  in  Table  23  when  measured  and  adjusted  as 
described in 11.1.3; 

– 

–  exceeding the allowable values for “other components and materials” identified in Table 22 
times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31. 
In all other cases, the allowable values of Table 22 apply. 

Temperatures shall be measured using the method described in 11.1.3. 

The  SINGLE  FAULT  CONDITIONS  in  4.7,  8.1 b),  8.7.2  and  13.2.2,  with  regard  to  the  emission  of 
flames,  molten  metal  or  ignitable  substances,  shall  not  be  applied  to  parts  and  components 
where: 
–  The  construction  or  the  supply  circuit  limits  the  power  dissipation  in  SINGLE  FAULT 

CONDITION to less than 15 W or the energy dissipation to less than 900 J. 

"""

print(re.findall(regex, string, re.M))

Output

[('12.4.5.4', 'Other ME EQUIPMENT producing diagnostic or therapeutic radiation \nWhen applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the \nRISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than \nfor diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3). \n\nCompliance is checked by inspection of the RISK MANAGEMENT FILE.\n\n', '', ''), ('12.4.6', 'Diagnostic or therapeutic acoustic pressure \nWhen applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the \nRISKS associated with diagnostic or therapeutic acoustic pressure. \n\nCompliance is checked by inspection of the RISK MANAGEMENT FILE.\n\n', '', ''), ('13', '* HAZARDOUS SITUATIONS and fault conditions\n\n', '', ''), ('13.1', 'Specific HAZARDOUS SITUATIONS\n\n* General \n\n', '', ''), ('13.1.1', 'When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a \ntime, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the \nME EQUIPMENT.\n\nThe failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is \ndescribed in 4.7. \n\n* Emissions, deformation of ENCLOSURE or exceeding maximum temperature \n\n', '', ''), ('', '', '13.1.2', 'The following HAZARDOUS SITUATIONS shall not occur: \n– emission of flames, molten metal, poisonous or ignitable substance in hazardous \n\nquantities; \n\n– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired; \n– \n\ntemperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when \nmeasured as described in 11.1.3; \ntemperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be \ntouched, exceeding the allowable values in Table 23 when measured and adjusted as \ndescribed in 11.1.3; \n\n– \n\n– exceeding the allowable values for “other components and materials” identified in Table 22 \ntimes 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31. \nIn all other cases, the allowable values of Table 22 apply. \n\nTemperatures shall be measured using the method described in 11.1.3. \n\nThe SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of \nflames, molten metal or ignitable substances, shall not be applied to parts and components \nwhere: \n– The construction or the supply circuit limits the power dissipation in SINGLE FAULT \n\nCONDITION to less than 15 W or the energy dissipation to less than 900 J. \n\n')]
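Because the pattern is an alternation of two branches, each tuple in the output has four slots, two of them empty. One possible variation (not the pattern from this answer) folds the end-of-input case into the lookahead with `\Z`, so every match yields just a number and a body. A sketch with a shortened, invented sample:

```python
import re

# Variation: the lookahead accepts either the next heading or the end of the
# input (\Z), so the second branch of the alternation is no longer needed.
regex = r"^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*|\Z)"

sample = """
I.S. EN header line

12.4.6  Diagnostic or therapeutic acoustic pressure
When applicable, see 12.4.5.2.

13.1.1
When applying the SINGLE FAULT CONDITIONS.
"""

# Each findall item is a (number, body) pair; strip trailing blank lines.
pairs = [(number, body.strip()) for number, body in re.findall(regex, sample, re.M)]
for number, body in pairs:
    print(repr(number), repr(body))
```

The heading text is still the first line of each body here, as in the original output, so separating it out remains a later step.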

Upvotes: 1

The fourth bird

Reputation: 163207

You could use your pattern and match a space after it followed by the rest of the line.

Then repeat matching all following lines that do not start with a heading.

^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*
  • ^\d+(?:\.\d+)* Your pattern to match a heading, followed by a space
  • .* Match any char except a newline 0+ times
  • (?: Non capturing group
    • \r?\n Match a newline
    • (?! Negative lookahead, assert what is directly to the right is not
      • \d+(?:\.\d+)* The heading pattern, followed by a space
    • ) Close lookahead
    • .* Match any char except a newline 0+ times
  • )* Close the non capturing group and repeat 0+ times to match all the lines
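A minimal sketch of how this pattern might be applied in Python, assuming the text is already in a string. The `split_section` helper and the shortened sample are illustrative and not part of the answer:

```python
import re

# Heading number, one space, the rest of the line, then every following line
# that does not itself start with "number + space".
SECTION = re.compile(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', re.M)

# Shortened, invented sample in the same shape as the question's text.
sample = """12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, see 12.4.5.2 and 12.4.5.3.

Compliance is checked by inspection.

13 * HAZARDOUS SITUATIONS and fault conditions

13.1 Specific HAZARDOUS SITUATIONS
"""


def split_section(block):
    """Split one match into (number, heading line, body). Hypothetical helper."""
    first_line, _, body = block.partition('\n')
    number, _, heading = first_line.partition(' ')
    return number, heading.strip(), body.strip()


sections = [split_section(block) for block in SECTION.findall(sample)]
for number, heading, body in sections:
    print(repr(number), repr(heading), repr(body))
```

The numbers mentioned inside the body (12.4.5.2 and 12.4.5.3) do not start a line, so they do not end the section early.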

Regex demo

Upvotes: 1
