Reputation: 90
I need to extract headings and the chunk of text beneath them from a text file in Python using a regular expression, but I'm finding it difficult.
I converted this PDF to text so that it now looks like this:
So far I have been able to get all the numerical headers (12.4.5.4, 12.4.5.6, 13, 13.1, 13.1.1, 13.1.12) using the following regex:
import re

with open('data/single.txt', encoding='UTF-8') as file:
    for line in file:
        headings = re.findall(r'^\d+(?:\.\d+)*\.?', line)
        print(headings)
I just don't know how to get the worded part of those headings or the paragraph of text beneath them.
EDIT - Here is the text:
I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014
60601-1 © IEC:2005 60601-1 © IEC:2005
– 337 – – 169 –
12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
12.4.6 Diagnostic or therapeutic acoustic pressure When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with diagnostic or therapeutic acoustic pressure.
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
13 * HAZARDOUS SITUATIONS and fault conditions
13.1 Specific HAZARDOUS SITUATIONS
13.1.1 When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the ME EQUIPMENT.
The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is described in 4.7.
13.1.2 The following HAZARDOUS SITUATIONS shall not occur: – emission of flames, molten metal, poisonous or ignitable substance in hazardous
quantities;
– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired; –
temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when measured as described in 11.1.3; temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be touched, exceeding the allowable values in Table 23 when measured and adjusted as described in 11.1.3;
–
– exceeding the allowable values for “other components and materials” identified in Table 22 times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31. In all other cases, the allowable values of Table 22 apply.
Temperatures shall be measured using the method described in 11.1.3.
The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of flames, molten metal or ignitable substances, shall not be applied to parts and components where: – The construction or the supply circuit limits the power dissipation in SINGLE FAULT
CONDITION to less than 15 W or the energy dissipation to less than 900 J.
Upvotes: 3
Views: 2192
Reputation: 25
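import pdfplumber
import re

pdfToString = ""
with pdfplumber.open(r"sample.pdf") as pdf:
    for page in pdf.pages:
        # Extract once per page; fall back to "" if a page yields no text
        text = page.extract_text() or ""
        print(text)
        # Keep a newline between pages so lines from adjacent pages do not fuse
        pdfToString += text + "\n"

# A heading line plus every following line that does not start a new heading
matches = re.findall(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', pdfToString, re.M)
for i in matches:
    # Only look for the word near the start of the match, i.e. in the heading
    if "word_to_extract" in i[:50]:
        print(i)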
This solution extracts every section whose heading has the same format as the headings in the question, then prints only the sections whose first 50 characters (the heading) contain the required word, together with the paragraphs that follow.
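To try the filtering step without a PDF at hand, here is a minimal sketch that wraps it in a function. The name find_sections, the window parameter and the sample text are my own for illustration; the sample only stands in for whatever page.extract_text() returns.

```python
import re

# Heading pattern from the answer: a dotted number, a space, the rest of the
# line, then every following line that does not itself start with a heading.
SECTION_RE = re.compile(
    r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*',
    re.M,
)


def find_sections(text, keyword, window=50):
    """Return the sections whose first `window` characters contain `keyword`."""
    return [m for m in SECTION_RE.findall(text) if keyword in m[:window]]


# Made-up sample text standing in for the extracted PDF text.
sample = (
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "When applicable, the MANUFACTURER shall address the RISKS.\n"
    "13.1 Specific HAZARDOUS SITUATIONS\n"
    "See 4.7 for details.\n"
)

hits = find_sections(sample, "acoustic")
for section in hits:
    print(section)
```

The 50-character window is a rough proxy for "the heading"; a long heading or a short one followed by body text may need a different value.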
Upvotes: 0
Reputation: 90
Thanks to their detailed answers and helpful explanations, I ended up combining parts of both @The-fourth-bird's code and @Emma's code into this regex, which seems to work nicely for what I need.
(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)(?=^\d+(?:\.\d+)*\s+(?![a-z]))
Here is the REGEX DEMO.
It does what I want: it splits the (numerical heading), (worded heading) and (body of text) into groups separated by commas, which lets me separate them into columns in Excel by using the custom delimiter ), ( and some other post-processing.
The nice thing about this new regex is that it skips numbered references that are not actually headings: the (?![a-z]) lookahead rejects a number that is followed by lower-case text.
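A small sketch of how this could be driven from Python, assuming re.M so that ^ matches at every line start. The sample text is made up, and writing CSV with the csv module is my own substitute for the ), ( delimiter trick, not what I originally did.

```python
import csv
import io
import re

PATTERN = re.compile(
    r'(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)'
    r'(?=^\d+(?:\.\d+)*\s+(?![a-z]))',
    re.M,
)

# Made-up sample; the third line starts with a reference, not a heading.
sample = (
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "The MANUFACTURER shall address the RISKS described in\n"
    "12.4.5 and related clauses.\n"
    "13 HAZARDOUS SITUATIONS and fault conditions\n"
    "13.1 Specific HAZARDOUS SITUATIONS\n"
)

# One row per match: (number, worded heading, body), whitespace collapsed.
rows = [
    tuple(" ".join(part.split()) for part in m.groups())
    for m in PATTERN.finditer(sample)
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Note that the trailing lookahead requires another heading after each section, so in this sample the final heading (13.1) is not returned. Appending a dummy heading to the text, or adding an end-of-text alternative to the lookahead, would be one way around that.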
Upvotes: 0
Reputation: 27723
Maybe,
^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)
might come close to extracting the text you're after, as far as I can guess.
Here we simply look for lines that start with
^(\d+(?:\.\d+)*)\s+
then collect everything afterwards using
([\s\S]*?)
up to the next line that starts with
(?=^\d+(?:\.\d+)*)
Depending on the input, one last element may be left over with no heading after it, which we collect using
^(\d+(?:\.\d+)*)\s+([\s\S]*)
and join to the prior expression with an alternation (|).
Even though this method is simple to code, it is fairly slow: the lazy [\s\S]*? re-tests the lookahead at every character it consumes. If running time is a concern, which it likely is, the other answer here is a better choice.
import re
regex = r"^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)"
string = """
I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014
60601-1 © IEC:2005
60601-1 © IEC:2005
– 337 –
– 169 –
12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than
for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with diagnostic or therapeutic acoustic pressure.
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
13 * HAZARDOUS SITUATIONS and fault conditions
13.1 Specific HAZARDOUS SITUATIONS
* General
13.1.1
When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a
time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the
ME EQUIPMENT.
The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is
described in 4.7.
* Emissions, deformation of ENCLOSURE or exceeding maximum temperature
13.1.2
The following HAZARDOUS SITUATIONS shall not occur:
– emission of flames, molten metal, poisonous or ignitable substance in hazardous
quantities;
– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired;
–
temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when
measured as described in 11.1.3;
temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be
touched, exceeding the allowable values in Table 23 when measured and adjusted as
described in 11.1.3;
–
– exceeding the allowable values for “other components and materials” identified in Table 22
times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31.
In all other cases, the allowable values of Table 22 apply.
Temperatures shall be measured using the method described in 11.1.3.
The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of
flames, molten metal or ignitable substances, shall not be applied to parts and components
where:
– The construction or the supply circuit limits the power dissipation in SINGLE FAULT
CONDITION to less than 15 W or the energy dissipation to less than 900 J.
"""
print(re.findall(regex, string, re.M))
[('12.4.5.4', 'Other ME EQUIPMENT producing diagnostic or therapeutic radiation \nWhen applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the \nRISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than \nfor diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3). \n\nCompliance is checked by inspection of the RISK MANAGEMENT FILE.\n\n', '', ''), ('12.4.6', 'Diagnostic or therapeutic acoustic pressure \nWhen applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the \nRISKS associated with diagnostic or therapeutic acoustic pressure. \n\nCompliance is checked by inspection of the RISK MANAGEMENT FILE.\n\n', '', ''), ('13', '* HAZARDOUS SITUATIONS and fault conditions\n\n', '', ''), ('13.1', 'Specific HAZARDOUS SITUATIONS\n\n* General \n\n', '', ''), ('13.1.1', 'When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a \ntime, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the \nME EQUIPMENT.\n\nThe failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is \ndescribed in 4.7. \n\n* Emissions, deformation of ENCLOSURE or exceeding maximum temperature \n\n', '', ''), ('', '', '13.1.2', 'The following HAZARDOUS SITUATIONS shall not occur: \n– emission of flames, molten metal, poisonous or ignitable substance in hazardous \n\nquantities; \n\n– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired; \n– \n\ntemperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when \nmeasured as described in 11.1.3; \ntemperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be \ntouched, exceeding the allowable values in Table 23 when measured and adjusted as \ndescribed in 11.1.3; \n\n– \n\n– exceeding the allowable values for “other components and materials” identified in Table 22 \ntimes 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31. \nIn all other cases, the allowable values of Table 22 apply. \n\nTemperatures shall be measured using the method described in 11.1.3. \n\nThe SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of \nflames, molten metal or ignitable substances, shall not be applied to parts and components \nwhere: \n– The construction or the supply circuit limits the power dissipation in SINGLE FAULT \n\nCONDITION to less than 15 W or the energy dissipation to less than 900 J. \n\n')]
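Because of the alternation, each result is a 4-tuple in which only one pair of groups is filled. A possible way to tidy that up is sketched below; the sections helper and the sample text are my own, and it assumes the first line after the number is the worded heading, which does not hold for entries such as 13.1.1 where the number sits alone on its line.

```python
import re

REGEX = re.compile(
    r"^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)"
    r"|^(\d+(?:\.\d+)*)\s+([\s\S]*)",
    re.M,
)


def sections(text):
    """Yield (number, title, body) for each numbered section in `text`."""
    for a, b, c, d in REGEX.findall(text):
        # Groups 1-2 are filled by the first alternative, 3-4 by the second.
        number, chunk = (a, b) if a else (c, d)
        # Assumption: the first line of the chunk is the worded heading.
        title, _, body = chunk.partition("\n")
        yield number, title.strip(), " ".join(body.split())


# Made-up sample text in the same shape as the question's input.
sample = (
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "When applicable, the MANUFACTURER shall address the RISKS.\n"
    "13.1 Specific HAZARDOUS SITUATIONS\n"
    "Compliance is checked by inspection.\n"
)

result = list(sections(sample))
for row in result:
    print(row)
```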
Upvotes: 1
Reputation: 163207
You could use your pattern and match a space after it followed by the rest of the line.
Then repeat matching all following lines that do not start with a heading.
^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*
^\d+(?:\.\d+)*  Your pattern to match a heading, followed by a space
.*  Match any char except a newline 0+ times
(?:  Non capturing group
\r?\n  Match a newline
(?!  Negative lookahead, assert what is directly to the right is not
\d+(?:\.\d+)*  The heading pattern, followed by a space
)  Close lookahead
.*  Match any char except a newline 0+ times
)*  Close the non capturing group and repeat 0+ times to match all the lines
Upvotes: 1
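If the number, the worded heading and the body are needed separately, one option is to add capture groups to the same pattern. This is a sketch of my own variant rather than the pattern exactly as posted, and the sample text is made up. It needs re.M so that ^ matches at each line start.

```python
import re

# Same structure as the pattern above, with three capture groups added:
# 1 = heading number, 2 = rest of the heading line, 3 = following lines.
SECTION_RE = re.compile(
    r'^(\d+(?:\.\d+)*) (.*)((?:\r?\n(?!\d+(?:\.\d+)* ).*)*)',
    re.M,
)

# Made-up sample; the third line starts with "4.7," which is not followed by
# a space, so the lookahead does not treat it as a new heading.
sample = (
    "13.1 Specific HAZARDOUS SITUATIONS\n"
    "The SINGLE FAULT CONDITIONS listed in\n"
    "4.7, 8.1 b) and 13.2.2 apply.\n"
    "13.2 SINGLE FAULT CONDITIONS\n"
    "Compliance is checked by inspection.\n"
)

parsed = [
    (m.group(1), m.group(2).strip(), " ".join(m.group(3).split()))
    for m in SECTION_RE.finditer(sample)
]
for number, title, body in parsed:
    print(number, "|", title, "|", body)
```

A wrapped line that begins with a number followed by a space (for example "12.4.5 and ...") would still be taken as a heading by this pattern, so the input may need a further check for such cases.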