Reputation: 464
I have a list of strings which looks like:
string = ['1.7 DELIVERY, STORAGE AND HANDLING \n', ' \n', 'A. Delivery and Acceptance Requirements: \n', '1. Do not deliver items to the site, until all specified submittals have been \n', 'submitted to, and approved by, the Architect. \n', '2. Deliver materials in original packages, containers or bundles bearing brand name \n', 'and identification of manufacturer or supplier. \n', ' \n', 'B. Storage and Handling Requirements: \n', "1. Store and handle materials following manufacturer's recommended procedures, \n", 'and in accordance with material safety data sheets. \n', '2. Protect materials from damage due to moisture, direct sunlight, excessive \n', 'temperatures, surface contamination, corrosion and damage from \n', 'construction operations and other causes. \n', ' \n', 'C. Damaged material: Remove any damaged or contaminated materials from job site \n', 'immediately, including materials in packages containing water marks, or show \n', 'evidence of mold. \n', ' \n']
I want to extract the sections labelled with letters (A-Z) and their corresponding sub-sections labelled with numbers (which can range from 1 to 20). I have written a script that finds the sections:
import re

regex = r"\b([A-Z]\s*\.\s*)\b"
for index, new_string in enumerate(string):
    match = re.search(regex, new_string)
    if match:
        print(index)
The problem is that I'm also getting unwanted matches inside a section. For example, the string below starts section 'A', but 'B' is picked up as a section as well.
"A. General: Notify the Architect B. where conflicts apply between referenced standards and existing materials, and existing methods of construction. \n"
I want the output as a dictionary with sections as keys and lists of sub-sections as values. I also want to join sections and sub-sections that get carried over to the next string because of the OCR output. The '\n' elements in the list have no significance: sometimes there are many of them, sometimes none. So I want the regex to treat only letters as sections and only numbers as sub-sections.
Example output -
{
'A. Delivery and Acceptance Requirements: ' : ["1. Do not deliver items to the site, until all specified submittals have been submitted to, and approved by, the Architect. \n","2. Deliver materials in original packages, containers or bundles bearing brand name and identification of manufacturer or supplier."]
'B. Storage and Handling Requirements: ' : ["1. Store and handle materials following manufacturer's recommended procedures, and in accordance with material safety data sheets. ", and so on..]
}
Upvotes: 1
Views: 230
Reputation: 627087
You can use
import json
import re

string_list = ['1.7 DELIVERY, STORAGE AND HANDLING \n', ' \n', 'A. Delivery and Acceptance Requirements: \n', '1. Do not deliver items to the site, until all specified submittals have been \n', 'submitted to, and approved by, the Architect. \n', '2. Deliver materials in original packages, containers or bundles bearing brand name \n', 'and identification of manufacturer or supplier. \n', ' \n', 'B. Storage and Handling Requirements: \n', "1. Store and handle materials following manufacturer's recommended procedures, \n", 'and in accordance with material safety data sheets. \n', '2. Protect materials from damage due to moisture, direct sunlight, excessive \n', 'temperatures, surface contamination, corrosion and damage from \n', 'construction operations and other causes. \n', ' \n', 'C. Damaged material: Remove any damaged or contaminated materials from job site \n', 'immediately, including materials in packages containing water marks, or show \n', 'evidence of mold. \n', ' \n']
section_found = False
result = {}
items = []
current_section = ''
for s in string_list:
    if not s.strip():
        continue  # skip whitespace-only elements
    if re.match(r'[A-Z]\s*\.\s*\b', s):
        # A new section starts: store the items of the previous one
        if items:
            result[current_section] = items
            items = []
        current_section = s.rstrip()
        section_found = True
    elif section_found and not re.match(r'\d+\s*\.', s):
        # Continuation line: join it to the last subsection
        if items:
            items[-1] = f'{items[-1].rstrip()} {s}'
    elif section_found:
        items.append(s)  # a new numbered subsection
# Store the items of the last section, if it has any
if items:
    result[current_section] = items

print(json.dumps(result, sort_keys=True, indent=4))
See the Python demo.
Output:
{
"A. Delivery and Acceptance Requirements:": [
"1. Do not deliver items to the site, until all specified submittals have been submitted to, and approved by, the Architect. \n",
"2. Deliver materials in original packages, containers or bundles bearing brand name and identification of manufacturer or supplier. \n"
],
"B. Storage and Handling Requirements:": [
"1. Store and handle materials following manufacturer's recommended procedures, and in accordance with material safety data sheets. \n",
"2. Protect materials from damage due to moisture, direct sunlight, excessive temperatures, surface contamination, corrosion and damage from construction operations and other causes. \n"
]
}
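Here,
re.match(r'[A-Z]\s*\.\s*\b', s) checks if the item is a section, and
re.match(r'\d+\s*\.', s) checks if the item is a subsection item.
Since re.match only matches at the start of the string, a letter followed by a dot in the middle of a line (like the "B." in your example) is not treated as a section.
If a section string is found, the items collected so far (if any) are stored in the dictionary under the previous section. If a section has already been found and the line does not match the subsection pattern, it is appended to the rstripped last item in items. Otherwise, the line is added to the items list as a new subsection.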
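For reuse, the same logic can be sketched as a function. This is only a variation under a few assumptions, not the code above: the name group_sections and the two compiled pattern names are made up for illustration, every line is stripped (so the trailing " \n" is dropped), sections without numbered items are kept with an empty list, and a line that follows a section header before any numbered item is treated as a continuation of that header.

```python
import re

# Same patterns as above, compiled once (the names are illustrative).
SECTION_RE = re.compile(r'[A-Z]\s*\.\s*\b')   # e.g. "A. Delivery ..."
SUBSECTION_RE = re.compile(r'\d+\s*\.')       # e.g. "1. Do not ..."


def group_sections(lines):
    """Group OCR lines into {section: [subsections]} (hypothetical helper)."""
    result = {}
    current = None  # key of the section currently being collected
    for line in lines:
        text = line.strip()
        if not text:
            continue  # whitespace-only elements carry no information
        if SECTION_RE.match(text):  # match() anchors at the start of the line
            current = text
            result[current] = []
        elif current is None:
            continue  # ignore everything before the first section
        elif SUBSECTION_RE.match(text):
            result[current].append(text)
        elif result[current]:
            # continuation of the last numbered subsection
            result[current][-1] += ' ' + text
        else:
            # assumption: continuation of the section header itself
            items = result.pop(current)
            current = f'{current} {text}'
            result[current] = items
    return result


sample = [
    'A. General: Notify the Architect B. where conflicts apply \n',
    '1. First item \n',
    'continued here. \n',
    ' \n',
    'B. Storage: \n',
    '1. Store materials. \n',
]
grouped = group_sections(sample)
# The inline "B." in the first line does not start a new section,
# because the section pattern is only tried at the start of each line.
print(list(grouped))
```

Whether empty sections and joined headers are wanted depends on the data; both branches are easy to drop if the original behaviour is preferred.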
Upvotes: 1