Raghav Gupta

Reputation: 464

Regex for headings and sub headings in Python to get structured output

I have a list of strings that looks like this:

string = ['1.7  DELIVERY, STORAGE AND HANDLING \n', ' \n', 'A.  Delivery and Acceptance Requirements: \n', '1.  Do not deliver items to the site, until all specified submittals have been \n', 'submitted to, and approved by, the Architect. \n', '2.  Deliver materials in original packages, containers or bundles bearing brand name \n', 'and identification of manufacturer or supplier. \n', ' \n', 'B.  Storage and Handling Requirements: \n', "1.  Store and handle materials following manufacturer's recommended procedures, \n", 'and in accordance with material safety data sheets. \n', '2.  Protect materials from damage due to moisture, direct sunlight, excessive \n', 'temperatures, surface contamination, corrosion and damage from \n', 'construction operations and other causes. \n', ' \n', 'C.  Damaged material: Remove any damaged or contaminated materials from job site \n', 'immediately, including materials in packages containing water marks, or show \n', 'evidence of mold. \n', ' \n']

I want to extract the sections labelled with letters (A-Z) and their corresponding sub-sections labelled with numbers (which can range from 1 to 20). I have written a script that finds the sections:

import re

# re.search scans the whole line, so a letter-dot pair anywhere in it counts as a match
regex=r"\b([A-Z]\s*\.\s*)\b"
for index,new_string in enumerate(string):
    match=re.search(regex, new_string)
    if match:
        print(index)

The problem is that I also get unwanted matches inside a section. For example, the string below starts section 'A', but the 'B.' in the middle of it is matched as a section as well.

"A. General: Notify the Architect B. where conflicts apply between referenced standards  and existing materials, and existing methods of construction. \n"

I want the output as a dictionary with the sections as keys and lists of sub-sections as values. I also want to join the lines of each section and sub-section, since the OCR output often wraps them onto the next string. The elements of the list that contain only '\n' have no significance: sometimes there are many of them, sometimes none. So the regex should identify sections by letters and sub-sections by numbers only.

Example output -

{
'A.  Delivery and Acceptance Requirements: ' : ["1.  Do not deliver items to the site, until all specified submittals have been submitted to, and approved by, the Architect. \n","2.  Deliver materials in original packages, containers or bundles bearing brand name and identification of manufacturer or supplier."]

'B.  Storage and Handling Requirements: ' : ["1.  Store and handle materials following manufacturer's recommended procedures, and in accordance with material safety data sheets. ", and so on..]

}

Upvotes: 1

Views: 230

Answers (1)

Wiktor Stribiżew

Reputation: 627087

You can use

import re
string_list = ['1.7  DELIVERY, STORAGE AND HANDLING \n', ' \n', 'A.  Delivery and Acceptance Requirements: \n', '1.  Do not deliver items to the site, until all specified submittals have been \n', 'submitted to, and approved by, the Architect. \n', '2.  Deliver materials in original packages, containers or bundles bearing brand name \n', 'and identification of manufacturer or supplier. \n', ' \n', 'B.  Storage and Handling Requirements: \n', "1.  Store and handle materials following manufacturer's recommended procedures, \n", 'and in accordance with material safety data sheets. \n', '2.  Protect materials from damage due to moisture, direct sunlight, excessive \n', 'temperatures, surface contamination, corrosion and damage from \n', 'construction operations and other causes. \n', ' \n', 'C.  Damaged material: Remove any damaged or contaminated materials from job site \n', 'immediately, including materials in packages containing water marks, or show \n', 'evidence of mold. \n', ' \n']
section_found = False
result = {}
items = []
current_section = ''
for s in string_list:
    if not s.strip():
        continue
    if re.match(r'[A-Z]\s*\.\s*\b', s):
        if items:
            result[current_section] = items
            items = []
        current_section = s.rstrip()
        section_found = True
    elif section_found and not re.match(r'\d+\s*\.', s):
        if items:
            items[-1] = f'{items[-1].rstrip()} {s}'
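    elif section_found:
        items.append(s)
# Store the items of the last section, which no later section line flushes
if items:
    result[current_section] = items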
import json
print(json.dumps(result, sort_keys=True, indent=4))

See the Python demo.

Output:

{
    "A.  Delivery and Acceptance Requirements:": [
        "1.  Do not deliver items to the site, until all specified submittals have been submitted to, and approved by, the Architect. \n",
        "2.  Deliver materials in original packages, containers or bundles bearing brand name and identification of manufacturer or supplier. \n"
    ],
    "B.  Storage and Handling Requirements:": [
        "1.  Store and handle materials following manufacturer's recommended procedures, and in accordance with material safety data sheets. \n",
        "2.  Protect materials from damage due to moisture, direct sunlight, excessive temperatures, surface contamination, corrosion and damage from construction operations and other causes. \n"
    ]
}

Here,

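  • if re.match(r'[A-Z]\s*\.\s*\b', s): checks whether the item starts a section, i.e. begins with a capital letter followed by a dot
  • re.match(r'\d+\s*\.', s) checks whether the item starts a subsection, i.e. begins with a number followed by a dot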
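Both checks rely on re.match only matching at the start of the string, whereas re.search scans the whole line. A small illustration of the difference, using the problem line from the question (the variable names are made up for this sketch, and re.findall stands in for a scan of the whole line):

```python
import re

line = ("A. General: Notify the Architect B. where conflicts apply between "
        "referenced standards  and existing materials, and existing methods "
        "of construction. \n")

section_re = re.compile(r'[A-Z]\s*\.\s*\b')

# Scanning the whole line also picks up the stray "B. " in the middle
all_hits = section_re.findall(line)

# re.match is anchored at the start, so only the leading "A. " counts
start_hit = section_re.match(line)

print(all_hits)            # ['A. ', 'B. ']
print(start_hit.group())   # A. 
```

With the anchored check, a line counts as a section only when the letter and dot are its first characters.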

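When a section line is found, the items collected so far (if any) are stored in the dictionary under the previous section and the list is reset. When a section has already been found and the line does not match the subsection pattern, it is treated as a wrapped line and appended to the last item, after stripping that item's trailing whitespace. Otherwise, the line is added to the items list as a new subsection. Note that a section without numbered subsections, such as C here, gets no dictionary entry.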
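If sections without numbered items should be kept as well, one possible variant is sketched below. It is not the code above: the helper name parse_sections and the two pattern names are hypothetical, each line is stripped, and joining wrapped lines onto an item-less section heading is an assumption about the desired output.

```python
import re

SECTION_RE = re.compile(r'[A-Z]\s*\.\s*\b')   # e.g. "A.  Delivery ..."
SUBSECTION_RE = re.compile(r'\d+\s*\.')       # e.g. "1.  Do not ..."

def parse_sections(lines):
    """Group numbered subsections under lettered sections (hypothetical helper)."""
    result = {}
    current = None  # key of the section currently being filled
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # blank OCR lines carry no information
        if SECTION_RE.match(text):
            current = text
            result[current] = []
        elif current is None:
            continue  # text before the first section, e.g. "1.7  DELIVERY ..."
        elif SUBSECTION_RE.match(text):
            result[current].append(text)
        elif result[current]:
            result[current][-1] += ' ' + text  # wrapped subsection line
        else:
            # Wrapped section line (assumed): re-key the still empty section
            del result[current]
            current = f'{current} {text}'
            result[current] = []
    return result

sample = [
    '1.7  DELIVERY, STORAGE AND HANDLING \n', ' \n',
    'A.  Delivery and Acceptance Requirements: \n',
    '1.  Do not deliver items to the site, until all specified submittals have been \n',
    'submitted to, and approved by, the Architect. \n',
    'C.  Damaged material: Remove any damaged or contaminated materials from job site \n',
    'immediately, including materials in packages containing water marks, or show \n',
    'evidence of mold. \n', ' \n',
]
parsed = parse_sections(sample)
```

Here section C ends up as a single long key with an empty list. A wrapped line that happens to begin with a capital letter and a dot, or with a number and a dot, would still be misread as a new section or subsection, so this remains a heuristic.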

Upvotes: 1
