Reputation: 874
What I have: an itemized list I got from scraping a PDF. However, some items are incorrectly split across adjacent elements of the list:
A = ["1. 100 Test.1; 200 Test.2; 300 ",
"Test.3; 400 Test.4",
"2. 500 Test.5; 600 Test.6;",
"3. 700 Test.7; 800 Test.8; ",
"900 Test.9; 1000 Test.10"]
What I need: a list in which every element starts with an item number (1., 2., 3., etc.), with each of the other elements appended to the preceding element:
B = ["1. 100 Test.1; 200 Test.2; 300 Test.3; 400 Test.4",
"2. 500 Test.5; 600 Test.6;",
"3. 700 Test.7; 800 Test.8; 900 Test.9; 1000 Test.10"]
What I've tried: I'm hoping for a way to identify elements of the list that have the format "X.X", but I haven't had much luck. I did write a loop that checks whether an element starts with an integer, but that doesn't help in cases like the last element of list A, which starts with an integer (900) without being a new item. Any help is appreciated.
Upvotes: 1
Views: 58
Reputation: 146
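This solution joins the list into a single string, then uses re.split() with a capture group to split on the item-number pattern: a digit, a period, and whitespace (for example "1. "). As written, the pattern matches a single digit, so it handles item numbers 1 through 9 only.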
import re
import pprint
A = ["1. 100 Test.1; 200 Test.2; 300 ",
"Test.3; 400 Test.4",
"2. 500 Test.5; 600 Test.6;",
"3. 700 Test.7; 800 Test.8; ",
"900 Test.9; 1000 Test.10"]
# Combine list into a single string
text = "".join(A)
# Split the string into list elements based on desired pattern
lines = re.split(r'(\d\.\s)', text)
# Remove any blank lines
lines = [l for l in lines if l.strip()]
# Combine the line numbers and their matching strings back together
numbered_lines = []
for i in range(0, len(lines), 2):
    numbered_lines.append(lines[i] + lines[i+1])
# Print the results
pprint.pprint(numbered_lines)
Output:
❯ python main.py
['1. 100 Test.1; 200 Test.2; 300 Test.3; 400 Test.4',
'2. 500 Test.5; 600 Test.6;',
'3. 700 Test.7; 800 Test.8; 900 Test.9; 1000 Test.10']
Update: Added capture group to regex in order to keep line numbers
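As an alternative, here is a minimal sketch that skips the join-and-split step and merges the list in a single pass. It assumes that a new item always begins with an integer, a period, and whitespace (such as "1. " or "12. "), and that no continuation fragment begins that way. The names ITEM_START and merge_continuations are hypothetical, chosen only for this example.

```python
import re

# Assumption: a new item starts with an integer, a period, and whitespace.
ITEM_START = re.compile(r"\d+\.\s")


def merge_continuations(fragments):
    """Append fragments that do not start a new item to the preceding item."""
    merged = []
    for fragment in fragments:
        # re.match anchors at the start of the string, so "900 Test.9"
        # is not treated as an item: "900" is not followed by a period.
        if ITEM_START.match(fragment) or not merged:
            merged.append(fragment)
        else:
            merged[-1] += fragment
    return merged


A = ["1. 100 Test.1; 200 Test.2; 300 ",
     "Test.3; 400 Test.4",
     "2. 500 Test.5; 600 Test.6;",
     "3. 700 Test.7; 800 Test.8; ",
     "900 Test.9; 1000 Test.10"]

B = merge_continuations(A)
```

Because the pattern uses \d+ and is only tested at the start of each element, this version should also cope with item numbers of 10 and above. It would still misclassify a continuation fragment that happens to begin with something like "5. ".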
Upvotes: 1