Reputation: 5958
I have a block of HTML code in a very large file, and I'm running a regex pattern against it. I'm looking for all text with ba and sqft. However, some entries do not have the sqft, and I'd like to capture those as 0. Also, in my code the pattern captures the ba, but in the demo it doesn't.
REGEX ([0-9]+) ba<[\S\s]?>(.) sqft
demo: https://regex101.com/r/Nr6lEo/1
desire: '2, 1,596', '4, 0', '2, 2,376'
actual: '2, 1,596', '2, 2,376'
Upvotes: 2
Views: 42
Reputation: 626748
If you have access only to the corrupt HTML like the one you showed in the regex demo, you can use
\b(\d+)\s*ba<(?:\S*(?:\s(?!ba<|sqft<)\S*)*>(\d[,.\d]*)\s*sqft\b)?
See the regex demo. Note that you will need some code to assign the default 0 when the sqft part is missing, because the regex itself cannot set default values; it only returns the text that exists in the string.
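To illustrate what "cannot set default values" means in practice, here is a minimal sketch on two shortened snippets. The sample strings are made up for illustration and are not taken from the file. With re.findall, an optional group that did not take part in the match comes back as an empty string, which the calling code can then replace with 0.

```python
import re

pattern = re.compile(
    r"\b(\d+)\s*ba<(?:\S*(?:\s(?!ba<|sqft<)\S*)*>(\d[,.\d]*)\s*sqft\b)?"
)

# Hypothetical, shortened samples (not from the original file)
with_sqft = "<span>2 ba</span><span>1,596 sqft</span>"
without_sqft = "<span>4 ba</span></p>"

found_with = pattern.findall(with_sqft)
found_without = pattern.findall(without_sqft)

print(found_with)     # [('2', '1,596')]
print(found_without)  # [('4', '')]  <- empty string, not "0"
```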
The regex is this long to avoid too much backtracking (since the file is long). It matches
\b - a word boundary
(\d+) - Group 1: one or more digits
\s* - zero or more whitespaces
ba< - a ba< string
(?:\S*(?:\s(?!ba<|sqft<)\S*)*>(\d[,.\d]*)\s*sqft\b)? - an optional sequence of:
    \S* - zero or more non-whitespaces
    (?:\s(?!ba<|sqft<)\S*)* - zero or more sequences of a whitespace that is not immediately followed with ba< or sqft<, and then any zero or more non-whitespace chars
    > - a > char
    (\d[,.\d]*) - Group 2: a digit and then zero or more digits, . or , chars
    \s*sqft\b - zero or more whitespaces, sqft and a word boundary.
So, the Python code might look like
import re
html = """1.45;" >1234 St1 St</p> <p class=3D"highlight-address" styl=\ne=3D"margin: 0; font-family: 'Montserrat', sans-serif; text-decoration: non=\ne; color: #323232; font-weight: 500; font-size: 13px; line-height: 1.45;" =\n>City, ST 12345</p><p class=3D"highlight-specs" style=3D"margin: 0; fo=\nnt-family: 'Montserrat', sans-serif; text-decoration: none; color: #323232;=\n font-weight: 500; font-size: 13px; line-height: 1.45;"><span style=3D"disp=\nlay: inline-block;">4 bd</span><span style=3D"display: inline-block;"> =\n;=E2=80=A2 </span><span style=3D"display: inline-block;">2 ba</span><s=\npan style=3D"display: inline-block;"> =E2=80=A2 </span><span styl=\ne=3D"display: inline-block;">1,596 sqft</span></p><p class=3D"highlight-spe=\ncs" style=3D"margin: 0; font-family: 'Montserrat', sans-serif; textdecorati=\n\n1.45;" >5678 St2 Rd</p> <p class=3D"highlight-address" sty=\nle=3D"margin: 0; font-family: 'Montserrat', sans-serif; text-decoration: no=\nne; color: #323232; font-weight: 500; font-size: 13px; line-height: 1.45;" =\n >City2, ST 6789</p><p class=3D"highlight-specs" style=3D"margin: 0=\n; font-family: 'Montserrat', sans-serif; text-decoration: none; color: #323=\n232; font-weight: 500; font-size: 13px; line-height: 1.45;"><span style=3D"=\ndisplay: inline-block;">4 bd</span><span style=3D"display: inline-block;">&=\nnbsp;=E2=80=A2 </span><span style=3D"display: inline-block;">4 ba</spa=\nn></p><p class=3D"highlight-specs" style=3D"margin: 0; font-family: 'Montse=\nrrat', sans-serif; textdecoration: none; color: #323232; font-weight: 500; =\n\n1.45;" >91011 St3 Rd</p> <p class=3D"highlight-address" style=\n=3D"margin: 0; font-family: 'Montserrat', sans-serif; text-decoration: none=\n; color: #323232; font-weight: 500; font-size: 13px; line-height: 1.45;" >=\nCity2, ST 10111</p><p class=3D"highlight-specs" style=3D"margi=\nn: 0; font-family: 'Montserrat', sans-serif; text-decoration: none; color: =\n#323232; font-weight: 500; font-size: 13px; line-height: 
1.45;"><span style=\n=3D"display: inline-block;">4 bd</span><span style=3D"display: inline-block=\n;"> =E2=80=A2 </span><span style=3D"display: inline-block;">2 ba<=\n/span><span style=3D"display: inline-block;"> =E2=80=A2 </span><s=\npan style=3D"display: inline-block;">2,376 sqft</span></p><p class=3D"highl=\night-specs" style=3D"margin: 0; font-family: 'Montserrat', sans-serif; text="""
pattern = r"\b(\d+)\s*ba<(?:\S*(?:\s(?!ba<|sqft<)\S*)*>(\d[,.\d]*)\s*sqft\b)?"
matches = re.findall(pattern, html)
for ba, sqft in matches:
    # findall yields '' for Group 2 when the optional sqft part is absent
    if not sqft:
        sqft = "0"
    print(f"{ba}, {sqft}")
Output:
2, 1,596
4, 0
2, 2,376
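If numbers are needed rather than strings, the same pattern can feed a small helper. This is only a sketch: the function name extract_specs and the sample string are hypothetical, and it assumes commas are thousands separators and that no sqft value contains a dot (int() would raise ValueError on one).

```python
import re

PATTERN = re.compile(
    r"\b(\d+)\s*ba<(?:\S*(?:\s(?!ba<|sqft<)\S*)*>(\d[,.\d]*)\s*sqft\b)?"
)

def extract_specs(text):
    """Return (baths, sqft) integer pairs; sqft is 0 when it is absent."""
    specs = []
    for match in PATTERN.finditer(text):
        baths = int(match.group(1))
        raw_sqft = match.group(2)  # None when the optional group did not match
        # Assumption: commas are thousands separators, e.g. "1,596" -> 1596
        sqft = int(raw_sqft.replace(",", "")) if raw_sqft else 0
        specs.append((baths, sqft))
    return specs

# Hypothetical, shortened sample (not from the original file)
sample = (
    "<span>2 ba</span><span>1,596 sqft</span></p> "
    "<span>4 ba</span></p> "
    "<span>2 ba</span><span>2,376 sqft</span>"
)
print(extract_specs(sample))  # [(2, 1596), (4, 0), (2, 2376)]
```

Note that Match.group(2) is None for a missing optional group, whereas re.findall reports it as an empty string; the truthiness check covers both.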
Upvotes: 1