Reputation: 536
As part of a project at my company, I need to extract IP addresses from some websites, but only those that have no subnet suffix (so an entry such as 196.82.1.12/24 should be skipped).
If an address has a subnet suffix, I don't want to grab the part preceding the suffix; I want to skip the address entirely.
For example, in the following case:
<td>212.179.35.154</td>
<td>200.139.97.126/24</td>
<td>"201.139.97.126"</td>
<td>F5 BIG-IP</td>
<td>unknown</td>
<td class="date">26-Feb-2011</td>
The desired output would be:
212.179.35.154
201.139.97.126
Please note that some lines have quotes around the IP address; since no /NUMBER follows, those addresses are still valid.
I have been trying for days to find an appropriate regex, with attempts such as:
(<td>(\d+\.){3}\d+<\/td>)
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}[^\/]
However, each of them seems to have a flaw.
Thanks in advance!
Upvotes: 1
Views: 72
Reputation: 36680
To me this looks like a task where a negative lookahead is useful. I would do:
import re
txt = '''<td>212.179.35.154</td>
<td>200.139.97.126/24</td>
<td>"201.139.97.126"</td>
<td>F5 BIG-IP</td>
<td>unknown</td>
<td class="date">26-Feb-2011</td>'''
pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(?![0-9/])"
found = re.findall(pattern, txt)
print(found)
Output:
['212.179.35.154', '201.139.97.126']
By using the negative lookahead (?![0-9/]) we say: exclude matches that are followed by a digit (0 to 9) or by /. Note that including digits is crucial here, because if you specify only /, the regex backtracks by one digit and one of the matches would be 200.139.97.12 (note the missing 6 at the end).
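The effect of leaving the digits out of the lookahead can be checked directly. This is a minimal sketch on a shortened version of the sample; the names octets, slash_only and digits_and_slash are my own and only serve the illustration:

```python
import re

txt = '''<td>212.179.35.154</td>
<td>200.139.97.126/24</td>
<td>"201.139.97.126"</td>'''

# Four dot-separated groups of one to three digits (no range check on the octets).
octets = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

# Lookahead rejects only "/": the last octet can backtrack from 126 to 12.
slash_only = re.findall(octets + r"(?!/)", txt)

# Lookahead rejects digits as well, so a shortened last octet is refused too.
digits_and_slash = re.findall(octets + r"(?![0-9/])", txt)

print(slash_only)        # ['212.179.35.154', '200.139.97.12', '201.139.97.126']
print(digits_and_slash)  # ['212.179.35.154', '201.139.97.126']
```

The first list contains the truncated 200.139.97.12, which is exactly the false match the digit class in the lookahead prevents.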
Upvotes: 2
Reputation: 13413
You can use a negative lookahead assertion, written with the pattern syntax (?!...), like this:
import re
s = """
<td>212.179.35.154</td>
<td>200.139.97.126/24</td>
<td>"201.139.97.126"</td>
<td>F5 BIG-IP</td>
<td>unknown</td>
<td class="date">26-Feb-2011</td>
"""
pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(?!\d*\/)"
print(re.findall(pattern, s))
Output:
['212.179.35.154', '201.139.97.126']
The (?!\d*\/) part tells it "don't match the previous pattern if it is followed by any digits and a forward slash".
(The \d* part is needed because otherwise it would match 200.139.97.12, without the 6, out of 200.139.97.126/24.)
Small note: your original pattern matches more than just valid IP addresses (for example, it accepts 999.999.999.999), but I kept your approach.
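If invalid octets matter for your data, one option is to filter the regex matches through the standard library's ipaddress module. This is only a sketch under that assumption: extract_valid_ipv4 is a hypothetical helper name, and the sample string (including 999.1.1.1) is made up for illustration:

```python
import ipaddress
import re

# Same pattern as above: four digit groups not followed by digits and a slash.
PATTERN = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(?!\d*/)")


def extract_valid_ipv4(text):
    """Return regex matches that also parse as real IPv4 addresses."""
    valid = []
    for candidate in PATTERN.findall(text):
        try:
            # Raises a ValueError subclass for out-of-range octets such as 999.
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        valid.append(candidate)
    return valid


sample = "<td>212.179.35.154</td><td>999.1.1.1</td><td>200.139.97.126/24</td>"
print(extract_valid_ipv4(sample))  # ['212.179.35.154']
```

This keeps the regex simple and leaves the range check to ipaddress, rather than encoding the 0 to 255 limit in the pattern itself.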
Upvotes: 1