Reputation: 2411
I have a regex to parse folder and file names out of a block of HTML code, and I want it to exclude filenames with the extension .ini.
My current regex: /href="([\w]+)(\.[\w]+)*/ig
It matches href=" followed by 1+ word characters, then zero or more groups of a literal dot and then 1+ word characters.
I have tried to use negative lookahead (what I think is the proper solution) time and time again to remove a match if it has the extension .ini. Sadly, I have failed my mission, and here I am. I chose not to include my attempts above because they would just pollute the question.
From reading all over the internet:
To restate: I want to check whether a match has the extension .ini, and then if it does, exclude all groups from that match.
I could figure out how to ignore just the .ini group, but could not figure out how to get the regex to ignore all groups. Can you please help me figure out the proper regex?
Sample Input String
A sample block of HTML code that I test the regex with.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>Index of /images/AAVS</title>
</head>
<body>
<h1>Index of /images/AAVS</h1>
<table>
<tr><th valign="top"><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>
<tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[PARENTDIR]"></td><td><a href="/images/">Parent Directory</a> </td><td> </td><td align="right"> - </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="20190823/">20190823/</a> </td><td align="right">2019-09-19 19:37 </td><td align="right"> - </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="20190826/">20190826/</a> </td><td align="right">2019-09-19 19:31 </td><td align="right"> - </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="desktop.ini">desktop.ini</a> </td><td align="right">2019-09-19 19:24 </td><td align="right">136 </td><td> </td></tr>
<tr><th colspan="5"><hr></th></tr>
</table>
</body></html>
Also, I would like to say that I am sure there is a much better approach. All critique is welcome!
Upvotes: 1
Views: 1168
Reputation: 44128
The regex is (?<=href=")[^"]+(?<!\.ini)(?=")
(?<=href=")
Positive lookbehind of href="
[^"]+
Match as many non-double-quote characters as you can
(?<!\.ini)
Negative lookbehind of .ini
(?=")
Positive lookahead of a double quote
The code:
import re
html = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>Index of /images/AAVS</title>
</head>
<body>
<h1>Index of /images/AAVS</h1>
<table>
<tr><th valign="top"><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>
<tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[PARENTDIR]"></td><td><a href="/images/">Parent Directory</a> </td><td> </td><td align="right"> - </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="20190823/">20190823/</a> </td><td align="right">2019-09-19 19:37 </td><td align="right"> - </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="20190826/">20190826/</a> </td><td align="right">2019-09-19 19:31 </td><td align="right"> - </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="desktop.ini">desktop.ini</a> </td><td align="right">2019-09-19 19:24 </td><td align="right">136 </td><td> </td></tr>
<tr><th colspan="5"><hr></th></tr>
</table>
</body></html>"""
l = re.findall(r'(?<=href=")[^"]+(?<!\.ini)(?=")', html, flags=re.I)
print(l)
Prints:
['?C=N;O=D', '?C=M;O=A', '?C=S;O=A', '?C=D;O=A', '/images/', '20190823/', '20190826/']
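To see why desktop.ini drops out, it can help to run the same pattern against one attribute at a time. The following is only a sketch: the sample snippets and the names HREF_NOT_INI, samples and results are made up for illustration. The negative lookbehind is checked at the position just before the closing quote, so only the final four characters of the value matter, and re.I makes that check case-insensitive as well.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
HREF_NOT_INI = re.compile(r'(?<=href=")[^"]+(?<!\.ini)(?=")', flags=re.I)

# Hypothetical one-attribute snippets, used only for illustration.
samples = [
    '<a href="20190823/">',
    '<a href="desktop.ini">',
    '<a href="DESKTOP.INI">',    # rejected too, because of re.I
    '<a href="notes.ini.bak">',  # kept: the value ends in .bak, not .ini
]

# Map each snippet to the list of href values the pattern accepts.
results = {snippet: HREF_NOT_INI.findall(snippet) for snippet in samples}
for snippet, found in results.items():
    print(snippet, "->", found)
```

Backtracking cannot rescue desktop.ini: any shorter match for [^"]+ ends before the closing quote, so the final (?=") lookahead fails.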
The above regex will accept any href value, which is why it returned values such as '?C=N;O=D'. If you wish to restrict it to values that make up the file and folder names you are specifically looking for, you might use a more restrictive regex such as:
(?<=href=")[a-z0-9_./-]+(?<!\.ini)(?=")
This will result in printing:
['/images/', '20190823/', '20190826/']
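As a runnable sketch of the restrictive pattern, the snippet below applies it to a condensed listing. The listing string, the extra Photo-01.JPG entry and the names NAME_NOT_INI and names are assumptions added for illustration; they are not part of the original sample.

```python
import re

# Restrictive variant from the answer; re.I lets [a-z] cover upper case too.
NAME_NOT_INI = re.compile(r'(?<=href=")[a-z0-9_./-]+(?<!\.ini)(?=")', flags=re.I)

# Hypothetical condensed listing, modelled on the sample HTML.
listing = (
    '<a href="?C=N;O=D">Name</a>'
    '<a href="/images/">Parent Directory</a>'
    '<a href="20190823/">20190823/</a>'
    '<a href="desktop.ini">desktop.ini</a>'
    '<a href="Photo-01.JPG">Photo-01.JPG</a>'
)

# The sort link is skipped because '?' is not in the character class.
names = NAME_NOT_INI.findall(listing)
print(names)
```

The character class is a whitelist, so any name containing a character outside it (a space or a percent-encoded sequence, for example) would be silently dropped; widen the class if your listing contains such names.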
But, in fact, based on my research, ?C=N;O=D would be a legal filename in the Linux file system.
You can even accomplish the task without using lookbehind or lookahead:
# Capture every href value, then drop those ending in .ini (case-insensitive).
l = [m.group(1) for m in re.finditer(r'href="([^"]+)"', html, flags=re.I) if not m.group(1).lower().endswith(".ini")]
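Since the question invites other approaches, here is a sketch that avoids regular expressions altogether by using the standard library's html.parser module. This is a swapped-in technique, not something taken from the answer above; the HrefCollector class and the condensed listing string are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags, skipping targets that end in .ini."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None when an attribute is written without a value.
            if name == "href" and value and not value.lower().endswith(".ini"):
                self.hrefs.append(value)


# Hypothetical condensed listing, modelled on the sample HTML.
listing = (
    '<a href="?C=N;O=D">Name</a>'
    '<a href="/images/">Parent Directory</a>'
    '<a href="20190823/">20190823/</a>'
    '<a href="desktop.ini">desktop.ini</a>'
)

collector = HrefCollector()
collector.feed(listing)
print(collector.hrefs)
```

A parser-based version does not depend on the attribute being written exactly as href="..." with double quotes, at the cost of a few more lines.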
Upvotes: 1