Intrastellar Explorer

Reputation: 2411

Regex excluding all groups matched by negative lookahead

I have a regex to parse folder and file names out from a block of HTML code, and exclude filenames with the extension .ini.

My current regex: /href="([\w]+)(\.[\w]+)*/ig

  1. Matches group one: 1+ word characters
  2. Matches group two 0+ times: . then 1+ word characters
  3. Flags: i (case-insensitive) and g (global, i.e. find every match rather than only the first)
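For reference, here is a minimal Python sketch of what this pattern captures. The /.../ig literal above is JavaScript-style; re.findall with re.I is assumed here as a rough equivalent, and the sample string is a shortened, made-up excerpt of the listing further down.

```python
import re

# Shortened, hypothetical excerpt of the directory listing.
sample = (
    '<a href="/images/">Parent Directory</a> '
    '<a href="20190823/">20190823/</a> '
    '<a href="desktop.ini">desktop.ini</a>'
)

# Same pattern as above. With two capture groups, findall returns one
# (group 1, group 2) tuple per match; a group that did not take part
# in the match is reported as an empty string.
pattern = re.compile(r'href="([\w]+)(\.[\w]+)*', re.I)
matches = pattern.findall(sample)
print(matches)  # [('20190823', ''), ('desktop', '.ini')]
```

The desktop.ini entry is still matched, with .ini landing in group two, and that is the match I want dropped entirely.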

I have tried to use negative lookahead (what I think is the proper solution) time and time again to remove a match if it has the extension .ini. Sadly, I have failed my mission, and here I am. I chose not to include my attempts above because they would just pollute the question.


From reading all over the internet:

To restate:

I could figure out how to ignore just the .ini group, but could not figure out how to get the regex to ignore all groups. Can you please help me figure out the proper regex?


Sample Input String

A sample block of HTML code that I test the regex with.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /images/AAVS</title>
 </head>
 <body>
<h1>Index of /images/AAVS</h1>
  <table>
   <tr><th valign="top"><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>
   <tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[PARENTDIR]"></td><td><a href="/images/">Parent Directory</a>       </td><td>&nbsp;</td><td align="right">  - </td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="20190823/">20190823/</a>              </td><td align="right">2019-09-19 19:37  </td><td align="right">  - </td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="20190826/">20190826/</a>              </td><td align="right">2019-09-19 19:31  </td><td align="right">  - </td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="desktop.ini">desktop.ini</a>            </td><td align="right">2019-09-19 19:24  </td><td align="right">136 </td><td>&nbsp;</td></tr>
   <tr><th colspan="5"><hr></th></tr>
</table>
</body></html>

Also, I would like to say that I am sure there is a much better approach. All critique is welcome!

Upvotes: 1

Views: 1168

Answers (1)

Booboo

Reputation: 44128

The regex is (?<=href=")[^"]+(?<!\.ini)(?=")

  1. (?<=href=") Positive lookbehind of href="
  2. [^"]+ Match as many non-double quote characters as you can
  3. (?<!\.ini) Negative lookbehind of .ini
  4. (?=") Positive lookahead of a double quote
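The four parts above can also be written out with re.VERBOSE so that each one carries its own comment. This is only a sketch: the names HREF_NOT_INI, sample and found are made up for illustration, and the sample values are hypothetical.

```python
import re

# Same pattern, split across lines. In verbose mode whitespace is ignored
# and everything after an unescaped # is a comment.
HREF_NOT_INI = re.compile(
    r'''
    (?<=href=")   # positive lookbehind: preceded by href="
    [^"]+         # the value itself: one or more non-quote characters
    (?<!\.ini)    # negative lookbehind: the value must not end in .ini
    (?=")         # positive lookahead: followed by the closing quote
    ''',
    re.I | re.X,
)

# Hypothetical href values covering the interesting cases.
sample = 'href="20190823/" href="desktop.ini" href="SETUP.INI" href="notes.ini.bak"'
found = HREF_NOT_INI.findall(sample)
print(found)  # ['20190823/', 'notes.ini.bak']
```

Because the negative lookbehind sits directly before the closing-quote lookahead, it only tests the last four characters of the value. That is why notes.ini.bak is kept, while SETUP.INI is rejected thanks to the case-insensitive flag.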

The code:

import re

html = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /images/AAVS</title>
 </head>
 <body>
<h1>Index of /images/AAVS</h1>
  <table>
   <tr><th valign="top"><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>
   <tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[PARENTDIR]"></td><td><a href="/images/">Parent Directory</a>       </td><td>&nbsp;</td><td align="right">  - </td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="20190823/">20190823/</a>              </td><td align="right">2019-09-19 19:37  </td><td align="right">  - </td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="20190826/">20190826/</a>              </td><td align="right">2019-09-19 19:31  </td><td align="right">  - </td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td><td><a href="desktop.ini">desktop.ini</a>            </td><td align="right">2019-09-19 19:24  </td><td align="right">136 </td><td>&nbsp;</td></tr>
   <tr><th colspan="5"><hr></th></tr>
</table>
</body></html>"""

l = re.findall(r'(?<=href=")[^"]+(?<!\.ini)(?=")', html, flags=re.I)
print(l)

Prints:

['?C=N;O=D', '?C=M;O=A', '?C=S;O=A', '?C=D;O=A', '/images/', '20190823/', '20190826/']

The above regex accepts any href value, which is why it returned values such as '?C=N;O=D'. If you wish to restrict it to values that look like the file and folder names you are specifically after, you might use a more restrictive regex such as:

(?<=href=")[a-z0-9_./-]+(?<!\.ini)(?=")

This will result in printing:

['/images/', '20190823/', '20190826/']
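To see the two patterns side by side, here is a small self-contained sketch. The sample string is a shortened, hypothetical version of the listing, and the variable names are only for illustration.

```python
import re

# Shortened, hypothetical excerpt of the directory listing.
sample = (
    '<a href="?C=N;O=D">Name</a> '
    '<a href="/images/">Parent Directory</a> '
    '<a href="20190823/">20190823/</a> '
    '<a href="desktop.ini">desktop.ini</a>'
)

any_value = r'(?<=href=")[^"]+(?<!\.ini)(?=")'
restricted = r'(?<=href=")[a-z0-9_./-]+(?<!\.ini)(?=")'

loose = re.findall(any_value, sample, flags=re.I)
strict = re.findall(restricted, sample, flags=re.I)

print(loose)   # ['?C=N;O=D', '/images/', '20190823/']
print(strict)  # ['/images/', '20190823/']
```

The restrictive character class is a trade-off: it drops the sort links, but it would also drop any name containing a character outside the class, such as a space or a percent-escape like my%20file.txt.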

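Bear in mind, though, that ?C=N;O=D would be a legal filename in the Linux file system, where only / and the NUL byte are disallowed in a name, so the more restrictive regex could also skip real files.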

You can even accomplish the task without using lookbehind or lookahead:

l = [m.group(1) for m in re.finditer(r'href="([^"]+)"', html, flags=re.I) if not m.group(1).lower().endswith(".ini")]
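Since the question asked about negative lookahead specifically, here is a sketch of that variant as well. It is one possible way to write it, not the only one; the sample string and the names NOT_INI and names are hypothetical.

```python
import re

# Shortened, hypothetical excerpt of the directory listing.
sample = (
    '<a href="/images/">Parent Directory</a> '
    '<a href="20190823/">20190823/</a> '
    '<a href="desktop.ini">desktop.ini</a> '
    '<a href="Thumbs.INI">Thumbs.INI</a>'
)

# The lookahead is placed right after the opening quote, so it inspects the
# whole value up to the closing quote before anything is captured. If the
# value ends in .ini, the entire match is abandoned, not just one group.
NOT_INI = re.compile(r'href="(?![^"]*\.ini")([^"]+)"', re.I)

names = NOT_INI.findall(sample)
print(names)  # ['/images/', '20190823/']
```

The key point is where the lookahead sits: placing it before the capturing group rejects the whole href, whereas a lookahead placed after the name has been matched only prevents that one group from consuming the extension.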

Upvotes: 1
