Regex excluding all groups matched by negative lookahead

Question

I have a regex to parse folder and file names out from a block of HTML code, and exclude filenames with the extension .ini.

My current regex: /href="([\w]+)(\.[\w]+)*/ig

Matches group one: 1+ word characters
Matches group two 0+ times: . then 1+ word characters
Flags: case-insensitive (i) and global (g, find every match rather than only the first)
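To make the problem concrete, here is a minimal Python sketch of what the regex above captures. The two anchor tags are a hypothetical stand-in for the directory listing (they are not the real page source), and `re.findall` plays the role of the `g` flag.

```python
import re

# Hypothetical anchor tags standing in for the directory listing.
sample = '<a href="20190823/">20190823/</a> <a href="desktop.ini">desktop.ini</a>'

# Python equivalent of /href="([\w]+)(\.[\w]+)*/ig
pattern = re.compile(r'href="([\w]+)(\.[\w]+)*', re.IGNORECASE)

# With two capture groups, findall returns one (group1, group2) tuple per match.
groups = pattern.findall(sample)
print(groups)
```

In this sketch the desktop.ini link is still captured as a name plus an extension, which is the behaviour the question wants to get rid of.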

I have tried to use negative lookahead (what I think is the proper solution) time and time again to remove a match if it has the extension .ini. Sadly, I have failed my mission, and here I am. I chose not to include my attempts above because they would just pollute the question.

From reading all over the internet:

Negative Lookahead
Match strings not containing a string: https://www.regextester.com/15
Regular expression for excluding file types .exe and .js

To restate:

What I have is two groups.
What I think I should do is use negative lookahead to match for .ini, and then if it matches, exclude all groups from that match.

I could figure out how to ignore just the .ini group, but could not figure out how to get the regex to ignore all groups. Can you please help me figure out the proper regex?

Sample Input String

A sample block of HTML code that I test the regex with.



 
  Index of /images/AAVS
 
 
Index of /images/AAVS
  
   Name Last modified Size Description
   
Parent Directory           -  
20190823/              2019-09-19 19:37    -  
20190826/              2019-09-19 19:31    -  
desktop.ini            2019-09-19 19:24  136

Also, I would like to say that I am sure there is a much better approach. All critique is welcome!

Booboo · Accepted Answer

The regex is (?<=href=")[^"]+(?<!\.ini)(?=")

(?<=href=") Positive lookbehind of href="
[^"]+ Match as many non-double quote characters as you can
(?<!\.ini) Negative lookbehind of .ini
(?=") Positive lookahead of a double quote
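The four pieces above can be tried out in isolation with a small sketch. The HTML below is a hypothetical, shortened listing written for illustration (the tag layout is an assumption), and the name `HREF_NOT_INI` is likewise made up for this example.

```python
import re

# Hypothetical, shortened directory listing (assumed markup, not the real page).
sample_html = '''
<a href="?C=N;O=D">Name</a>
<a href="/images/">Parent Directory</a>
<a href="20190823/">20190823/</a>
<a href="desktop.ini">desktop.ini</a>
'''

# lookbehind for href=" + value + "does not end in .ini" + lookahead for closing quote
HREF_NOT_INI = re.compile(r'(?<=href=")[^"]+(?<!\.ini)(?=")', re.IGNORECASE)

links = HREF_NOT_INI.findall(sample_html)
print(links)
```

Because the negative lookbehind is checked right before the closing quote, a value such as desktop.ini cannot be partially matched either: any shorter candidate fails the lookahead for the double quote.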


The code:

import re

html = """

 
  Index of /images/AAVS
 
 
Index of /images/AAVS
  
   Name Last modified Size Description
   
Parent Directory           -  
20190823/              2019-09-19 19:37    -  
20190826/              2019-09-19 19:31    -  
desktop.ini            2019-09-19 19:24  136  
   

"""

l = re.findall(r'(?<=href=")[^"]+(?<!\.ini)(?=")', html, flags=re.I)
print(l)


Prints:

['?C=N;O=D', '?C=M;O=A', '?C=S;O=A', '?C=D;O=A', '/images/', '20190823/', '20190826/']


The above regex will accept any href value, which is why it returned values such as '?C=N;O=D'. If you wish to restrict it to values that make up the file and folder names you are specifically looking for, you might use a more restrictive regex such as:

(?<=href=")[a-z0-9_./-]+(?<!\.ini)(?=")


This will result in printing:

['/images/', '20190823/', '20190826/']
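As a rough check of the restrictive variant, the sketch below runs it against a hypothetical listing (the markup and the extra notes.txt entry are assumptions added for illustration). It relies on the case-insensitive flag so that the a-z range also covers upper-case names.

```python
import re

# Hypothetical listing; notes.txt is an invented entry to show a file that is kept.
sample_html = '''
<a href="?C=N;O=D">Name</a>
<a href="/images/">Parent Directory</a>
<a href="20190823/">20190823/</a>
<a href="notes.txt">notes.txt</a>
<a href="desktop.ini">desktop.ini</a>
'''

# Only letters, digits, underscore, dot, slash and hyphen are allowed in the value.
RESTRICTED = re.compile(r'(?<=href=")[a-z0-9_./-]+(?<!\.ini)(?=")', re.IGNORECASE)

names = RESTRICTED.findall(sample_html)
print(names)
```

The sort links are dropped here because ? is not in the character class, so the match cannot even start at that href value.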


But, in fact, based on my research, ?C=N;O=D would be a legal filename in the Linux file system. 

You can even accomplish the task without using lookbehind or lookahead:

l = [m.group(1) for m in re.finditer(r'(?:href=")([^"]+)(?:")', html, flags=re.I) if not m.group(1).lower().endswith(".ini")]
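Since the question asked specifically about negative lookahead, here is one way that idea could be sketched. This pattern is not part of the answer above; it is an alternative, shown against a hypothetical listing. The lookahead sits right after href=" and rejects the whole match when the value ends in .ini before the closing quote.

```python
import re

# Hypothetical listing (assumed markup, not the real page).
sample_html = '''
<a href="?C=N;O=D">Name</a>
<a href="/images/">Parent Directory</a>
<a href="20190823/">20190823/</a>
<a href="desktop.ini">desktop.ini</a>
'''

# (?![^"]*\.ini") fails the match up front if the value ends with .ini
LOOKAHEAD = re.compile(r'href="(?![^"]*\.ini")([^"]+)"', re.IGNORECASE)

# With a single capture group, findall returns just the captured values.
kept = LOOKAHEAD.findall(sample_html)
print(kept)
```

Placing the lookahead before the capture group is what excludes the entire match, rather than only the extension part.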
