Jared Smith

Reputation: 22029

Make regex match dotfiles accurately

So I have hit the limit of my regex abilities with this. I have a Python regex that matches a file path or file URI, with named capture groups for the various parts. It seems to be working fine, except on dotfiles.

import re

MATCH_PATH = re.compile(
    r"^(?P<uri>file://)?" +             # optional file uri
    r"(?P<path>(?:/?[A-Z]{1}:)?" +      # start of path capture, optional windows top-level directory
    r"[\\/]?" +                         # optional start separator
    r"(?:[\w \-\.]+[\\/])+)" +          # path
    r"(?P<filename>[\w \-]+)?" +        # optional filename
    r"\.?(?P<extension>[a-zA-Z0-9]+)?$" # extension optional
)

I can make it match dotfiles by removing the optional qualifier ? after the . in the extension portion, but then it can't match files without an extension (e.g. makefile) or directories. I also tried placing a non-capturing group around the dot and the extension group and making that group optional, but that didn't work: the extension gets grouped with the filename. Can I tweak this to match the extension and name correctly in all cases while still matching directories?

Example inputs that should be matched:

/foo/bar.txt
/foo/bar/
/foo/makefile
./foo.txt
/foo/._bar.txt
foo/bar.txt
D:\foo\bar.m3u
file:///var/www/html/index.html
file:///C:/users/me/My Documents/index.html

UPDATE

Also needs to correctly match

/foo/bar.tar.gz
/foo/._bar.tar.gz

The extension should be tar.gz in both cases, and the names should be bar and ._bar respectively. Also, please let me know if this is too complex for a regex; I can write procedural code to split and process the path instead.

Upvotes: 2

Views: 334

Answers (2)

Jan

Reputation: 43189

You can use a named capture group inside a lookahead, like so:

^
(?P<uri>file://)?
(?P<path>(?:/?[A-Z]{1}:)?           # start of path capture, optional windows top-level directory
[\\/]?                              # optional start separator
(?:[-. \w]+[\\/])+)                 # path
(?P<filename>\.?[^.]+?(?=\.(?P<extension>.+$)|$))?

See a demo on regex101.com.


The only thing I changed is the filename group:

(?P<filename>\.?[^.]+?(?=\.(?P<extension>.+$)|$))

It uses a lazy negated character class ([^.]+?) followed by a positive lookahead, which looks either for .some_extension (saving it to extension) or for the end of the line.
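To try this from Python, here is a minimal sketch. It assumes the pattern is compiled with re.VERBOSE, since it is laid out over several lines with # comments. The split_path helper is a hypothetical name added for illustration, not part of the original answer.

```python
import re

# The pattern from above, compiled with re.VERBOSE so that the multi-line
# layout and the "#" comments are allowed inside the pattern.
MATCH_PATH = re.compile(
    r"""
    ^
    (?P<uri>file://)?
    (?P<path>(?:/?[A-Z]{1}:)?   # start of path capture, optional windows drive
    [\\/]?                      # optional start separator
    (?:[-. \w]+[\\/])+)         # path
    (?P<filename>\.?[^.]+?(?=\.(?P<extension>.+$)|$))?
    """,
    re.VERBOSE,
)


def split_path(text):
    """Hypothetical helper: return (path, filename, extension), or None if no match."""
    m = MATCH_PATH.match(text)
    if m is None:
        return None
    return m.group("path", "filename", "extension")


if __name__ == "__main__":
    samples = [
        "/foo/bar.txt",
        "/foo/bar/",
        "/foo/makefile",
        "/foo/._bar.tar.gz",
        r"D:\foo\bar.m3u",
    ]
    for sample in samples:
        print(sample, "->", split_path(sample))
```

Groups that do not take part in the match (for example filename and extension for a bare directory such as /foo/bar/) come back as None.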

Upvotes: 1

m_callens

Reputation: 6360

I managed to clean it up a bit and get the regex to match all of your sample data. Here is the testing environment, where you can see it working with the different capturing groups.

^(?P<uri>file:\/\/\/)?
(?P<path>(?:\/|\\|\.)?(?:[A-Z]:(?:\/|\\))?(?:[\w \-\.]+[\/\\])+)
(?P<file>\.?[\_\w ]+)?
(?P<extension>\.[\w\d]+)?$

I think the main issue with the one you have is that the file capturing group doesn't allow for a leading dot. To remedy that, I added an optional leading . to the file group and worked around that.

The other small change I made was including the extension's preceding . in the extension group, but that can be changed if you want.
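For reference, a minimal sketch of running this pattern in Python. It assumes the four lines above are meant to be joined into a single pattern. The names PATTERN and parts are hypothetical, chosen only for this example.

```python
import re

# The four lines of the pattern above, concatenated into one regex.
PATTERN = re.compile(
    r"^(?P<uri>file:\/\/\/)?"
    r"(?P<path>(?:\/|\\|\.)?(?:[A-Z]:(?:\/|\\))?(?:[\w \-\.]+[\/\\])+)"
    r"(?P<file>\.?[\_\w ]+)?"
    r"(?P<extension>\.[\w\d]+)?$"
)


def parts(text):
    """Hypothetical helper: return (path, file, extension), or None if no match."""
    m = PATTERN.match(text)
    if m is None:
        return None
    return m.group("path", "file", "extension")


if __name__ == "__main__":
    for sample in ["/foo/bar.txt", "/foo/._bar.txt", "/foo/makefile", "/foo/bar.tar.gz"]:
        print(sample, "->", parts(sample))
```

Note that the extension group keeps its leading dot here. Also, as far as I can tell, the pattern as written accepts only a single .ext suffix, so the double-extension inputs from the UPDATE (such as /foo/bar.tar.gz) do not match at all; the extension group would need to allow further dots to cover them.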

Upvotes: 1
