Reputation: 22029
So I have hit the limit of my regex abilities with this. I have here a Python regex to match a file path or file URI, with named capture groups for the various parts. It seems to be working fine, except on dotfiles.
import re

MATCH_PATH = re.compile(
r"^(?P<uri>file://)?" + # optional file uri
r"(?P<path>(?:/?[A-Z]{1}:)?" + # start of path capture, optional windows top-level directory
r"[\\/]?" + # optional start separator
r"(?:[\w \-\.]+[\\/])+)" + # path
r"(?P<filename>[\w \-]+)?" + # optional filename
r"\.?(?P<extension>[a-zA-Z0-9]+)?$" # extension optional
)
I can make it match dotfiles by removing the optional qualifier ? after the . in the extension portion, but then it can't match files without an extension (e.g. makefile) or directories. I also tried placing a non-capturing group with the optional qualifier around the dot and the extension group, but that didn't work: the extension gets grouped with the filename. Can I tweak this to match the extension and name correctly in all cases while still matching directories?
Example inputs that should be matched:
/foo/bar.txt
/foo/bar/
/foo/makefile
./foo.txt
/foo/._bar.txt
foo/bar.txt
D:\foo\bar.m3u
file:///var/www/html/index.html
file:///C:/users/me/My Documents/index.html
Also needs to correctly match
/foo/bar.tar.gz
/foo/._bar.tar.gz
with the extension being tar.gz and the names being bar and ._bar respectively. Also, please let me know if this is too complex for regex, and I can write procedural code to split and process instead.
Upvotes: 2
Views: 334
Reputation: 43189
You may very well use named capture groups in a lookahead, like so (in verbose mode):
^
(?P<uri>file://)?
(?P<path>(?:/?[A-Z]{1}:)? # start of path capture, optional windows top-level directory
[\\/]? # optional start separator
(?:[-. \w]+[\\/])+) # path
(?P<filename>\.?[^.]+?(?=\.(?P<extension>.+$)|$))?
The important part is the filename group:
(?P<filename>\.?[^.]+?(?=\.(?P<extension>.+$)|$))
It uses a lazy negated character class ([^.]+?) followed by a positive lookahead, which looks either for .some_extension (then saving it to extension) or for the end of the line.
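As a rough check, here is a minimal sketch of the pattern above compiled in Python with re.VERBOSE (needed because of the layout whitespace and # comments) and run against the inputs from the question. The helper name split_path is made up for illustration and is not part of the original answer.

```python
import re

# The pattern from this answer. re.VERBOSE makes the regex engine ignore
# the layout whitespace and the # comments; the space inside the
# character class [-. \w] is still significant.
MATCH_PATH = re.compile(
    r"""
    ^
    (?P<uri>file://)?
    (?P<path>(?:/?[A-Z]{1}:)?   # start of path capture, optional Windows drive
    [\\/]?                      # optional start separator
    (?:[-. \w]+[\\/])+)         # path
    (?P<filename>\.?[^.]+?(?=\.(?P<extension>.+$)|$))?
    """,
    re.VERBOSE,
)


def split_path(text):
    """Hypothetical helper: return (path, filename, extension) or None."""
    m = MATCH_PATH.match(text)
    if m is None:
        return None
    return m.group("path", "filename", "extension")


samples = [
    "/foo/bar.txt",
    "/foo/bar/",
    "/foo/makefile",
    "./foo.txt",
    "/foo/._bar.txt",
    "foo/bar.txt",
    r"D:\foo\bar.m3u",
    "file:///var/www/html/index.html",
    "file:///C:/users/me/My Documents/index.html",
    "/foo/bar.tar.gz",
    "/foo/._bar.tar.gz",
]

for sample in samples:
    print(sample, "->", split_path(sample))
```

Note that the extension group sits inside the lookahead, so it is captured without being consumed: the overall match ends after the filename, but m.group("extension") is still populated. For example, /foo/._bar.tar.gz splits into path /foo/, filename ._bar and extension tar.gz, while /foo/bar/ yields only a path.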
Upvotes: 1
Reputation: 6360
I managed to clean it up a bit and get the regex to match all of your sample data. Here is the testing environment so you can see it working with the different capturing groups.
^(?P<uri>file:\/\/\/)?
(?P<path>(?:\/|\\|\.)?(?:[A-Z]:(?:\/|\\))?(?:[\w \-\.]+[\/\\])+)
(?P<file>\.?[\_\w ]+)?
(?P<extension>\.[\w\d]+)?$
I think the main issue with the one you have is that you aren't allowing for a leading . in the file capturing group. To remedy that, I added an optional leading . to the file group and worked around that.
The other small change I made was including the extension's preceding . in the extension group, but that can be changed if you want.
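For anyone trying this in Python, here is a small sketch that assumes the four lines of the pattern are meant to be joined into a single expression. The names MATCH_PATH_2 and split_path_2 are invented for this example.

```python
import re

# The four lines of the pattern above, concatenated into one expression
# (an assumption about how the answer intends them to be combined).
MATCH_PATH_2 = re.compile(
    r"^(?P<uri>file:\/\/\/)?"
    r"(?P<path>(?:\/|\\|\.)?(?:[A-Z]:(?:\/|\\))?(?:[\w \-\.]+[\/\\])+)"
    r"(?P<file>\.?[\_\w ]+)?"
    r"(?P<extension>\.[\w\d]+)?$"
)


def split_path_2(text):
    """Hypothetical helper: return (path, file, extension) or None."""
    m = MATCH_PATH_2.match(text)
    if m is None:
        return None
    return m.group("path", "file", "extension")


for sample in ["/foo/bar.txt", "/foo/._bar.txt", "/foo/makefile",
               "/foo/bar/", "/foo/bar.tar.gz"]:
    print(sample, "->", split_path_2(sample))
```

Under that assumption, the single-extension samples split as described, with the dot kept in the extension group. Two things to be aware of: the uri group consumes all three slashes of file:///, so the captured path for file:///var/www/html/index.html starts at var/ rather than /var/; and because the extension group allows only one dot-separated part, the /foo/bar.tar.gz and /foo/._bar.tar.gz samples from the question do not match as written.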
Upvotes: 1