alnorth29

Reputation: 3602

Regex match for optional file extension

I'm extracting portions of URLs from text using a regular expression in Python. The URLs I'm looking for come from a limited set of patterns, so it feels like I should be able to handle them with a regex. What I'm trying to extract is the first portion of the file name ("some.file.name" in all the examples below), which can include dots, letters and digits.

These are the sorts of forms the URL can take:

http://www.example.com/some.file.name.html
http://www.example.com/some.file.name_foo.html
http://www.example.com/some.file.name(123).html
http://www.example.com/some.file.name_foo(123).html
http://www.example.com/some.file.name
http://www.example.com/some.file.name_foo
http://www.example.com/some.file.name(123)
http://www.example.com/some.file.name_foo(123)

I think I'm pretty much there with this regex:

http://www\.example\.com/([a-zA-Z0-9\.]+)(_[a-z]+)?(\(\d+\))?(\.html)?

But it includes the ".html" in the match when the URL is like the first one in the list. Is there any way of stopping this or is it a fundamental limitation of regular expressions?
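The behaviour described above can be reproduced with a short Python check. This is only a sketch: the variable names are illustrative, and the pattern is the one quoted in the question.

```python
import re

# Pattern from the question; the + in the first group is greedy.
GREEDY = re.compile(
    r"http://www\.example\.com/([a-zA-Z0-9\.]+)(_[a-z]+)?(\(\d+\))?(\.html)?"
)

with_ext = GREEDY.match("http://www.example.com/some.file.name.html")
with_suffix = GREEDY.match("http://www.example.com/some.file.name_foo.html")

# The character class allows dots, so group 1 swallows ".html" whenever
# nothing (such as "_foo") separates the extension from the name.
print(with_ext.group(1))     # some.file.name.html
print(with_suffix.group(1))  # some.file.name
```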

I'm quite happy to remove the extension in code as it will always be the same and will never be valid as part of the file name, but it would be cleaner to do it as part of the regex match.

Edit:

I should emphasise that these URLs are in bodies of text. I can't make any guarantees about whether there are characters before or after them or what those characters might be. I think it's safe to assume that they won't be numbers, letters, underscores or dots.

Upvotes: 3

Views: 5829

Answers (4)

Thomas Tempelmann

Reputation: 12079

A more generic match where the file name and its extension could be anything:

^(.+?)(\.[a-zA-Z0-9_]*)?$

This lazily (non-greedily) matches at least one character for the name, then optionally matches a period (.) followed by zero or more letters, digits or underscores (i.e. any character allowed in an extension) at the end of the name.

Test input with all possible file name / extension cases:

name.txt
name.tar.gz
.hidden
period.
plain name

Output for the first matched substring:

name
name.tar
.hidden
period
plain name

You may prefer to treat ".hidden" as an extension rather than as a file name, though. Changing the .+? part into .*? will make ".hidden" be seen as an extension, if you prefer it that way (note, however, that operating systems such as Linux and macOS see this as a file name, not an extension).

If you want to allow any char (except period and space, of course) in the extension, use this instead:

^(.+?)(\.[^ .]*)?$
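The three patterns above can be compared side by side in Python. This is a minimal sketch: the helper split_name, the dictionary labels and the extra input "archive.tar-gz" are illustrative additions, not part of the answer.

```python
import re

# Test inputs from the answer, plus one hypothetical name with a hyphen
# in its extension to show where the third pattern differs.
NAMES = ["name.txt", "name.tar.gz", ".hidden", "period.", "plain name",
         "archive.tar-gz"]

PATTERNS = {
    "lazy name": r"^(.+?)(\.[a-zA-Z0-9_]*)?$",
    "empty name allowed": r"^(.*?)(\.[a-zA-Z0-9_]*)?$",
    "any extension char": r"^(.+?)(\.[^ .]*)?$",
}

def split_name(pattern, filename):
    """Return (name, extension); extension is None if the group did not match."""
    m = re.match(pattern, filename)
    return m.group(1), m.group(2)

for label, pattern in PATTERNS.items():
    print(label)
    for filename in NAMES:
        print("  ", repr(filename), "->", split_name(pattern, filename))
```

With the first pattern, "archive.tar-gz" is returned whole as the name because the hyphen is not in the extension character class; the third pattern splits it into "archive" and ".tar-gz".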

Upvotes: 1

Codelism

Reputation: 184

It sounds to me like you don't care about the file extension. You just want to extract file names.

Try this one:

http://www\.example\.com/(\w+\.\w+\.[\w()]+)

In PHP, I used preg_match_all($regex, $str, $matches), which returned something like this:

Array
(
    [0] => Array
        (
            [0] => http://www.example.com/some.file.name
            [1] => http://www.example.com/some.file.name_foo
            [2] => http://www.example.com/some.file.name(123)
            [3] => http://www.example.com/some.file.name_foo(123)
            [4] => http://www.example.com/some.file.name
            [5] => http://www.example.com/some.file.name_foo
            [6] => http://www.example.com/some.file.name(123)
            [7] => http://www.example.com/some.file.name_foo(123)
        )

    [1] => Array
        (
            [0] => some.file.name
            [1] => some.file.name_foo
            [2] => some.file.name(123)
            [3] => some.file.name_foo(123)
            [4] => some.file.name
            [5] => some.file.name_foo
            [6] => some.file.name(123)
            [7] => some.file.name_foo(123)
        )

)
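Since the question is about Python, a rough equivalent of the preg_match_all call might look like the sketch below. It assumes the eight URLs from the question appear one per line in a string (TEXT is an illustrative name), and it writes the dots in the pattern as escaped literals.

```python
import re

# The eight URL forms from the question, one per line.
TEXT = """
http://www.example.com/some.file.name.html
http://www.example.com/some.file.name_foo.html
http://www.example.com/some.file.name(123).html
http://www.example.com/some.file.name_foo(123).html
http://www.example.com/some.file.name
http://www.example.com/some.file.name_foo
http://www.example.com/some.file.name(123)
http://www.example.com/some.file.name_foo(123)
"""

PATTERN = r"http://www\.example\.com/(\w+\.\w+\.[\w()]+)"

# With a single capturing group, re.findall returns the group-1 strings,
# which corresponds to the second array in the PHP output.
names = re.findall(PATTERN, TEXT)
for name in names:
    print(name)
```

Note that the captured text still contains the "_foo" and "(123)" parts, and that the pattern assumes a name made of exactly three dot-separated pieces.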

Hope it helps!

Upvotes: 0

Has QUIT--Anony-Mousse

Reputation: 77474

Regular expression quantifiers are greedy by default.

Try this regexp:

^http://www\.example\.com/([a-zA-Z0-9\.]+?)(_[a-z]+)?(\(\d+\))?(\.html)?$

Notice the extra ? added after the + so that the first group does not capture the .html. It makes the first group capture as little as necessary for the whole pattern to match, instead of as much as possible. Without the ?, the .html will be included in the first group: the other groups are optional, and a greedy quantifier takes as many characters as it can, leaving nothing for the later groups to match.

P.S. Also note that I anchored the regexp using ^ and $ to always match the full line.
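A quick way to check the lazy pattern against all eight URL forms from the question is sketched below. The second pattern, IN_TEXT, is a hypothetical variant and not part of this answer: it drops the ^ and $ anchors and uses a negative lookahead instead, relying on the asker's assumption that a URL is never followed by a letter, digit, underscore or dot.

```python
import re

BASE = "http://www.example.com/some.file.name"
SUFFIXES = ["", "_foo", "(123)", "_foo(123)"]
# The eight forms from the question: four with ".html", four without.
URLS = [BASE + s + ext for ext in (".html", "") for s in SUFFIXES]

ANCHORED = re.compile(
    r"^http://www\.example\.com/([a-zA-Z0-9\.]+?)(_[a-z]+)?(\(\d+\))?(\.html)?$"
)
names = [ANCHORED.match(url).group(1) for url in URLS]
print(names)  # eight copies of "some.file.name"

# Hypothetical variant for URLs embedded in running text: no anchors, and
# the match may only end where the next character cannot belong to the URL.
IN_TEXT = re.compile(
    r"http://www\.example\.com/([a-zA-Z0-9\.]+?)(_[a-z]+)?(\(\d+\))?(\.html)?"
    r"(?![a-zA-Z0-9_.])"
)
sample = ("See http://www.example.com/some.file.name_foo(123).html, then "
          "http://www.example.com/some.file.name for more.")
found = [m.group(1) for m in IN_TEXT.finditer(sample)]
print(found)  # ['some.file.name', 'some.file.name']
```

The lookahead plays the role that $ plays in the anchored pattern: without some condition on what follows, the lazy first group would stop after a single character.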

Upvotes: 3

Philippe Leybaert

Reputation: 171824

You can put the .html extension in a lookahead, written (?=...), so that it is checked but not consumed by the match:

http://www\.example\.com/([a-zA-Z0-9\.]+)(_[a-z]+)?(\(\d+\))?(?=(\.html)?)
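Trying the pattern above in Python suggests that the lookahead alone is not enough: its content is optional, so it succeeds anywhere, and the greedy first group still takes the ".html". The sketch below shows that, followed by a hypothetical adjustment (ADJUSTED is not from this answer) that makes the first group lazy and requires the lookahead to see either ".html", a character that cannot be part of the name, or the end of the input.

```python
import re

URL = "http://www.example.com/some.file.name.html"

# Pattern as posted: the optional lookahead matches even on empty input.
AS_POSTED = re.compile(
    r"http://www\.example\.com/([a-zA-Z0-9\.]+)(_[a-z]+)?(\(\d+\))?(?=(\.html)?)"
)
print(AS_POSTED.match(URL).group(1))  # some.file.name.html

# Hypothetical adjustment: lazy first group plus a lookahead that must
# actually find something that ends the name.
ADJUSTED = re.compile(
    r"http://www\.example\.com/([a-zA-Z0-9\.]+?)(_[a-z]+)?(\(\d+\))?"
    r"(?=\.html|[^a-zA-Z0-9_.]|$)"
)
m = ADJUSTED.match(URL)
print(m.group(1))  # some.file.name
print(m.group(0))  # http://www.example.com/some.file.name
```

Because a lookahead does not consume characters, the overall match (group 0) of the adjusted pattern stops before ".html" as well.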

Upvotes: 0
