Reputation: 3602
I'm extracting portions of URLs from text using a regular expression in Python. The URLs I'm looking for come from a limited set of patterns, so it feels like I should be able to handle them all in a regex. What I'm trying to extract is the first portion of the file name ("some.file.name" in all the examples below), which can include dots, letters and digits.
These are the sorts of forms the URL can take:
http://www.example.com/some.file.name.html
http://www.example.com/some.file.name_foo.html
http://www.example.com/some.file.name(123).html
http://www.example.com/some.file.name_foo(123).html
http://www.example.com/some.file.name
http://www.example.com/some.file.name_foo
http://www.example.com/some.file.name(123)
http://www.example.com/some.file.name_foo(123)
I think I'm pretty much there with this regex:
http://www\.example\.com/([a-zA-Z0-9\.]+)(_[a-z]+)?(\(\d+\))?(\.html)?
But the first group includes the ".html" when the URL is like the first one in the list. Is there any way of stopping this, or is it a fundamental limitation of regular expressions?
I'm quite happy to remove the extension in code as it will always be the same and will never be valid as part of the file name, but it would be cleaner to do it as part of the regex match.
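To make the problem concrete, here is a minimal sketch that runs the pattern above over two of the sample URLs and then falls back to stripping the extension in code. The helper name strip_html is only an illustration, not something from a library.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(
    r"http://www\.example\.com/([a-zA-Z0-9\.]+)(_[a-z]+)?(\(\d+\))?(\.html)?"
)

def strip_html(name):
    """Hypothetical helper: drop a trailing ".html" if it is present."""
    suffix = ".html"
    return name[: -len(suffix)] if name.endswith(suffix) else name

plain = PATTERN.search("http://www.example.com/some.file.name.html")
with_foo = PATTERN.search("http://www.example.com/some.file.name_foo.html")

print(plain.group(1))              # some.file.name.html (the greedy group swallows the extension)
print(with_foo.group(1))           # some.file.name ("_" is not in the character class)
print(strip_html(plain.group(1)))  # some.file.name
```

Only the URLs where the name is followed directly by ".html" are affected, because "_" and "(" stop the first group before it reaches the extension.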
Edit:
I should emphasise that these URLs are in bodies of text. I can't make any guarantees about whether there are characters before or after them or what those characters might be. I think it's safe to assume that they won't be numbers, letters, underscores or dots.
Upvotes: 3
Views: 5829
Reputation: 12079
A more generic match where the file name and its extension could be anything:
^(.+?)(\.[a-zA-Z0-9_]*)?$
This non-greedily matches at least one character, then optionally matches a period (.) followed by zero or more letters, digits or underscores (i.e. any character allowed in an extension) before the end of the name.
Test input with all possible file name / extension cases:
name.txt
name.tar.gz
.hidden
period.
plain name
Output for the first matched substring:
name
name.tar
.hidden
period
plain name
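A small Python check of the pattern against the same test input (the variable names are only for illustration):

```python
import re

# The generic "stem + optional extension" pattern from this answer.
GENERIC = re.compile(r"^(.+?)(\.[a-zA-Z0-9_]*)?$")

names = ["name.txt", "name.tar.gz", ".hidden", "period.", "plain name"]

# Group 1 is the part before the last extension.
stems = [GENERIC.match(n).group(1) for n in names]
print(stems)  # ['name', 'name.tar', '.hidden', 'period', 'plain name']
```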
You may prefer to treat ".hidden" as an extension rather than a file name, though. Changing the .+? part into .*? will make ".hidden" be seen as an extension, if you prefer it that way (note, however, that operating systems such as Linux and macOS see this as a file name, not an extension).
If you want to allow any char (except period and space, of course) in the extension, use this instead:
^(.+?)(\.[^ .]*)?$
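As a rough comparison of the variants mentioned in this answer, the sketch below runs them on ".hidden" and on a name whose extension contains a dash. The sample name "archive.tar-gz" and the helper split are made up for illustration.

```python
import re

WORD_EXT = re.compile(r"^(.+?)(\.[a-zA-Z0-9_]*)?$")   # extension: letters, digits, underscores
EMPTY_STEM = re.compile(r"^(.*?)(\.[a-zA-Z0-9_]*)?$")  # .*? lets the stem be empty
ANY_EXT = re.compile(r"^(.+?)(\.[^ .]*)?$")            # extension: anything but space and period

def split(pattern, name):
    """Hypothetical helper: return (stem, extension) as captured by the pattern."""
    m = pattern.match(name)
    return m.group(1), m.group(2)

hidden_default = split(WORD_EXT, ".hidden")
hidden_as_ext = split(EMPTY_STEM, ".hidden")
dash_word = split(WORD_EXT, "archive.tar-gz")
dash_any = split(ANY_EXT, "archive.tar-gz")

print(hidden_default[0])  # .hidden (whole name is the stem)
print(hidden_as_ext)      # ('', '.hidden')
print(dash_word[0])       # archive.tar-gz ("-" is not allowed in the extension)
print(dash_any)           # ('archive', '.tar-gz')
```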
Upvotes: 1
Reputation: 184
It sounds to me that you don't care about the file extension. You just want to extract file names.
Try this one:
http://www\.example\.com/([\w]+\.[\w]+\.[\w()]+)
In PHP, I used preg_match_all($regex, $str, $matches), and it returned something like this:
Array
(
    [0] => Array
        (
            [0] => http://www.example.com/some.file.name
            [1] => http://www.example.com/some.file.name_foo
            [2] => http://www.example.com/some.file.name(123)
            [3] => http://www.example.com/some.file.name_foo(123)
            [4] => http://www.example.com/some.file.name
            [5] => http://www.example.com/some.file.name_foo
            [6] => http://www.example.com/some.file.name(123)
            [7] => http://www.example.com/some.file.name_foo(123)
        )
    [1] => Array
        (
            [0] => some.file.name
            [1] => some.file.name_foo
            [2] => some.file.name(123)
            [3] => some.file.name_foo(123)
            [4] => some.file.name
            [5] => some.file.name_foo
            [6] => some.file.name(123)
            [7] => some.file.name_foo(123)
        )
)
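Since the question is about Python, a rough equivalent using re.findall might look like the sketch below. It uses the pattern with its literal dots escaped, and the sample text is just the first four URLs from the question.

```python
import re

# Pattern from this answer, with the literal dots escaped.
PATTERN = re.compile(r"http://www\.example\.com/([\w]+\.[\w]+\.[\w()]+)")

text = """
http://www.example.com/some.file.name.html
http://www.example.com/some.file.name_foo.html
http://www.example.com/some.file.name(123).html
http://www.example.com/some.file.name_foo(123).html
"""

# With a single capturing group, findall returns just that group's text.
names = PATTERN.findall(text)
print(names)
# ['some.file.name', 'some.file.name_foo', 'some.file.name(123)', 'some.file.name_foo(123)']
```

Note that this drops ".html" but keeps the "_foo" and "(123)" parts in the captured name, and it assumes the name always has exactly three dot-separated pieces.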
Hope it helps!
Upvotes: 0
Reputation: 77474
Regular expression quantifiers are greedy by default.
Try this regexp:
^http://www\.example\.com/([a-zA-Z0-9\.]+?)(_[a-z]+)?(\(\d+\))?(\.html)?$
Notice the extra ? added so that the first group does not capture the .html. It makes the first group capture as little as necessary to match, instead of as much as possible. Without the ?, the .html will be included in the first group: dots and letters are all allowed by its character class, and because the other groups are optional, the match still succeeds when they match nothing.
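A minimal side-by-side check of the greedy and lazy versions over the eight sample URLs from the question (the PREFIX and SUFFIX names are only there to avoid repeating the pattern):

```python
import re

PREFIX = r"^http://www\.example\.com/"
SUFFIX = r"(_[a-z]+)?(\(\d+\))?(\.html)?$"

greedy = re.compile(PREFIX + r"([a-zA-Z0-9\.]+)" + SUFFIX)
lazy = re.compile(PREFIX + r"([a-zA-Z0-9\.]+?)" + SUFFIX)  # note the extra ?

urls = [
    "http://www.example.com/some.file.name.html",
    "http://www.example.com/some.file.name_foo.html",
    "http://www.example.com/some.file.name(123).html",
    "http://www.example.com/some.file.name_foo(123).html",
    "http://www.example.com/some.file.name",
    "http://www.example.com/some.file.name_foo",
    "http://www.example.com/some.file.name(123)",
    "http://www.example.com/some.file.name_foo(123)",
]

greedy_names = [greedy.match(u).group(1) for u in urls]
lazy_names = [lazy.match(u).group(1) for u in urls]

print(greedy_names[0])  # some.file.name.html
print(set(lazy_names))  # {'some.file.name'}
```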
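P.S. Also note that I anchored the regexp using ^ and $ so that it always matches the full line. The $ matters here: it is what forces the lazy first group to extend up to the optional suffixes instead of stopping after a single character.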
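Because the question says the URLs sit inside larger bodies of text, the anchored pattern will not find them with a search. One possible adaptation, which is my assumption rather than part of the answer above, is to replace the anchors with a negative lookahead (?![\w.]), relying on the asker's statement that a URL is never followed by a letter, digit, underscore or dot.

```python
import re

BODY = r"http://www\.example\.com/([a-zA-Z0-9\.]+?)(_[a-z]+)?(\(\d+\))?(\.html)?"

ANCHORED = re.compile("^" + BODY + "$")    # the pattern from this answer
UNANCHORED = re.compile(BODY)              # anchors simply removed
IN_TEXT = re.compile(BODY + r"(?![\w.])")  # assumption: lookahead instead of $

text = (
    "See http://www.example.com/some.file.name_foo(123).html, "
    "then http://www.example.com/some.file.name(123) "
    "and <http://www.example.com/some.file.name.html>"
)

anchored_hit = ANCHORED.search(text)
first_unanchored = UNANCHORED.search(text).group(1)
names = [m.group(1) for m in IN_TEXT.finditer(text)]

print(anchored_hit)      # None: the text does not start with the URL
print(first_unanchored)  # s (the lazy group stops after one character)
print(names)             # ['some.file.name', 'some.file.name', 'some.file.name']
```

A URL that ends a sentence and is followed by a full stop would break the stated assumption, so this is only a sketch under that assumption.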
Upvotes: 3
Reputation: 171824
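You can put the .html extension in a lookahead, (?=...), rather than a capturing group that consumes it, which keeps it out of the overall match. The first group is still greedy, though, so it still swallows the ".html" when nothing else separates the name from the extension: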
http://www\.example\.com/([a-zA-Z0-9\.]+)(_[a-z]+)?(\(\d+\))?(?=(\.html)?)
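A quick check of how this pattern behaves in Python on two of the sample URLs (variable names are illustrative):

```python
import re

# The pattern from this answer: the optional ".html" sits inside a lookahead.
LOOKAHEAD = re.compile(
    r"http://www\.example\.com/([a-zA-Z0-9\.]+)(_[a-z]+)?(\(\d+\))?(?=(\.html)?)"
)

with_foo = LOOKAHEAD.search("http://www.example.com/some.file.name_foo.html")
plain = LOOKAHEAD.search("http://www.example.com/some.file.name.html")

print(with_foo.group(0))  # http://www.example.com/some.file.name_foo
print(with_foo.group(1))  # some.file.name
print(plain.group(1))     # some.file.name.html
```

The lookahead keeps ".html" out of the overall match when "_foo" or "(123)" comes first, but for the plain URL the greedy first group has already consumed the extension.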
Upvotes: 0