Reputation: 562
I am starting to learn how to write a Python spider to download pictures from the web, and I found the code below. I know some basic regex: I know that \.jpg matches a literal .jpg and that | means "or". What does [^\s]*? in the first line mean, and why is \s used there?
And what's the difference between the two regexes?
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
Upvotes: 3
Views: 70878
Reputation: 4874
Alright, so to answer your first question, I'll break down [^\s]*? piece by piece.
The square brackets ([]) indicate a character class. A character class matches exactly one character, at that position, out of the set it lists: [abc] will match any one of a, b, or c. In this case, your character class is negated by the caret (^) at the beginning, which inverts its meaning so that it matches any single character except the ones listed.
\s is fairly simple: it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.
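As a quick check of the two ideas so far, here is a minimal sketch using Python's re module. The sample strings and variable names are made up for illustration.

```python
import re

# A character class matches exactly one character from its set.
in_class = re.findall(r"[abc]", "a cab")        # ['a', 'c', 'a', 'b']

# A leading caret negates the class: one character NOT in the set.
not_in_class = re.findall(r"[^abc]", "a cab")   # [' ']

# \s matches a single whitespace character (space, tab, newline, ...).
whitespace = re.findall(r"\s", "a b\tc\nd")     # [' ', '\t', '\n']

# [^\s] is therefore "one non-whitespace character", the same as \S.
non_ws_class = re.findall(r"[^\s]", "a b\tc")   # ['a', 'b', 'c']
non_ws_short = re.findall(r"\S", "a b\tc")      # ['a', 'b', 'c']

print(in_class, not_in_class, whitespace, non_ws_class, non_ws_short)
```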
*? is a little harder to explain. The * quantifier is fairly simple: it means "match this token (the character class, in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy: it matches as few characters as it can, extending the match one character at a time only when the rest of the pattern would otherwise fail.
In this case, the whole pattern snippet [^\s]*? means "match the shortest possible run of non-whitespace characters, including the empty string". As mentioned in the comments, this can be written more succinctly as \S*? instead.
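To see what the lazy ? changes in practice, here is a small sketch comparing the lazy and greedy forms on a made-up URL (the sample string and variable names are assumptions for illustration only). The sample contains no whitespace, so only laziness decides where the match ends.

```python
import re

# Hypothetical sample: a URL in which ".jpg" appears twice.
sample = "http://a.example/x.jpg?next=y.jpg"

# Lazy: stop at the first place where \.jpg can match.
lazy = re.search(r"http:[^\s]*?\.jpg", sample).group()

# Greedy: consume as much as possible, then back off to the last \.jpg.
greedy = re.search(r"http:[^\s]*\.jpg", sample).group()

# \S*? is the shorter spelling of [^\s]*? and behaves identically.
shorthand = re.search(r"http:\S*?\.jpg", sample).group()

print(lazy)      # http://a.example/x.jpg
print(greedy)    # http://a.example/x.jpg?next=y.jpg
print(shorthand == lazy)  # True
```

When the URL is followed by whitespace, both forms stop at the whitespace at the latest, since [^\s] cannot cross it.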
To answer the second part of your question, I'll compare the two regexes you give:
http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)
They both start the same way, matching the literal text http and the subsequent colon (:) character. The first then matches any run of non-whitespace characters that ends with one of the specified file extensions. The second, meanwhile, requires two literal slash characters (//) and then matches any sequence of characters, spaces included (by default, . excludes only newlines), followed by one of the extensions.
Now, it's obvious that both patterns are meant to match a URL, but both are too permissive. The first pattern, for instance, will match strings like these:
http:foo.bar.png
http:.png
Neither of these is a valid URL. Likewise, the second pattern permits spaces, allowing strings like these:
http:// .jpg
http://foo bar.png
Spaces are not allowed in valid URLs, so these are equally wrong. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:
https?://\S+\.(jpe?g|png|gif)
In this case, it'll match URLs starting with either http or https, as well as files ending in either spelling of the JPEG extension (.jpg or .jpeg).
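The comparison can be checked with a short sketch like the one below. The three patterns are the ones discussed in this answer; the names first, second, better, the accepts helper, and the sample text are made up for illustration, and fullmatch is used so that each test string has to match in its entirety.

```python
import re

first = re.compile(r"http:[^\s]*?(\.jpg|\.png|\.gif)")
second = re.compile(r"http://.*?(\.jpg|\.png|\.gif)")
better = re.compile(r"https?://\S+\.(jpe?g|png|gif)")

def accepts(pattern, s):
    """Return True if the whole string s matches the pattern."""
    return pattern.fullmatch(s) is not None

samples = [
    "http:foo.bar.png",                  # no slashes
    "http:.png",                         # no host at all
    "http:// .jpg",                      # contains a space
    "http://foo bar.png",                # contains a space
    "https://example.com/a/photo.jpeg",  # hypothetical valid URL
]

for s in samples:
    print(repr(s), accepts(first, s), accepts(second, s), accepts(better, s))

# Extracting URLs from running text. finditer with group(0) returns the
# full match; findall would return only the captured extension here.
text = "see https://example.com/a.png and https://example.com/b.jpeg here"
urls = [m.group(0) for m in better.finditer(text)]
print(urls)  # ['https://example.com/a.png', 'https://example.com/b.jpeg']
```

Only the last sample is accepted by the improved pattern, and it is rejected by the two original ones because they do not allow https.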
Upvotes: 42