LiuHao
LiuHao

Reputation: 562

What does the regex [^\s]*? mean?

I am starting to learn python spider to download some pictures on the web and I found the code as follows. I know some basic regex. I knew \.jpg means .jpg and | means or. what's the meaning of [^\s]*? of the first line? I am wondering why using \s? And what's the difference between the two regexes?

http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)

Upvotes: 3

Views: 70878

Answers (1)

Sebastian Lenartowicz
Sebastian Lenartowicz

Reputation: 4874

Alright, so to answer your first question, I'll break down [^\s]*?.

  • The square brackets ([]) indicate a character class. A character class basically means that you want to match anything in the class, at that position, one time. [abc] will match the strings a, b, and c. In this case, your character class is negated using the caret (^) at the beginning - this inverts its meaning, making it match anything but the characters in it.

  • \s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines.

  • *? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times". The ?, when applied to a quantifier, makes it lazy - it will match as little as it can, going from left to right one character at a time.

In this case, what the whole pattern snippet [^\s]*? means is "match any sequence of non-whitespace characters, including the empty string". As mentioned in the comments, this can more succinctly be written as \S*?.

To answer the second part of your question, I'll compare the two regexes you give:

http:[^\s]*?(\.jpg|\.png|\.gif)
http://.*?(\.jpg|\.png|\.gif)

They both start the same way: attempting to match the protocol at the beginning of a URL and the subsequent colon (:) character. The first then matches any string that does not contain any whitespace and ends with the specified file extensions. The second, meanwhile, will match two literal slash characters (/) before matching any sequence of characters followed by a valid extension.

Now, it's obvious that both patterns are meant to match a URL, but both are incorrect. The first pattern, for instance, will match strings like

http:foo.bar.png
http:.png

Both of which are invalid. Likewise, the second pattern will permit spaces, allowing stuff like this:

http:// .jpg
http://foo bar.png

Which is equally illegal in valid URLs. A better regex for this (though I caution strongly against trying to match URLs with regexes) might look like:

https?://\S+\.(jpe?g|png|gif)

In this case, it'll match URLs starting with both http and https, as well as files that end in both variations of jpg.

Upvotes: 42

Related Questions