Reputation: 8550
I'm parsing CSS to get the URLs out of linked style sheets. This is a Java app. (I tried using CSSParser ( http://cssparser.sourceforge.net/ ), but it silently drops many of the rules when it parses.)
So I'm just using a regex. I'd like one that gets me just the URLs and is robust enough to deal with real CSS from the wild:
background-image: url('test/test.gif');
background: url("test2/test2.gif");
background-image: url(test3/test3.gif);
background: url ( test4/ test4.gif );
background: url( " test5/test5.gif" );
You get the idea. This is in Java's regex implementation (not my favorite).
Upvotes: 1
Views: 5042
Reputation: 10183
Can you use ONLY regexes? Your life would be much easier if you used string functions to remove all the whitespace first; then you can write a regex that doesn't have to worry about it.
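Here's a quick one (it assumes the whitespace has already been removed, and might not work very well):
background(-image)?:url\(["']?(.*?)["']?\);
The second capture group should give you what you want. The quantifier is lazy (.*?) so that a closing quote is not swallowed into the capture.
The .*? should probably be replaced with a character class that contains all the characters a valid path can contain.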
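The strip-then-match idea could look roughly like this in Java. This is only a sketch: the class name StrippedCssUrls and the method extractUrls are made up for illustration, and the pattern uses a lazy quantifier in the capture group so that a closing quote is not captured. Note that stripping all whitespace also removes spaces that sit inside a URL, which may or may not be what you want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class StrippedCssUrls {
    // Lazy (.*?) so an optional closing quote is left out of group 2.
    private static final Pattern URL_DECL =
            Pattern.compile("background(-image)?:url\\([\"']?(.*?)[\"']?\\);");

    static List<String> extractUrls(String css) {
        // Remove every whitespace character first, as suggested above.
        // Caveat: this also deletes spaces that are part of a URL.
        String compact = css.replaceAll("\\s+", "");
        List<String> urls = new ArrayList<>();
        Matcher m = URL_DECL.matcher(compact);
        while (m.find()) {
            urls.add(m.group(2)); // group 1 is the optional "-image"
        }
        return urls;
    }
}
```

Calling StrippedCssUrls.extractUrls(css) on the text of a style sheet returns the captured paths in document order.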
Upvotes: 1
Reputation: 26874
The problem with regexes is that they are sometimes stricter than you need. If you had shown us your current, not-quite-working regex, I would have been able to help you more.
First comment: browsers tend to tolerate the majority of HTML/CSS mistakes (NOT JavaScript, which is a programming and not a markup language).
You could start with the background(-image)? token to lock down the first part. How to proceed? Very difficult...
There is always a colon, so you can add it to the constant part of the pattern. Then, judging from your examples (not from the CSS specs), comes a variable amount of whitespace followed by the url token. A variable amount of whitespace is [\s]*, and this becomes part of our regex.
I tried this with RegexBuddy:
background(-image)?: url[\s]*\([\s]*(?<url>[^\)]*)\);
Unfortunately, it captures whitespace inside the URLs:
Matched text: background-image: url('test/test.gif');
Match offset: 0
Match length: 39
Backreference 1: -image
Backreference 1 offset: 10
Backreference 1 length: 6
Backreference 2: 'test/test.gif'
Backreference 2 offset: 22
Backreference 2 length: 15
Matched text: background: url ( test4/ test4.gif );
Match offset: 119
Match length: 39
Backreference 1:
Backreference 1 offset: -1
Backreference 1 length: 0
Backreference 2: test4/ test4.gif
Backreference 2 offset: 138
Backreference 2 length: 18
So, when you get the URL with this, you must trim the string (and, as the first match shows, strip the surrounding quotes). I couldn't exclude whitespace from the url group because of example 4, which has to match a URL with a space in it. That URL is probably not correct in this example anyway, unless you really have a %20test4.gif file.
[Edit] I prefer the following version of the regex
background(-image)?: url[\s]*\([\s]*(?<url>[^\)]*)[\s]*\)[\s]*;
It tolerates more whitespace.
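For what it's worth, here is roughly how that last pattern could be used from Java, including the trimming mentioned above. Treat it as a sketch under assumptions: the class name CssUrlExtractor and the method extractUrls are invented for the example, the quote stripping is my own addition (the regex itself captures the quotes), and named groups such as (?<url>...) need Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class CssUrlExtractor {
    // The preferred regex from above, escaped for a Java string literal.
    private static final Pattern URL_DECL = Pattern.compile(
            "background(-image)?: url[\\s]*\\([\\s]*(?<url>[^\\)]*)[\\s]*\\)[\\s]*;");

    static List<String> extractUrls(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_DECL.matcher(css);
        while (m.find()) {
            // The group may carry surrounding whitespace, so trim it.
            String raw = m.group("url").trim();
            // Assumption: one pair of matching quotes should be dropped.
            if (raw.length() >= 2) {
                char first = raw.charAt(0);
                char last = raw.charAt(raw.length() - 1);
                if ((first == '"' || first == '\'') && first == last) {
                    raw = raw.substring(1, raw.length() - 1).trim();
                }
            }
            urls.add(raw);
        }
        return urls;
    }
}
```

The pattern still expects exactly one space between the colon and url, and it only looks at background and background-image declarations, so url(...) values on other properties are not picked up.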
Upvotes: 6