mtyson

Reputation: 8550

(Java) RegEx to get the URLs from CSS?

I'm parsing CSS to get the URLs out of linked style sheets. This is a Java app. (I tried CSSParser (http://cssparser.sourceforge.net/), but it silently drops many of the rules when it parses.)

So I'm just using regex. I'd like a regex that gets me just the URLs and is robust enough to deal with real CSS from the wild:

background-image: url('test/test.gif');
background: url("test2/test2.gif");
background-image: url(test3/test3.gif);
background: url   ( test4/ test4.gif );
background: url( " test5/test5.gif"   );

You get the idea. This is in Java's regex implementation (not my favorite).

Upvotes: 1

Views: 5042

Answers (2)

DanielGibbs

Reputation: 10183

Do you have to use ONLY regexes? Your life would be much easier if you first used string functions to remove all the spaces; then you can write a regex that doesn't have to worry about whitespace.

Here's a quick one; it might not work very well:

background(-image)?:url\(["']?(.*?)["']?\);

The second capture group should give you what you want.

The .* should probably be replaced with a character class that contains all the characters a valid path can contain.
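
As a rough illustration of the strip-then-match idea in Java, here is a minimal sketch. The class and method names (StrippedCssUrls, extract) are made up for this example. It uses a lazy (.*?) in the capture group, since a greedy .* would swallow the closing quote and, once the line breaks are gone, run on to the last ");" in the input. Note that removing all whitespace also removes spaces that sit inside a URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this sketch only.
public class StrippedCssUrls {
    // Lazy (.*?) so the optional closing quote stays outside capture group 2.
    private static final Pattern URL_RULE =
            Pattern.compile("background(-image)?:url\\([\"']?(.*?)[\"']?\\);");

    /** Removes all whitespace from the CSS, then collects group 2 of each match. */
    public static List<String> extract(String css) {
        // Assumption: no URL relies on embedded whitespace, because it is lost here.
        String stripped = css.replaceAll("\\s+", "");
        List<String> urls = new ArrayList<>();
        Matcher m = URL_RULE.matcher(stripped);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = "background-image: url('test/test.gif');\n"
                + "background: url   ( test4/ test4.gif );\n"
                + "background: url( \" test5/test5.gif\"   );";
        System.out.println(extract(css));
    }
}
```

For the three sample rules in main, this prints [test/test.gif, test4/test4.gif, test5/test5.gif]; the space inside the fourth example's path is gone, which may or may not be what you want.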

Upvotes: 1

usr-local-ΕΨΗΕΛΩΝ

Reputation: 26874

The problem with regexes is that they are sometimes stricter than you need. If you had shown us your current, not-quite-working regex, I would have been able to help you more.

First comment: browsers tend to tolerate the majority of HTML/CSS mistakes (NOT JavaScript, which is a programming language and not a markup language).

You could start with the background(-image)? token to lock the first part. How to proceed? Very difficult...

You always have a colon, so you can add it to the constant part of the token. Then, judging from your examples (not from the CSS specs), comes a variable amount of whitespace around the url token. A variable amount of whitespace is [\s]*, and this becomes part of our regex.

I tried this with RegexBuddy:

background(-image)?: url[\s]*\([\s]*(?<url>[^\)]*)\);

Unfortunately, it captures whitespace inside URLs:

Matched text: background-image: url('test/test.gif');
Match offset: 0
Match length: 39
Backreference 1: -image
Backreference 1 offset: 10
Backreference 1 length: 6
Backreference 2: 'test/test.gif'
Backreference 2 offset: 22
Backreference 2 length: 15

Matched text: background: url   ( test4/ test4.gif );
Match offset: 119
Match length: 39
Backreference 1: 
Backreference 1 offset: -1
Backreference 1 length: 0
Backreference 2:  test4/ test4.gif 
Backreference 2 offset: 138
Backreference 2 length: 18

So, when you get the URL with this, you must trim the string. I couldn't exclude whitespace from the url group because of example 4: it should still match, but it contains a space inside the URL. That URL is not really correct in this example unless you actually have a file named %20test4.gif.

[Edit] I prefer the following version of the regex:

background(-image)?: url[\s]*\([\s]*(?<url>[^\)]*)[\s]*\)[\s]*;

It tolerates more whitespace.
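
Here is a minimal sketch of how the edited regex could be used from Java, assuming Java 7 or later for the named group. The class and method names (CssUrlExtractor, extract, stripQuotes) are invented for this example. Because the url group still contains the surrounding quotes and any trailing whitespace, the sketch trims the capture and removes one pair of matching quotes afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this sketch only.
public class CssUrlExtractor {
    // The edited regex from this answer, escaped for a Java string literal.
    // Named groups such as (?<url>...) need Java 7 or later.
    private static final Pattern URL_RULE = Pattern.compile(
            "background(-image)?: url[\\s]*\\([\\s]*(?<url>[^\\)]*)[\\s]*\\)[\\s]*;");

    /** Returns the url group of every match, trimmed and without surrounding quotes. */
    public static List<String> extract(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_RULE.matcher(css);
        while (m.find()) {
            String raw = m.group("url").trim();
            // Trim again: whitespace may sit inside the quotes, as in example 5.
            urls.add(stripQuotes(raw).trim());
        }
        return urls;
    }

    /** Drops one pair of matching single or double quotes, if present. */
    private static String stripQuotes(String s) {
        if (s.length() >= 2) {
            char first = s.charAt(0);
            char last = s.charAt(s.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                return s.substring(1, s.length() - 1);
            }
        }
        return s;
    }

    public static void main(String[] args) {
        String css = "background-image: url('test/test.gif');\n"
                + "background: url(\"test2/test2.gif\");\n"
                + "background-image: url(test3/test3.gif);\n"
                + "background: url   ( test4/ test4.gif );\n"
                + "background: url( \" test5/test5.gif\"   );";
        System.out.println(extract(css));
    }
}
```

With the five rules from the question this prints [test/test.gif, test2/test2.gif, test3/test3.gif, test4/ test4.gif, test5/test5.gif]. The pattern expects exactly one space after the colon and a trailing semicolon, so rules written as background:url(...) or url() values in other properties are not matched.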

Upvotes: 6
