(Java) RegEx to get the URLs from CSS?

Question

I'm parsing CSS to get the URLs out of linked style sheets. This is a Java app. (I tried using CSSParser ( http://cssparser.sourceforge.net/ ), but it silently drops many of the rules when it parses.)

So I'm just using regex. I'd like a regex that gets me just the URLs and is robust enough to deal with real CSS from the wild:

background-image: url('test/test.gif');
background: url("test2/test2.gif");
background-image: url(test3/test3.gif);
background: url   ( test4/ test4.gif );
background: url( " test5/test5.gif"   );

You get the idea. This is in Java's regex implementation (not my favorite).

usr-local-ΕΨΗΕΛΩΝ · Accepted Answer

The problem with regexes is that they are sometimes stricter than you need. If you had shown us your current, not-quite-working regex, I would have been able to help you more.

First comment: browsers tend to tolerate the majority of HTML/CSS mistakes (but NOT JavaScript mistakes, since JavaScript is a programming language, not a markup language).

You could start with the background(-image)? token to lock the first part. How to proceed? Very difficult...

You always have a colon, so you can add it to the constant part of the token. Then, judging from your examples (not from the CSS specs), comes a variable number of whitespace characters followed by the url token. A variable number of whitespace characters is [\s]*, and this becomes part of our regex.

I tried this with RegexBuddy

background(-image)?: url[\s]*\([\s]*([^\)]*)\);

Unfortunately, it captures whitespace inside the URLs:

Matched text: background-image: url('test/test.gif');
Match offset: 0
Match length: 39
Backreference 1: -image
Backreference 1 offset: 10
Backreference 1 length: 6
Backreference 2: 'test/test.gif'
Backreference 2 offset: 22
Backreference 2 length: 15

Matched text: background: url   ( test4/ test4.gif );
Match offset: 119
Match length: 39
Backreference 1: 
Backreference 1 offset: -1
Backreference 1 length: 0
Backreference 2:  test4/ test4.gif 
Backreference 2 offset: 138
Backreference 2 length: 18

So when you extract the URL with this regex, you must trim the string. I couldn't exclude whitespace from the URL group because of example 4, which has to be matched as a URL with whitespace in it. That URL isn't actually correct in this example unless you really have a file named %20test4.gif.
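The trimming step could look roughly like the sketch below. The class and method names (CssUrlCleaner, clean) are hypothetical and not part of the original answer, and stripping one pair of surrounding quotes is an extra assumption on top of the plain trim described above. Whitespace inside the URL itself (as in example 4) is left untouched.

```java
// Hypothetical helper for illustration; not part of the original answer.
final class CssUrlCleaner {

    /** Trims the captured group and removes one pair of matching quotes, if present. */
    static String clean(String captured) {
        String s = captured.trim();
        if (s.length() >= 2) {
            char first = s.charAt(0);
            char last = s.charAt(s.length() - 1);
            if ((first == '\'' || first == '"') && first == last) {
                // Drop the quotes, then trim again for cases like url( " test5/test5.gif"   )
                s = s.substring(1, s.length() - 1).trim();
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(clean(" test4/ test4.gif "));          // inner space is kept
        System.out.println(clean(" \" test5/test5.gif\"   "));    // quotes and padding removed
    }
}
```

Whether the inner space in example 4 should be rejected, removed, or percent-encoded is a separate decision that this sketch does not make.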

[Edit] I prefer the following version of the regex

background(-image)?: url[\s]*\([\s]*([^\)]*)[\s]*\)[\s]*;

It tolerates more whitespace.
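For the Java side, here is a minimal sketch of how a pattern like this might be applied with java.util.regex. The class name CssBackgroundUrls is hypothetical, and the pattern string is one reading of the regex above, assuming literal escaped parentheses around the URL group and a plain capturing group as group 2. Because [^\)]* is greedy, trailing whitespace still lands in the group, so the result is trimmed. Quotes are left in place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; not part of the original answer.
final class CssBackgroundUrls {

    // Assumed reading of the tolerant regex, written as a Java string literal.
    private static final Pattern BACKGROUND_URL = Pattern.compile(
            "background(-image)?: url[\\s]*\\([\\s]*([^\\)]*)[\\s]*\\)[\\s]*;");

    /** Returns the trimmed content of every background url(...) found in the CSS text. */
    static List<String> extract(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = BACKGROUND_URL.matcher(css);
        while (m.find()) {
            // Group 1 is the optional "-image" suffix; group 2 is the url(...) content.
            urls.add(m.group(2).trim());
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = String.join("\n",
                "background-image: url('test/test.gif');",
                "background: url(\"test2/test2.gif\");",
                "background-image: url(test3/test3.gif);",
                "background: url   ( test4/ test4.gif );",
                "background: url( \" test5/test5.gif\"   );");
        for (String url : extract(css)) {
            System.out.println("[" + url + "]");
        }
    }
}
```

Note that this only finds URLs in background and background-image declarations written with exactly one space after the colon; other properties and @import rules would need a looser prefix.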
