Athari

Reputation: 34275

Parsing CSS in C#: extracting all URLs

I need to get all URLs (url() expressions) from CSS files. For example:

b { background: url(img0) }
b { background: url("img1") }
b { background: url('img2') }
b { background: url( img3 ) }
b { background: url( "img4" ) }
b { background: url( 'img5' ) }
b { background: url (img6) }
b { background: url ("img7") }
b { background: url ('img8') }
{ background: url('noimg0) }
{ background: url(noimg1') }
/*b { background: url(noimg2) }*/
b { color: url(noimg3) }
b { content: 'url(noimg4)' }
@media screen and (max-width: 1280px) { b { background: url(img9) } }
b { background: url(img10) }

I need to get all img* URLs, but not noimg* URLs (invalid syntax or invalid property or inside comments).

I've tried using good old regular expressions. After some trial and error I got this:

private static IEnumerable<string> ParseUrlsRegex (string source)
{
    var reUrls = new Regex(@"(?nx)
        url \s* \( \s*
            (
                (?! ['""] )
                (?<Url> [^\)]+ )
                (?<! ['""] )
                |
                (?<Quote> ['""] )
                (?<Url> .+? )
                \k<Quote>
            )
        \s* \)");
    return reUrls.Matches(source)
        .Cast<Match>()
        .Select(match => match.Groups["Url"].Value);
}

That's one crazy regex, but it still doesn't work: it matches three invalid URLs (namely noimg2, noimg3 and noimg4). Furthermore, everyone will say that using regex to parse a complex grammar is wrong.

Let's try another approach. According to this question, the only viable option is ExCSS (others are either too simple or outdated). With ExCSS I got this:

    private static IEnumerable<string> ParseUrlsExCss (string source)
    {
        var parser = new StylesheetParser();
        parser.Parse(source);
        return parser.Stylesheet.RuleSets
            .SelectMany(i => i.Declarations)
            .SelectMany(i => i.Expression.Terms)
            .Where(i => i.Type == TermType.Url)
            .Select(i => i.Value);
    }

Unlike the regex solution, this one doesn't list invalid URLs. But it doesn't list some valid ones, namely img9 and img10. It looks like this is a known issue with some CSS syntax, and it can't be fixed without rewriting the whole library from scratch. The ANTLR rewrite seems to be abandoned.

Question: How to extract all URLs from CSS files? (I need to parse any CSS files, not only the one provided as an example above. Please don't check for "noimg" or assume one-line declarations.)

N.B. This is not a "tool recommendation" question, as any solution will be fine, be it a piece of code, a fix to one of the above solutions, a library or anything else; and I've clearly defined the function I need.

Upvotes: 9

Views: 5310

Answers (9)

Jonathan Wood

Reputation: 67175

RegEx is a very powerful tool. But when a bit more flexibility is needed, I prefer to just write a little code.

So for a non-RegEx solution, I came up with the following. Note that a bit more work would be needed to make this code more generic to handle any CSS file. For that, I would also use my text parsing helper class.

IEnumerable<string> GetUrls(string css)
{
    char[] trimChars = new char[] { '\'', '"', ' ', '\t', };

    foreach (var line in css.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries))
    {
        // Extract portion within curly braces (this version assumes all on one line)
        int start = line.IndexOf('{');
        int end = line.IndexOf('}', start + 1);
        if (start < 0 || end < 0)
            continue;
        start++; end--; // Remove braces

        // Get value portion
        start = line.IndexOf(':', start);
        if (start < 0)
            continue;

        // Extract value and trim whitespace and quotes
        string content = line.Substring(start + 1, end - start).Trim(trimChars);

        // Extract URL from url() value
        if (!content.StartsWith("url", StringComparison.InvariantCultureIgnoreCase))
            continue;
        start = content.IndexOf('(');
        end = content.IndexOf(')', start + 1);
        if (start < 0 || end < 0)
            continue;
        start++;
        content = content.Substring(start, end - start).Trim(trimChars);

        if (!content.StartsWith("noimg", StringComparison.InvariantCultureIgnoreCase))
            yield return content;
    }
}

UPDATE:

What you appear to be asking seems beyond the scope of a simple how-to question for Stack Overflow. I do not believe you will get satisfactory results using regular expressions. You will need some code to parse your CSS and handle all the special cases that come with it.

Since I've written a lot of parsing code and had a bit of time, I decided to play with this a bit. I wrote a simple CSS parser and wrote an article about it. You can read the article and download the code (for free) at A Simple CSS Parser.

My code parses a block of CSS and stores the information in data structures. My code separates and stores each property/value pair for each rule. However, a bit more work is still needed to get the URL from the property values. You will need to parse them from the property value.

The code I originally posted will give you a start on how you might approach this. But if you want a truly robust solution, then some more sophisticated code will be needed. You might want to take a look at my code to parse the CSS. I use techniques in that code, such as parsing a quoted value, that could be used to easily handle values such as url('img(1)').

I think this is a pretty good start. I could write the remaining code for you as well. But what's the fun in that. :)
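As a rough illustration of the quoted-value technique mentioned above, here is a minimal sketch of reading one url() token from a property value. This is not the code from the article: UrlValueParser and TryParseUrl are hypothetical names, CSS escape sequences are not handled, and an empty URL is treated as invalid for simplicity.

```csharp
using System;

public static class UrlValueParser
{
    // Tries to read a single url(...) token from the start of a property value.
    // Hypothetical helper for illustration; CSS escapes such as \) are not handled.
    public static bool TryParseUrl(string value, out string url)
    {
        url = null;
        string s = value.TrimStart();
        // No whitespace is accepted between "url" and "(".
        if (!s.StartsWith("url(", StringComparison.OrdinalIgnoreCase))
            return false;

        int pos = 4; // first character after "url("
        while (pos < s.Length && char.IsWhiteSpace(s[pos])) pos++;
        if (pos >= s.Length) return false;

        int start, end;
        char c = s[pos];
        if (c == '\'' || c == '"')
        {
            // Quoted value: take everything up to the matching quote,
            // so parentheses inside the quotes are kept.
            start = pos + 1;
            end = s.IndexOf(c, start);
            if (end < 0) return false; // unclosed quote
            pos = end + 1;
        }
        else
        {
            // Unquoted value: read up to whitespace or ')'.
            start = pos;
            while (pos < s.Length && s[pos] != ')' && !char.IsWhiteSpace(s[pos]))
            {
                if (s[pos] == '\'' || s[pos] == '"') return false; // stray quote
                pos++;
            }
            end = pos;
        }

        // Optional whitespace, then the closing parenthesis.
        while (pos < s.Length && char.IsWhiteSpace(s[pos])) pos++;
        if (pos >= s.Length || s[pos] != ')' || end == start) return false;

        url = s.Substring(start, end - start);
        return true;
    }
}
```

Called on "url('img(1)')" this should yield img(1), while "url('noimg0)" and "url(noimg1')" should be rejected. Applying it to every value of the properties you care about is left out here.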

Upvotes: 5

Sajith

Reputation: 856

You can try a pattern like this, which may be more helpful:

@import ([""'])(?<url>[^""']+)\1|url\(([""']?)(?<url>[^""')]+)\2\)
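For reference, a minimal sketch of applying this pattern from C#. The class and method names (ImportAndUrlPattern, Extract) are made up for the example. Note that the pattern itself does nothing about comments or property names, so on the sample from the question it would still pick up some noimg entries.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ImportAndUrlPattern
{
    // The pattern from this answer as a C# verbatim string
    // (inside @"..." a double quote is written as "").
    // The unnamed groups are numbered 1 and 2, so \1 and \2 refer to the quotes.
    private static readonly Regex ReUrls = new Regex(
        @"@import ([""'])(?<url>[^""']+)\1|url\(([""']?)(?<url>[^""')]+)\2\)");

    // Returns the "url" capture of every match, in document order.
    public static List<string> Extract(string css)
    {
        return ReUrls.Matches(css)
            .Cast<Match>()
            .Select(m => m.Groups["url"].Value)
            .ToList();
    }
}
```

For example, Extract("@import \"base.css\"; b { background: url('img2') }") should return base.css and img2.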

Or

http://www.c-sharpcorner.com/uploadfile/rahul4_saxena/reading-and-parsing-a-css-file-in-Asp-Net/

Upvotes: 1

alpha bravo

Reputation: 7948

This RegEx seems to solve the example provided (it is PCRE syntax; note that \K is not supported by .NET's Regex class):

background: url\s*\(\s*(["'])?\K\w+(?(1)(?=\1)|(?=\s*\)))(?!.*\*/)

Upvotes: 1

Roger Barreto

Reputation: 2284

For such a problem, a simpler approach could do the trick.

  1. Break all the CSS commands into lines (suppose the CSS is simplified); in this case I would break at the ";" or "}" character.

  2. Read all the occurrences inside url(*), even the wrong ones.

  3. Create a pipeline with the command pattern that detects which lines are really eligible

    • 3.1 Command1 (Detect comment)
    • 3.2 Command2 (Detect syntax error URL)
    • 3.3 ...
  4. With the OK lines flagged, extract the OK URLs

This is a simple approach and solves the problem efficiently, with no ultra-complex, unmanageable, magical regex.
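The steps above could be sketched roughly as follows. Everything here is an assumption made for illustration: the names (UrlPipeline, Rejections, HasQuoteError), the use of plain delegates in place of full command objects, and the two deliberately crude checks. It assumes simplified CSS in which a comment does not contain ";" or "}" before the url() it hides, and it does not check property names (that would be another step 3.x).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class UrlPipeline
{
    // Step 2: any url(...) occurrence, valid or not (stops at the first ')').
    private static readonly Regex ReCandidate = new Regex(@"url\(\s*(?<raw>[^)]*?)\s*\)");

    // Step 3: each check gets the text before the candidate and the raw
    // content of the parentheses; it returns true to reject the candidate.
    private static readonly List<Func<string, string, bool>> Rejections =
        new List<Func<string, string, bool>>
        {
            // 3.1 a comment was opened before the candidate and not closed yet
            (before, raw) => before.LastIndexOf("/*", StringComparison.Ordinal)
                           > before.LastIndexOf("*/", StringComparison.Ordinal),
            // 3.2 unbalanced or stray quotes
            (before, raw) => HasQuoteError(raw),
        };

    private static bool HasQuoteError(string raw)
    {
        // Strip one matching pair of surrounding quotes, if present.
        if (raw.Length >= 2 && (raw[0] == '\'' || raw[0] == '"') && raw[raw.Length - 1] == raw[0])
            raw = raw.Substring(1, raw.Length - 2);
        // Any quote that is left over counts as a syntax error.
        return raw.IndexOfAny(new[] { '\'', '"' }) >= 0;
    }

    public static List<string> Extract(string css)
    {
        var urls = new List<string>();
        // Step 1: break the CSS into chunks at ';' or '}'.
        foreach (string chunk in css.Split(';', '}'))
        {
            foreach (Match m in ReCandidate.Matches(chunk))
            {
                string before = chunk.Substring(0, m.Index);
                string raw = m.Groups["raw"].Value;
                // Step 3: keep the candidate only if no check rejects it.
                if (Rejections.Any(reject => reject(before, raw)))
                    continue;
                // Step 4: strip the surrounding quotes, if any.
                string url = raw.Trim('\'', '"');
                if (url.Length > 0) urls.Add(url);
            }
        }
        return urls;
    }
}
```

Keeping the checks in a list means a new rule can be added without touching the extraction loop, which is the point of the pipeline idea.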

Upvotes: 1

Athari

Reputation: 34275

Finally got Alba.CsCss, my port of CSS parser from Mozilla Firefox, working.

First and foremost, the question contains two errors:

  1. url (img) syntax is incorrect, because space is not allowed between url and ( in CSS grammar. Therefore, "img6", "img7" and "img8" should not be returned as URLs.

  2. An unclosed quote in url function (url('img)) is a serious syntax error; web browsers, including Firefox, do not seem to recover from it and simply skip the rest of the CSS file. Therefore, requiring the parser to return "img9" and "img10" is unnecessary (but necessary if the two problematic lines are removed).

With CsCss, there are two solutions.

The first solution is to rely just on the tokenizer CssScanner.

List<string> uris = new CssLoader().GetUris(source).ToList();

This will return all "img" URLs (except those mentioned in error #1 above), but will also include "noimg3", as property names are not checked.

The second solution is to properly parse the CSS file. This will most closely mimic the behavior of browsers (including stopping parsing after an unclosed quote).

var css = new CssLoader().ParseSheet(source, SheetUri, BaseUri);
List<string> uris = css.AllStyleRules
    .SelectMany(styleRule => styleRule.Declaration.AllData)
    .SelectMany(prop => prop.Value.Unit == CssUnit.List
        ? prop.Value.List : new[] { prop.Value })
    .Where(value => value.Unit == CssUnit.Url)
    .Select(value => value.OriginalUri)
    .ToList();

If the two problematic lines are removed, this will return all correct "img" URLs.

(The LINQ query is complex, because background-image property in CSS3 can contain a list of URLs.)

Upvotes: 6

Casimir et Hippolyte

Reputation: 89547

This solution skips comments and handles background-image. It also handles background, which (unlike background-image) can contain other values such as background-color, background-position or repeat. This is why I have added these cases: noimg5, img11, img12.

The data:

string subject =
    @"b { background: url(img0) }
      b { background: url(""img1"") }
      b { background: url('img2') }
      b { background: url( img3 ) }
      b { background: url( ""img4"" ) }
      b { background: url( 'img5' ) }
      b { background: url (img6) }
      b { background: url (""img7"") }
      b { background: url ('img8') }
      { background: url('noimg0) }
      { background: url(noimg1') }
      /*b { background: url(noimg2) }*/
      b { color: url(noimg3) }
      b { content: 'url(noimg4)' }
      @media screen and (max-width: 1280px) { b { background: url(img9) } }
      b { background: url(img10) }
      b { background: #FFCC66 url('img11') no-repeat }
      b { background-image: url('img12'); }
      b { background-image: #FFCC66 url('noimg5') }";

The pattern:

Comments are skipped because they are matched first. If a comment is left open (without a closing */), all the content after it is treated as a comment: (?>\*/|$).

The result is stored in the named capture url.

string pattern = @"
        /\*  (?> [^*] | \*(?!/) )*  (?>\*/|$)  # comments
      |
        (?<=
            background
            (?>
                -image \s* :     # optional '-image'
              |
                \s* :
                (?>              # allowed content before url 
                    \s*
                    [^;{}u\s]+   # all that is not a ; { } u
                    \s           # must be followed by one space at least
                )?
            )

            \s* url \s* \( \s*
            ([""']?)             # optional quote (single or double) in group 1
        )
        (?<url> [^""')\s]+ )     # named capture 'url' with an url inside
        (?=\1\s*\))              # must be followed by group 1 content (optional quote)
              ";
RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace;
Match m = Regex.Match(subject, pattern, options);
List<string> urls = new List<string>();
while (m.Success)
{
    string url = m.Groups["url"].ToString();
    if (url!="") {
        urls.Add(url);
        Console.WriteLine(url);
    }
    m = m.NextMatch();
}

Upvotes: 1

asontu

Reputation: 4639

You need negative lookbehind to see if there is no /* without a following */ like this:

(?<!\/\*([^*]|\*[^\/])*)

This looks unreadable, but it means:

(?<! -> preceding this match may not be:

\/\* -> /* (with escape slashes) followed by

([^*] -> any character that isn't *

|\*[^\/]) -> or a character that is *, but is itself followed by anything that isn't /

*) -> zero or more repetitions of the group above (a character that isn't *, or a * not followed by /); the ) then closes the negative lookbehind

And you need positive lookbehind to see whether the property being set is a css property that accepts url() values. If you are only interested in background: and background-image: for instance, this would be the entire regex:

(?<!\/\*(?:[^*]|\*[^\/])*)
(?<=background(?:-image)?:\s*)
url\s*\(\s*(('|")?)[^\n'"]+\1\s*\)

Since this version requires the css property background: or background-image: to precede the url(), it will not detect the 'url(noimg4)'. You could use simple pipes to add more accepted css properties: (?<=(?:border-image|background(?:-image)?):\s*)

I've used \1 rather than \k<Quote> because I'm not familiar with that syntax, which means you need the ?: to not capture unwanted subgroups. As far as I can test this works.

Finally I used [^\n'"] for the actual url because I understand from your comments that url('img(1)') should work and [^\)] from your OP won't parse that.
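A minimal sketch of trying the combined regex from C#, under a few assumptions of mine: the three parts are simply concatenated, the inner group of the first lookbehind is written as non-capturing so that \1 refers to the quote group, and a named group Url is added around the URL part so it can be read back (the regex as given only matches the whole url(...) expression). The class and method names are made up.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindUrls
{
    private static readonly Regex ReUrl = new Regex(
        @"(?<!/\*(?:[^*]|\*[^/])*)" +        // not inside an unclosed /* comment
        @"(?<=background(?:-image)?:\s*)" +  // right after background: or background-image:
        @"url\s*\(\s*(('|"")?)" +            // url( and an optional quote (group 1)
        @"(?<Url>[^\n'""]+)" +               // the URL (named group added for this sketch)
        @"\1\s*\)");                         // the same quote again, then )

    public static List<string> Extract(string css)
    {
        return ReUrl.Matches(css)
            .Cast<Match>()
            // Unquoted URLs can pick up trailing whitespace, so trim it.
            .Select(m => m.Groups["Url"].Value.Trim())
            .ToList();
    }
}
```

Because .NET numbers unnamed groups before named ones, adding the Url group does not change what \1 points at. Since [^\n'"]+ is greedy, an unquoted URL followed by another ")" later on the same line would be matched too far, so this is only a starting point.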

Upvotes: 1

AlliterativeAlice

Reputation: 12577

Probably not the most elegant possible solution, but seems to do the job you need done.

public static List<string> GetValidUrlsFromCSS(string cssStr)
{
    //Enter properties that can validly contain a URL here (in lowercase):
    List<string> validProperties = new List<string>(new string[] { "background", "background-image" });

    List<string> validUrls = new List<string>();
    //We'll use your regex for extracting the valid URLs
    var reUrls = new Regex(@"(?nx)
        url \s* \( \s*
            (
                (?! ['""] )
                (?<Url> [^\)]+ )
                (?<! ['""] )
                |
                (?<Quote> ['""] )
                (?<Url> .+? )
                \k<Quote>
            )
        \s* \)");
    //First, remove all the comments (Singleline lets '.' span multi-line comments)
    cssStr = Regex.Replace(cssStr, "\\/\\*.*?\\*\\/", String.Empty, RegexOptions.Singleline);
    //Next remove all the property groups with no selector
    string oldStr;
    do
    {
        oldStr = cssStr;
        cssStr = Regex.Replace(cssStr, "(^|{|})(\\s*{[^}]*})", "$1");
    } while (cssStr != oldStr);
    //Get properties
    var matches = Regex.Matches(cssStr, "({|;)([^:{;]+:[^;}]+)(;|})");
    foreach (Match match in matches)
    {
        string matchVal = match.Groups[2].Value;
        string[] matchArr = matchVal.Split(':');
        if (validProperties.Contains(matchArr[0].Trim().ToLower()))
        {
            //Since this is a valid property, extract the URL (if there is one)
            MatchCollection validUrlCollection = reUrls.Matches(matchVal);
            if (validUrlCollection.Count > 0)
            {
                validUrls.Add(validUrlCollection[0].Groups["Url"].Value);
            }
        }
    }
    return validUrls;
}

Upvotes: 1

Piotr Stapp

Reputation: 19830

In my opinion, your RegExp is too complicated. A working one is the following: url\s*[(][\s'""]*(?<Url>img[\w]*)[\s'""]*[)]. I will try to explain what I'm searching for:

  1. Start with url
  2. Then any whitespace after it (\s*)
  3. Next is exactly one left bracket ([(])
  4. Then 0 or more characters such as whitespace, ", ' ([\s'""]*)
  5. Next the "URL", that is, something starting with img and followed by zero or more alphanumeric characters ((?<Url>img[\w]*))
  6. Again 0 or more characters such as whitespace, ", ' ([\s'""]*)
  7. And end with a right bracket ([)])

The full working code:

        var source =
            "b { background: url(img0) }\n" +
            "b { background: url(\"img1\") }\n" +
            "b { background: url(\'img2\') }\n" +
            "b { background: url( img3 ) }\n" +
            "b { background: url( \"img4\" ) }\n" +
            "b { background: url( \'img5\' ) }\n" +
            "b { background: url (img6) }\n" +
            "b { background: url (\"img7\") }\n" +
            "b { background: url (\'img8\') }\n" +
            "{ background: url(\'noimg0) }\n" +
            "{ background: url(noimg1\') }\n" +
            "/*b { background: url(noimg2) }*/\n" +
            "b { color: url(noimg3) }\n" +
            "b { content: \'url(noimg4)\' }\n" +
            "@media screen and (max-width: 1280px) { b { background: url(img9) } }\n" +
            "b { background: url(img10) }";


        string strRegex = @"url\s*[(][\s'""]*(?<Url>img[\w]*)[\s'""]*[)]";
        var reUrls = new Regex(strRegex);

        var result = reUrls.Matches(source)
                           .Cast<Match>()
                           .Select(match => match.Groups["Url"].Value).ToArray();
        bool isOk = true;
        for (var i = 0; i <= 10; i++)
        {
            if (!result.Contains("img" + i))
            {
                Console.WriteLine("Missing img"+i);
                isOk = false;
            }
        }
        for (var i = 0; i <= 4; i++)
        {
            if (result.Contains("noimg" + i))
            {
                Console.WriteLine("Redundant noimg" + i);
                isOk = false;
            }
        }
        if (isOk)
        {
            Console.WriteLine("Yes. It is ok :). The result is:");
            foreach (var s in result)
            {
                Console.WriteLine(s);
            }

        }
        Console.ReadLine();

Upvotes: 2
