Reputation: 3473

Regex without escaping Characters - Problems

I found some solutions for my problem, which is quite simple: I have a string that looks like this:

"\r\nContent-Disposition: form-data; name=\"ctl00$cphMainContent$grid$ctl03$ucPicture$ctl00\""

My goal is to break it down into a Dictionary of values, like: Key = "name", Value = "ctl..."

My approach was to split it by "\r\n" and then by the equals sign or the colon. This worked fine, but then some funny tester uploaded a file whose name contained all allowed characters, which made the string look like this:

"\r\nContent-Disposition: form-data; name=\"ctl00_cphMainContent_grid_ctl03_ucPicture_btnUpload$fileUpload\"; filename=\"C:\\Users\\matthias.mueller\\Desktop\\- ie+![]{}_-´;,.$¨@#ç %&()=~^`'.jpg\"\r\nContent-Type: image/jpeg"

Of course, the simple splitting doesn't work anymore, since it now splits the filename too. I corrected this by reading out "filename=", escaping the characters I want to split on, and then building a regex.

Now comes my problem: I found two regex samples that could do the job for the equals sign, the semicolon and the colon. One is:

[^\\]=

The other one I found was:

(?<!\\\\)=

The problem is that the first one doesn't just split on the equals sign; it also removes the character before it, which means my key in the Dictionary is "nam" instead of "name".

The second one works fine in this respect, but it still splits on the escaped equals sign in the filename.

Is my approach to this problem even workable? Would there be a better solution? And why is the first regex cutting off a character?

Edit: To avoid confusion, my escaped String looks like this: "Content-Disposition: form-data; name=\"ctl00_cphMainContent_grid_ctl03_ucPicture_btnUpload$fileUpload\"; filename=\"C\:\Users\matthias.mueller\Desktop\- ie+![]{}_-´\;,.$¨@#ç %&()\=~^`'.jpg\""

So basically I want to split by the equals sign EXCEPT the escaped ones. By the way: the string here shows only one \, but there are 2.

Edit 2: OK seems like I have a working solution, but it's so ugly:

Dictionary<string, string> ParseHeader(byte[] bytes, int pos)
    {
        Dictionary<string, string> items;
        string header;
        string[] headerLines;
        int start;
        int end;

        string input = _encoding.GetString(bytes, pos, bytes.Length - pos);

        start = input.IndexOf("\r\n", 0);
        if (start < 0) return null;

        end = input.IndexOf("\r\n\r\n", start);
        if (end < 0) return null;

        WriteBytes(false, bytes, pos, end + 4 - 0); // Write the header to the form content

        header = input.Substring(start, end - start);

        items = new Dictionary<string, string>();

        headerLines = Regex.Split(header, "\r\n");

        Regex regLineParts = new Regex(@"(?<!\\\\);");
        Regex regColon = new Regex(@"(?<!\\\\):");
        Regex regEqualSign = new Regex(@"(?<!\\\\)=");


        foreach (string hl in headerLines)
        {
            string workString = hl;
            //Escape the Semicolon in filename
            if (hl.Contains("filename"))
            {
                String orig = hl.Substring(hl.IndexOf("filename=\"") + 10);
                orig = orig.Substring(0, orig.IndexOf('"'));
                string toReplace = orig;
                // Escape the separators; the nested Replace(toReplace, ...) was redundant
                // and throws an ArgumentException when the filename is empty
                toReplace = toReplace.Replace(";", @"\\;");
                toReplace = toReplace.Replace(":", @"\\:");
                toReplace = toReplace.Replace("=", @"\\=");
                workString = hl.Replace(orig, toReplace);
            }

            string[] lineParts = regLineParts.Split(workString);

            for (int i = 0; i < lineParts.Length; i++)
            {
                string[] p;

                if (i == 0)
                    p = regColon.Split(lineParts[i]);
                else
                    p = regEqualSign.Split(lineParts[i]);

                if (p.Length == 2)
                {
                    string orig = p[0];
                    orig = orig.Replace(@"\\;", ";");
                    orig = orig.Replace(@"\\:", ":");
                    orig = orig.Replace(@"\\=", "=");
                    p[0] = orig;

                    orig = p[1];
                    orig = orig.Replace(@"\\;", ";");
                    orig = orig.Replace(@"\\:", ":");
                    orig = orig.Replace(@"\\=", "=");
                    p[1] = orig;

                    items.Add(p[0].Trim(), p[1].Trim());
                }
            }
        }

        return items;
    }

Needs some further testing.

Upvotes: 0

Answers (3)

Steve Cooper

Reputation: 21480

I had a go at writing a parser for you. It handles literal strings, like "here is a string", as the values in name-value pairs. I've also written a few tests, and the last shows an '=' character inside a literal string. It also handles escaping quotes (") inside literal strings by escaping as \" -- I'm not sure if this is right, but you could change it.

A quick explanation. I first find anything that looks like a literal string and replace it with a value like PLACEHOLDER8230498234098230498. This means the whole thing is now literal name-value pairs; eg

key="value"

becomes

key=PLACEHOLDER8230498234098230498

The original string value is stored off in the literalStrings dictionary for later.

So now we split on semicolons (to get key=value strings) and then on equals, to get the proper key/value pairs.

Then I substitute the placeholder values back in before returning the result.

public class HttpHeaderParser
{
    public NameValueCollection Parse(string header)
    {
        var result = new NameValueCollection();

        // 'register' any string values;
        var stringLiteralRx = new Regex(@"""(?<content>(\\""|[^\""])+?)""", RegexOptions.IgnorePatternWhitespace);
        var equalsRx = new Regex("=", RegexOptions.IgnorePatternWhitespace);
        var semiRx = new Regex(";", RegexOptions.IgnorePatternWhitespace);

        Dictionary<string, string> literalStrings = new Dictionary<string, string>();
        var cleanedHeader = stringLiteralRx.Replace(header, m =>
        {
            var replacement = "PLACEHOLDER" + Guid.NewGuid().ToString("N");
            var stringLiteral = m.Groups["content"].Value.Replace("\\\"", "\"");
            literalStrings.Add(replacement, stringLiteral);
            return replacement;
        });

        // now it's safe to split on semicolons to get name-value pairs
        var nameValuePairs = semiRx.Split(cleanedHeader);
        foreach(var nameValuePair in nameValuePairs)
        {
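            // split on the first '=' only, and skip segments without one (e.g. "form-data")
            var nameAndValuePieces = equalsRx.Split(nameValuePair, 2);
            if (nameAndValuePieces.Length < 2) continue;
            var name = nameAndValuePieces[0].Trim();
            var value = nameAndValuePieces[1].Trim();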
            string replacementValue;
            if (literalStrings.TryGetValue(value, out replacementValue))
            {
                value = replacementValue;
            }
            result.Add(name, value);
        }

        return result;

    }
}

There's every chance there are some proper bugs in it.

Here are some unit tests you should incorporate, too:

    [TestMethod]
    public void TestMethod1()
    {
        var tests = new[] {
            new { input=@"foo=bar; baz=quux", expected = @"foo|bar^baz|quux"},
            new { input=@"foo=bar;baz=""quux""", expected = @"foo|bar^baz|quux"},
            new { input=@"foo=""bar"";baz=""quux""", expected = @"foo|bar^baz|quux"},
            new { input=@"foo=""b,a,r"";baz=""quux""", expected = @"foo|b,a,r^baz|quux"},
            new { input=@"foo=""b;r"";baz=""quux""", expected = @"foo|b;r^baz|quux"},
            new { input=@"foo=""b\""r"";baz=""quux""", expected = @"foo|b""r^baz|quux"},
            new { input=@"foo=""b=r"";baz=""quux""", expected = @"foo|b=r^baz|quux"},
        };

        var parser = new HttpHeaderParser();
        foreach(var test in tests)
        {
            var actual = parser.Parse(test.input);
            var actualAsString = String.Join("^", actual.Keys.Cast<string>().Select(k => string.Format("{0}|{1}", k, actual[k])));
            Assert.AreEqual(test.expected, actualAsString);
        }
    }

Upvotes: 1

Steve Cooper

Reputation: 21480

Looks to me like you'll need a bit more of a solid parser for this than a regex split. According to this page the name/value pairs can either be 'raw';

x=1

or quoted;

x="foo bar baz"

So you'll need to look for a solution that not only splits on the equals, but ignores any equals inside;

x="y=z"
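As a rough illustration of that idea, here is a minimal sketch that matches whole key/value pairs instead of splitting, so separators inside a quoted value are never seen as separators. The class name HeaderParams and the group names are made up for this example, and only \" is treated as an escape inside quotes (so Windows paths with backslashes survive); that is an assumption, not something taken from the spec.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any framework.
public static class HeaderParams
{
    // key = "quoted value"   or   key = raw value up to the next ';'
    private static readonly Regex PairRx = new Regex(
        @"(?<key>[^\s=;]+)\s*=\s*(?:""(?<qval>(?:\\.|[^""\\])*)""|(?<rval>[^;]*))");

    public static Dictionary<string, string> Parse(string line)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in PairRx.Matches(line))
        {
            string value = m.Groups["qval"].Success
                ? m.Groups["qval"].Value.Replace("\\\"", "\"") // unescape \" only
                : m.Groups["rval"].Value.Trim();
            result[m.Groups["key"].Value] = value;
        }
        return result;
    }
}
```

With this, HeaderParams.Parse("x=\"y=z\"; a=1") should give x mapped to y=z and a mapped to 1, and segments without an equals sign (like form-data) are simply skipped.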

It might be that there is a better or more managed way for you to access this info. If you are using a classic ASP.NET WebForms FileUpload control, you can access the filename using the properties of the control, like

FileUpload1.HasFile
FileUpload1.FileName

If you're using MVC, you can use the HttpPostedFileBase class as a parameter to the action method. See this answer

[HttpPost]
public ActionResult Index(HttpPostedFileBase file)
{
    // Verify that the user selected a file
    if (file != null && file.ContentLength > 0) 
    {
        // extract only the filename
        var fileName = Path.GetFileName(file.FileName);
        // store the file inside ~/App_Data/uploads folder
        var path = Path.Combine(Server.MapPath("~/App_Data/uploads"), fileName);
        file.SaveAs(path);
    }
    // redirect back to the index action to show the form once again
    return RedirectToAction("Index");        
}

Upvotes: 1

Kendall Frey

Reputation: 44326

This:

(?<!\\\\)=

matches = not preceded by \\.

It should be:

(?<!\\)=

(Make sure you use @ (verbatim) strings for the regex, to avoid confusion)
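As for why the first pattern from the question drops a character: [^\\]= matches the equals sign together with the character in front of it, so Split removes both, while the lookbehind only checks the preceding character without consuming it. A small sketch to see the difference (the class and method names are made up for illustration, and the input uses a single backslash as the escape):

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, just to compare the two patterns.
public static class SplitDemo
{
    // Consumes the non-backslash character before '=', so it is lost from the key
    public static string[] SplitConsuming(string input)
    {
        return Regex.Split(input, @"[^\\]=");
    }

    // Only asserts that no backslash precedes '='; nothing extra is consumed
    public static string[] SplitLookbehind(string input)
    {
        return Regex.Split(input, @"(?<!\\)=");
    }
}
```

For the input name=a\=b, SplitConsuming returns "nam" and "a\=b", while SplitLookbehind returns "name" and "a\=b".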

Upvotes: 0
