haughtonomous

Reputation: 4850

Regex doesn't give me expected result

Okay, I give up - time to call upon the regex gurus for some help.

I'm trying to validate CSV file contents, just to see if it looks like the expected valid CSV data. I'm not trying to validate all possible CSV forms, just that it "looks like" CSV data and isn't binary data, a code file or whatever.

Each line of data comprises comma-separated words, each word comprising a-z, 0-9, and a small number of punctuation chars, namely - and _. There may be several lines in the file. That's it.

Here's my simple code:

const string dataWord = @"[a-z0-9_\-]+";
const string dataLine = "("+dataWord+@"\s*,\s*)*"+dataWord;
const string csvDataFormat = "("+dataLine+") |  (("+dataLine+@"\r\n)*"+dataLine +")";

Regex validCSVDataPattern = new Regex(csvDataFormat, RegexOptions.IgnoreCase);
protected override bool IsCorrectDataFormat(string fileContents)
{
    return validCSVDataPattern.IsMatch(fileContents);
}

This gives me a regex pattern of

(([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+) |  ((([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+\r\n)*([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+)

However, if I present this with a block of, say, C# code, the regex parser says it is a match. How is that? The C# code doesn't look anything like my CSV pattern (it has punctuation other than _ and -, for a start).

Can anyone point out my obvious error? Let me repeat - I am not trying to validate all possible CSV forms, just my simple subset.

Upvotes: 0

Views: 336

Answers (4)

Alan Moore

Reputation: 75222

I think this is what you're looking for:

@"(?in)^[a-z0-9_-]+( *, *[a-z0-9_-]+)*([\r\n]+[a-z0-9_-]+( *, *[a-z0-9_-]+)*)*$"

The noteworthy changes are:

  • Added anchors (^ and $), because the regex is totally pointless without them
  • Removed spaces (which have to match literal spaces, and I don't think that's what you intended)
  • Replaced the \s in every occurrence of \s* with a literal space (because \s can match any whitespace character, and you only want to match actual spaces in those spots)

The basic structure of your regex looked pretty good until that | came along and bollixed things up. ;)

P.S. In case you're wondering, (?in) is an inline modifier that sets IgnoreCase and ExplicitCapture modes.
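A minimal sketch of how this pattern might be exercised, assuming a small console program. The class name CsvShapeDemo and the sample strings are made up for illustration. Because (?in) sets the options inline, no RegexOptions argument is passed.

```csharp
using System;
using System.Text.RegularExpressions;

static class CsvShapeDemo
{
    // The pattern from this answer; (?in) turns on IgnoreCase and ExplicitCapture inline.
    static readonly Regex ValidCsv = new Regex(
        @"(?in)^[a-z0-9_-]+( *, *[a-z0-9_-]+)*([\r\n]+[a-z0-9_-]+( *, *[a-z0-9_-]+)*)*$");

    static void Main()
    {
        // Hypothetical sample inputs: two CSV lines, and one line of C# code.
        string csv = "alpha, beta ,gamma\r\none_1,two-2";
        string code = "var result = calculate(foo, bar);";

        Console.WriteLine(ValidCsv.IsMatch(csv));   // True
        Console.WriteLine(ValidCsv.IsMatch(code));  // False
    }
}
```

One thing to watch: in .NET, $ matches at the very end of the input or just before a single final \n, so input that ends with \r\n would be rejected by this pattern as written. Trimming trailing line breaks before matching is one possible way around that.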

Upvotes: 0

unlimit

Reputation: 3752

I came up with this regex:

^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$

Tests

asbc_- ,   khkhkjh,    lkjlkjlkj_-,     j : PASS
asbc,                                     : FAIL
asbc_-,khkhkjh,lkjlkjlk909j_-,j           : PASS

If you want to match empty lines like ,,, or when some values are blank like ,abcd,, use

^([a-z0-9_\-]*)(\s*)(,\s*[a-z0-9_\-]*)*$

Loop through all the lines to see if the file is ok:

const string dataLine = @"^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$";
Regex validCSVDataPattern = new Regex(dataLine, RegexOptions.IgnoreCase);
protected override bool IsCorrectDataFormat(string fileContents)
{
    // Split on either Windows or Unix line endings and check each line separately.
    string[] lines = fileContents.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);

    foreach (var line in lines)
    {
        // A single non-matching line (including an empty one) rejects the file.
        if (!validCSVDataPattern.IsMatch(line))
            return false;
    }

    return true;
}

Upvotes: 0

ΩmegaMan

Reputation: 31606

Here is a better pattern, which looks for one or more CSV groups, such as XXX, or yyy, on each line:

^([\w\s_\-]*,?)+$

^ - Start of each line (this requires RegexOptions.Multiline; without it, ^ matches only at the start of the input)

( - a CSV match group start

[\w\s_\-]* - Valid characters: \w (letters, digits, and underscore), \s (any whitespace), and - in each CSV value

,? - maybe a comma

)+ - End of the csv match group, 1 to many of these expected.

That will validate a whole file, line by line for a basic CSV structure and allow for empty ,, situations.

Upvotes: 1

Jon

Reputation: 437336

Your regular expression is missing the ^ (beginning of line) and $ (end of line) anchors. This means that it would match any text that contains what is described by the expression, even if the text contains other completely unrelated parts.

For example, this text matches the expression:

foo, bar

and therefore this text also matches:

var result = calculate(foo, bar);

You can see where this is going.

Add ^ at the beginning and $ at the end of csvDataFormat to get the behavior you expect.
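The effect of the anchors can be seen in a small sketch like the one below. It reuses the dataWord and dataLine definitions from the question; the class name AnchorDemo and the (?: ... ) wrapper around the pattern are my own assumptions, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // Same building blocks as in the question.
    const string DataWord = @"[a-z0-9_\-]+";
    const string DataLine = "(" + DataWord + @"\s*,\s*)*" + DataWord;

    static void Main()
    {
        // Without anchors, a match anywhere inside the input is enough.
        var unanchored = new Regex(DataLine, RegexOptions.IgnoreCase);

        // With anchors, the whole input has to fit the pattern.
        // The non-capturing group keeps ^ and $ applied to the entire pattern.
        var anchored = new Regex("^(?:" + DataLine + ")$", RegexOptions.IgnoreCase);

        string code = "var result = calculate(foo, bar);";

        Console.WriteLine(unanchored.IsMatch(code));      // True
        Console.WriteLine(anchored.IsMatch(code));        // False
        Console.WriteLine(anchored.IsMatch("foo, bar"));  // True
    }
}
```

The grouping matters for csvDataFormat in particular, because that pattern contains a top-level | alternation. If ^ and $ are simply concatenated onto it, ^ applies only to the first alternative and $ only to the second, so wrapping the whole pattern in a group before anchoring is likely what you want. The literal spaces around the | would also still have to match actual spaces.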

Upvotes: 4
