Reputation: 4850
Okay, I give up - time to call upon the regex gurus for some help.
I'm trying to validate CSV file contents, just to see if it looks like the expected valid CSV data. I'm not trying to validate all possible CSV forms, just that it "looks like" CSV data and isn't binary data, a code file or whatever.
Each line of data comprises comma-separated words, each word comprising a-z, 0-9, and a small number of punctuation chars, namely - and _. There may be several lines in the file. That's it.
Here's my simple code:
const string dataWord = @"[a-z0-9_\-]+";
const string dataLine = "("+dataWord+@"\s*,\s*)*"+dataWord;
const string csvDataFormat = "("+dataLine+") | (("+dataLine+@"\r\n)*"+dataLine +")";
Regex validCSVDataPattern = new Regex(csvDataFormat, RegexOptions.IgnoreCase);
protected override bool IsCorrectDataFormat(string fileContents)
{
return validCSVDataPattern.IsMatch(fileContents);
}
This gives me a regex pattern of
(([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+) | ((([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+\r\n)*([a-z0-9_\-]+\s*,\s*)*[a-z0-9_\-]+)
However, if I present this with a block of, say, C# code, the regex parser says it is a match. How is that? The C# code doesn't look anything like my CSV pattern (it has punctuation other than _ and -, for a start).
Can anyone point out my obvious error? Let me repeat - I am not trying to validate all possible CSV forms, just my simple subset.
Upvotes: 0
Views: 336
Reputation: 75222
I think this is what you're looking for:
@"(?in)^[a-z0-9_-]+( *, *[a-z0-9_-]+)*([\r\n]+[a-z0-9_-]+( *, *[a-z0-9_-]+)*)*$"
The noteworthy changes are:
I added the anchors ^ and $, because the regex is totally pointless without them.
I replaced \s in every occurrence of \s* with a literal space (because \s can match any whitespace character, and you only want to match actual spaces in those spots).
The basic structure of your regex looked pretty good until that | came along and bollixed things up. ;)
p.s., In case you're wondering, (?in) is an inline modifier that sets IgnoreCase and ExplicitCapture modes.
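As a rough sketch of how the pattern above could be wired into a check, the snippet below wraps it in a small helper. The class name AnchoredCsvCheck and the method name LooksLikeCsv are made up for illustration; only the pattern string comes from the answer.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names here are illustrative, not from the question.
public static class AnchoredCsvCheck
{
    // Pattern copied from the answer. (?in) switches on IgnoreCase and
    // ExplicitCapture inline, so no RegexOptions argument is passed.
    private static readonly Regex Pattern = new Regex(
        @"(?in)^[a-z0-9_-]+( *, *[a-z0-9_-]+)*([\r\n]+[a-z0-9_-]+( *, *[a-z0-9_-]+)*)*$");

    // True only when the whole input, start to end, fits the simple CSV subset.
    public static bool LooksLikeCsv(string fileContents)
    {
        return Pattern.IsMatch(fileContents);
    }
}
```

With this in place, two lines of data such as "foo, bar\r\nBaz_1,qux-2" should pass, while a line of C# such as "var result = calculate(foo, bar);" should be rejected because the anchors force the entire input to fit.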
Upvotes: 0
Reputation: 3752
I came up with this regex:
^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$
Tests
asbc_- , khkhkjh, lkjlkjlkj_-, j : PASS
asbc, : FAIL
asbc_-,khkhkjh,lkjlkjlk909j_-,j : PASS
If you want to match empty lines like ,,, or when some values are blank like ,abcd,, use
^([a-z0-9_\-]*)(\s*)(,\s*[a-z0-9_\-]*)*$
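To compare the two variants side by side, here is a minimal sketch that compiles both. The class name LinePatterns and the field names Strict and Lenient are invented for this example; the pattern strings are the ones given above, and RegexOptions.IgnoreCase is assumed as in the question.

```csharp
using System.Text.RegularExpressions;

// Hypothetical container for the two per-line patterns; names are illustrative.
public static class LinePatterns
{
    // Strict form: every field must contain at least one character.
    public static readonly Regex Strict = new Regex(
        @"^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$", RegexOptions.IgnoreCase);

    // Lenient form: each + on a field becomes *, so fields may be empty.
    public static readonly Regex Lenient = new Regex(
        @"^([a-z0-9_\-]*)(\s*)(,\s*[a-z0-9_\-]*)*$", RegexOptions.IgnoreCase);
}
```

For example, LinePatterns.Strict.IsMatch("asbc,") is false because the field after the comma is empty, whereas LinePatterns.Lenient.IsMatch("asbc,") is true.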
Loop through all the lines to see if the file is ok:
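// Verbatim string (@) so that \- and \s reach the regex engine unchanged.
const string dataLine = @"^([a-z0-9_\-]+)(\s*)(,\s*[a-z0-9_\-]+)*$";
Regex validCSVDataPattern = new Regex(dataLine, RegexOptions.IgnoreCase);
protected override bool IsCorrectDataFormat(string fileContents)
{
    // Note: a trailing newline yields a final empty string, which this pattern rejects.
    string[] lines = fileContents.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
    foreach (var line in lines)
    {
        // Every line has to match; stop at the first one that does not.
        if (!validCSVDataPattern.IsMatch(line))
            return false;
    }
    return true;
}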
Upvotes: 0
Reputation: 31606
Here is a better pattern which looks for CSV groups such as XXX, or yyy for one to many in each line:
^([\w\s_\-]*,?)+$
^ - Start of each line
( - a CSV match group start
[\w\s_\-]* - Valid characters: \w (letters, digits and _), whitespace (\s), and - in each CSV value
,? - maybe a comma
)+ - End of the CSV match group, 1 to many of these expected.
That will validate a whole file, line by line for a basic CSV structure and allow for empty ,, situations.
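A small sketch of trying this pattern out follows. It assumes no RegexOptions are set (in particular, not RegexOptions.Multiline), in which case ^ and $ anchor the whole input and the \s in the character class is what lets the match run across line breaks. The class name LooseCsvCheck and method name LooksLikeCsv are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class LooseCsvCheck
{
    // Pattern from the answer, compiled with default options (an assumption).
    private static readonly Regex Pattern = new Regex(@"^([\w\s_\-]*,?)+$");

    // True when the input consists only of word characters, whitespace,
    // hyphens and commas.
    public static bool LooksLikeCsv(string text)
    {
        return Pattern.IsMatch(text);
    }
}
```

Under those assumptions "a,b\r\nc,,d" is accepted, including the empty field, and "foo(bar)" is rejected. Because the pattern nests a * inside a +, it may backtrack heavily on long inputs that fail to match, so it is probably worth testing on realistic file sizes before relying on it.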
Upvotes: 1
Reputation: 437336
Your regular expression is missing the ^ (beginning of line) and $ (end of line) anchors. This means that it would match any text that contains what is described by the expression, even if the text contains other completely unrelated parts.
For example, this text matches the expression:
foo, bar
and therefore this text also matches:
var result = calculate(foo, bar);
You can see where this is going.
Add ^ at the beginning and $ at the end of csvDataFormat to get the behavior you expect.
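The effect of the anchors can be seen in a minimal sketch that uses only the single-line dataLine pattern from the question. The class name AnchorDemo and its field names are invented for illustration. Note that the full csvDataFormat also has a top-level | with literal spaces around it, so there the anchors would have to wrap a group around the whole alternation; that part is deliberately left out of this sketch.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; only the dataWord/dataLine strings come from the question.
public static class AnchorDemo
{
    private const string DataWord = @"[a-z0-9_\-]+";
    private const string DataLine = "(" + DataWord + @"\s*,\s*)*" + DataWord;

    // No anchors: succeeds if any substring of the input fits the pattern.
    public static readonly Regex Unanchored =
        new Regex(DataLine, RegexOptions.IgnoreCase);

    // With ^ and $: the entire input has to fit the pattern.
    public static readonly Regex Anchored =
        new Regex("^" + DataLine + "$", RegexOptions.IgnoreCase);
}
```

Here AnchorDemo.Unanchored.IsMatch("var result = calculate(foo, bar);") is true, since a single word such as "var" already satisfies the pattern, while AnchorDemo.Anchored rejects that line and still accepts "foo, bar".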
Upvotes: 4