Reputation: 4524
I am writing a regex to find rows in a text file that contain certain Unicode characters:
!Regex.IsMatch(colCount.line, @"^[\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]+$")
Below is the full code I have written:
var _fileName = @"C:\text.txt";
BadLinesLst = File
    .ReadLines(_fileName, Encoding.UTF8)
    .Select((line, index) =>
    {
        var count = line.Count(c => Delimiter == c) + 1;
        if (NumberOfColumns < 0)
            NumberOfColumns = count;
        return new
        {
            line = line,
            count = count,
            index = index
        };
    })
    .Where(colCount => colCount.count != NumberOfColumns
        || Regex.IsMatch(colCount.line, @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]"))
    .Select(colCount => colCount.line)
    .ToList();
The file contains the following rows:
264162-03,66,JITK,2007,12,874.000 ,0.000 ,0.000
6420œ50-00,67,JITK,2007,12,2292.000 ,0.000 ,0.000
4804¥75-00,67,JITK,2007,12,1810.000 ,0.000 ,0.000
If a row of the file contains any character outside BasicLatin, LatinExtended-A, or LatinExtended-B, I need to get that row. The regex above is not working properly: it also returns rows that contain LatinExtended-A or LatinExtended-B characters.
Upvotes: 1
Views: 573
Reputation: 626747
You just need to put the Unicode named block classes into a negated character class:
if (Regex.IsMatch(colCount.line,
        @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]"))
{ /* Do something here */ }
This regex will find partial matches (since Regex.IsMatch finds pattern matches inside larger strings). The pattern will match any character other than those in the \p{IsBasicLatin}, \p{IsLatinExtended-A} and \p{IsLatinExtended-B} Unicode blocks.
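As a quick check of the negated class against the three sample rows from the question, here is a minimal sketch. The class name BlockFilterDemo and the helper HasDisallowedChar are made up for illustration and are not part of the original code. It relies on œ (U+0153) belonging to Latin Extended-A and ¥ (U+00A5) belonging to Latin-1 Supplement, so only the third row should be flagged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockFilterDemo
{
    // Matches one character that lies outside the three allowed Unicode blocks.
    private static readonly Regex OutsideAllowedBlocks = new Regex(
        @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]");

    // Hypothetical helper: true when the line has at least one disallowed character.
    public static bool HasDisallowedChar(string line) => OutsideAllowedBlocks.IsMatch(line);

    public static void Main()
    {
        string[] rows =
        {
            "264162-03,66,JITK,2007,12,874.000 ,0.000 ,0.000",
            // \u0153 is œ (Latin Extended-A), so this row is allowed.
            "6420\u015350-00,67,JITK,2007,12,2292.000 ,0.000 ,0.000",
            // \u00A5 is ¥ (Latin-1 Supplement), so this row is flagged.
            "4804\u00A575-00,67,JITK,2007,12,1810.000 ,0.000 ,0.000"
        };

        foreach (var row in rows)
            Console.WriteLine($"{HasDisallowedChar(row)}: {row}");
    }
}
```

Because IsMatch succeeds on a partial match, a single character outside the three blocks is enough to flag the row, which is what the Where clause in the question needs.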
You may also want to check the following code:
if (Regex.IsMatch(colCount.line,
        @"^[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]*$"))
{ /* Do something here */ }
This will return true if the whole colCount.line string does not contain any character from the 3 Unicode blocks specified in the negated character class, or if the string is empty (if you want to reject empty strings, replace * with + at the end).
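To see how the anchored variant differs between the * and + quantifiers, here is a small sketch. The class AnchoredBlockDemo and both method names are hypothetical, introduced only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchoredBlockDemo
{
    // One character outside the three allowed Unicode blocks.
    private const string Disallowed =
        @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]";

    // True when every character is outside the three blocks, or the string is empty.
    public static bool OnlyDisallowedOrEmpty(string s) =>
        Regex.IsMatch(s, "^" + Disallowed + "*$");

    // Same check, but the + quantifier rejects the empty string.
    public static bool OnlyDisallowedNonEmpty(string s) =>
        Regex.IsMatch(s, "^" + Disallowed + "+$");

    public static void Main()
    {
        // \u00A5 is ¥, which lies in Latin-1 Supplement (outside the three blocks).
        string[] samples = { "\u00A5\u00A5", "", "4804\u00A575" };

        foreach (var s in samples)
            Console.WriteLine(
                $"'{s}': star={OnlyDisallowedOrEmpty(s)}, plus={OnlyDisallowedNonEmpty(s)}");
    }
}
```

Note that a mixed string such as "4804¥75" fails both anchored checks, because the digits are Basic Latin. For the question's goal of flagging any row that contains at least one disallowed character, the unanchored pattern is the one that fits.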
Upvotes: 1