Rocky

Reputation: 4524

How to use Unicode in Regex

I am writing a regex to find the rows in a text file that match certain Unicode characters:

!Regex.IsMatch(colCount.line, @"^[\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]+$")

Below is the full code I have written:

var _fileName = @"C:\text.txt";

BadLinesLst = File
    .ReadLines(_fileName, Encoding.UTF8)
    .Select((line, index) =>
    {
        // Number of columns = number of delimiters + 1.
        var count = line.Count(c => Delimiter == c) + 1;

        // If NumberOfColumns is still negative, take it from this line.
        if (NumberOfColumns < 0)
            NumberOfColumns = count;

        return new
        {
            line = line,
            count = count,
            index = index
        };
    })
    // Keep rows with the wrong column count, or with any character outside the three blocks.
    .Where(colCount => colCount.count != NumberOfColumns
        || Regex.IsMatch(colCount.line, @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]"))
    .Select(colCount => colCount.line)
    .ToList();

The file contains the rows below:

264162-03,66,JITK,2007,12,874.000 ,0.000 ,0.000

6420œ50-00,67,JITK,2007,12,2292.000 ,0.000 ,0.000

4804¥75-00,67,JITK,2007,12,1810.000 ,0.000 ,0.000

If a row of the file contains any character outside BasicLatin, LatinExtended-A, or LatinExtended-B, I need to get that row. The regex above is not working properly: it also returns rows that contain LatinExtended-A or LatinExtended-B characters.

Upvotes: 1

Views: 573

Answers (1)

Wiktor Stribiżew

Reputation: 626747

You just need to put the Unicode block classes into a negated character class:

if (Regex.IsMatch(colCount.line, 
         @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]")) 
{ /* Do something here */ }

This regex finds partial matches, since Regex.IsMatch looks for the pattern anywhere inside the input string. The pattern matches any single character that is not in the \p{IsBasicLatin}, \p{IsLatinExtended-A} or \p{IsLatinExtended-B} Unicode blocks.
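As a rough illustration, the negated class can be tried on the three sample rows from the question. This is only a sketch: the class name `BlockCheck` and the helper `HasDisallowedChar` are hypothetical names introduced here, and the non-ASCII characters are written as `\u` escapes so the result does not depend on the source file encoding.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockCheck
{
    // Matches one character that lies outside the three allowed Unicode blocks.
    private static readonly Regex OutsideAllowedBlocks = new Regex(
        @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]");

    // Hypothetical helper: true if the line contains at least one such character.
    public static bool HasDisallowedChar(string line)
    {
        return OutsideAllowedBlocks.IsMatch(line);
    }

    public static void Main()
    {
        string[] rows =
        {
            "264162-03,66,JITK,2007,12,874.000 ,0.000 ,0.000",
            // \u0153 is the "oe" ligature, which is in the Latin Extended-A block.
            "6420\u015350-00,67,JITK,2007,12,2292.000 ,0.000 ,0.000",
            // \u00A5 is the yen sign, which is in the Latin-1 Supplement block.
            "4804\u00A575-00,67,JITK,2007,12,1810.000 ,0.000 ,0.000",
        };

        for (int i = 0; i < rows.Length; i++)
        {
            Console.WriteLine("row " + (i + 1) + ": " + HasDisallowedChar(rows[i]));
        }
        // row 1: False
        // row 2: False
        // row 3: True
    }
}
```

Only the third row is reported, because the yen sign sits in Latin-1 Supplement, which is not one of the three listed blocks. If such characters should be accepted as well, \p{IsLatin-1Supplement} could be added to the class.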

You may also want to check the following code:

if (Regex.IsMatch(colCount.line, 
     @"^[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]*$")) 
{ /* Do something here */ }

This returns true only if the whole colCount.line string contains no character from the three Unicode blocks listed in the negated character class, or if the string is empty (to reject empty strings, replace * with + at the end).
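To make the difference between the anchored forms concrete, here is a small sketch that contrasts the anchored negated class with an anchored positive class. The names `AnchoredCheck`, `AllOutside` and `AllInside` are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchoredCheck
{
    private const string Blocks =
        @"\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}";

    // True when EVERY character is outside the three blocks (or the string is empty).
    public static bool AllOutside(string s)
    {
        return Regex.IsMatch(s, "^[^" + Blocks + "]*$");
    }

    // True when EVERY character is inside the three blocks (or the string is empty).
    public static bool AllInside(string s)
    {
        return Regex.IsMatch(s, "^[" + Blocks + "]*$");
    }

    public static void Main()
    {
        string yenRow = "4804\u00A575-00"; // contains the yen sign (Latin-1 Supplement)
        string oeRow = "6420\u015350-00";  // contains the "oe" ligature (Latin Extended-A)

        Console.WriteLine(AllOutside(yenRow));         // False: the digits are Basic Latin
        Console.WriteLine(AllOutside("\u00A5\u00A5")); // True: only yen signs
        Console.WriteLine(AllOutside(""));             // True: empty string matches *
        Console.WriteLine(!AllInside(yenRow));         // True: row has a disallowed character
        Console.WriteLine(!AllInside(oeRow));          // False: row is fully within the blocks
    }
}
```

For rows like the ones in the question, which always contain ASCII digits and commas, `AllOutside` never returns true. Negating the anchored positive class (`!AllInside(line)`) selects the same rows as the unanchored negated class.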

Upvotes: 1
