ayush

Reputation: 14568

C# regular expression

I have output like this:

Col.A              Col.B  Col.C  Col.D
--------------------------------------------------------------
* 1  S60-01-GE-44T-AC   SGFM115001195  7520051202   A
  1  S60-PWR-AC         APFM115101302  7520047802   A
  1  S60-PWR-AC         APFM115101245  7520047802   A

or

 Col.A               Col.B  Col.C  Col.D
--------------------------------------------------------------
* 0  S50-01-GE-48T-AC   DL252040175    7590005605   B
  0  S50-PWR-AC         N/A            N/A          N/A
  0  S50-FAN            N/A            N/A          N/A

For these outputs, the regular expression

(?:\*)?\s+(?<unitno>\d+)\s+\S+-\d+-(?:GE|TE)?-?(?:\d+(?:F|T))-?(?:(?:AC)|V)?\s+(?<serial>\S+)\s+\S+\s+\S+\s+\n

works fine for capturing Column A and Column B. Recently, however, I got a new kind of output:

 Col.A               Col.B  Col.C  Col.D  
---------------------------------------------------------
* 0  S4810-01-64F       HADL120620060  7590009602   A        
  0  S4810-PWR-AC       H6DL120620060  7590008502   A          
  0  S4810-FAN          N/A            N/A          N/A         
  0  S4810-FAN          N/A            N/A          N/A  

As you can see, the patterns "GE|TE" and "AC|V" are missing from this output. How do I change my regular expression accordingly while maintaining backward compatibility?

EDIT:

The output that you see arrives as one complete string, and due to some operational limits I cannot use anything other than regex to get my desired values. I know using split would be ideal here, but I cannot.

Upvotes: 0

Views: 497

Answers (5)

Olivier Jacot-Descombes

Reputation: 112324

A regular expression does not seem to be the right approach here. Use a positional approach:

string s = "* 0  S4810-01-64F       HADL120620060  7590009602   A";

bool withStar = s[0] == '*';
string nr = s.Substring(2, 2).Trim();
string colA = s.Substring(5, 18).TrimEnd();
string colB = s.Substring(24, 14).TrimEnd();
...

UPDATE

If you want to (or must) stick to Regex, test for the spaces instead of the values. Of course, this works only if the values never contain spaces.

string[] result = Regex.Split(s, @"\s+");

Of course, you can also search for non-spaces (\S) instead of spaces (\s).

MatchCollection matches = Regex.Matches(s, @"\S+");

or excluding the star

(?:\*)?[^*\s]+

Upvotes: 2

shf301

Reputation: 31394

You are probably better off using String.Split() to break the column values out into separate strings and then processing them, rather than using a huge, unreadable regular expression.

foreach (string line in lines) {
    string[] columnValues = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    ...
}

Upvotes: 2

vane

Reputation: 2215

Why not try something like this:

(?:\*)?\s+(?<unitno>\d+)\s+\S+\s+(?<serial>\S+)\s+\S+\s+\S+(?:\s+)?\n

This is built off your provided regular expression. Because of the trailing \n, every row in the input, including the last one, needs to end with a newline.
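For what it's worth, here is a minimal sketch of how this relaxed pattern might be used from C#. The names UnitParser and Parse are made up for illustration and are not part of the question. It also assumes the whole report arrives as one string in which every row ends with \n.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UnitParser
{
    // The relaxed pattern from this answer: the model column is matched as a
    // plain \S+ token instead of being spelled out piece by piece.
    private static readonly Regex Row = new Regex(
        @"(?:\*)?\s+(?<unitno>\d+)\s+\S+\s+(?<serial>\S+)\s+\S+\s+\S+(?:\s+)?\n");

    // Returns one (unit number, serial) pair per data row found in the report.
    public static List<(string UnitNo, string Serial)> Parse(string output)
    {
        var rows = new List<(string UnitNo, string Serial)>();
        foreach (Match m in Row.Matches(output))
        {
            rows.Add((m.Groups["unitno"].Value, m.Groups["serial"].Value));
        }
        return rows;
    }
}
```

Unlike the original expression, this also matches the PWR and FAN rows, so rows whose serial is N/A would have to be filtered out afterwards if only the chassis row is wanted.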

Upvotes: 1

SAJ14SAJ

Reputation: 1708

I would not use regular expressions to parse these reports.

Instead, treat them as fixed column width reports after the headers are stripped off.

I would do something like (this is typed cold as an example, not tested even for syntax):

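   // Leaving off all error detection stuff
   using System.Collections.Generic;

   class ColumnDef
   {
        public string Name { get; set; }
        public int FirstCol { get; set; }   // zero-based index of the column's first character
        public int LastCol { get; set; }    // zero-based index of the column's last character (inclusive)
   }

   ColumnDef[] report = new ColumnDef[]
   {
         new ColumnDef
         { Name = "ColA",
           FirstCol = 0,
           LastCol = 2
         },
         /// ... and so on for each column
   };

   IDictionary<string, string> ParseDataLine(string line)
   {
       var values = new Dictionary<string, string>();
       foreach (var c in report)
       {
          // Substring takes (start, length), so convert the inclusive end index to a length.
          // Note: this throws if the line is shorter than LastCol + 1 characters.
          values[c.Name] = line.Substring(c.FirstCol, c.LastCol - c.FirstCol + 1).Trim();
       }
       return values;
   }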

This is an example of a generic ETL (Extract, Transform, and Load) problem--specifically the Extract stage.

You will have to strip out header and footer lines before using ParseDataLine, and I am not sure there is enough information shown to do that. Based on what your post says, any line that is blank, or doesn't start with a space or a * is a header/footer line to be ignored.

Upvotes: 1

Your regular expression doesn't even need GE or TE. Note the ? at the end of (?:GE|TE)?

That means that the preceding group or symbol is optional.

The same is true of the AC and V section.
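One way to check this is to test the model-name part of the question's pattern on its own. The sketch below is only an illustration: the class name ModelPatternCheck and the ^...$ anchors are additions for this check and do not appear in the original expression.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ModelPatternCheck
{
    // Model-name portion of the question's regex, anchored so that the
    // whole token must match (the anchors are added for this check only).
    private static readonly Regex Model = new Regex(
        @"^\S+-\d+-(?:GE|TE)?-?(?:\d+(?:F|T))-?(?:(?:AC)|V)?$");

    // True when the given model name fits the question's model sub-pattern.
    public static bool Matches(string model)
    {
        return Model.IsMatch(model);
    }
}
```

With this, ModelPatternCheck.Matches returns true for both "S60-01-GE-44T-AC" and "S4810-01-64F", and false for "S4810-PWR-AC". The PWR and FAN rows fail in the old and new outputs alike, because the mandatory -\d+- and \d+(?:F|T) parts have nothing to match.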

Upvotes: 1
