Havoux

Reputation: 165

C# Regex syntax help for parsing string

I have this

var regex = new Regex(@"StartDate:(.*)EndDate:(.*)W.*Status:(.*)");

So this captures values until it hits a W in the string, correct? I need it to stop at a W or an S. I have tried a few different ways but can't get it to work. Does anyone have any pointers?

More info:

            record = record.Replace(" ", "").Replace("\r\n", "").Replace("-", "/");
            var regex = new Regex(@"StartDate:(.*)EndDate:(.*)W.*Status:(.*)");
            string strStartDate = regex.Match(record).Groups[1].ToString();
            string strEndDate = regex.Match(record).Groups[2].ToString();
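            // Compare case-insensitively; the original ToUpper().StartsWith("In") could never be true
            string Status = regex.Match(record).Groups[3].ToString().StartsWith("In", StringComparison.OrdinalIgnoreCase) ? "Inactive" : "Active";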

I am trying to parse a big string of values, but I only want 3 things: Start Date, End Date, and Status (active/inactive). However, there are 3 different values for each (3 start dates, 3 end dates, 3 statuses).

The first 2 strings look like this:

"Start Date: 

 2014-09-08 



End Date: 

 2017-09-07 



Warranty Type: 

 XXX 



Status: 

 Active 



Serial Number/IMEI: 

 XXXXXXXXXXX









Description:



XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

The 3rd string is like this

"Start Date: 

 2014-09-08 



End Date: 

 2017-09-07 



Status: 

 Active 



Warranty Upgrade Code:



SVC_PRIORITY"

On the last string I am not getting the 2 dates. I'm guessing that is because of the W.* after the end date: the last string has no W between the end date and the status.

Upvotes: 1

Views: 125

Answers (4)

buckley

Reputation: 14079

There is no need to replace the newlines in your example:

List<string> resultList = new List<string>();

var subjectString = @"Start Date: xxxxx
End Date: yyyy
Warranty Type: zzzz
Status: uuuu
Start Date: aaaa
End Date: bbbb
Status: cccc";

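// \r?\n and [^\r\n]* keep a stray \r out of the captures when the text uses Windows line endings
Regex regexObj = new Regex(@"Start Date: (.*?)\r?\nEnd Date: (.*?)\r?\n(.|\n)*?Status: ([^\r\n]*)");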
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
    resultList.Add(matchResult.Groups[1].Value);
    resultList.Add(matchResult.Groups[2].Value);
    resultList.Add(matchResult.Groups[4].Value);
    matchResult = matchResult.NextMatch();
} 

Upvotes: 1

ΩmegaMan

Reputation: 31616

Avoid .*; it's a catch-all that gets regex pattern authors into trouble. Instead, build the pattern around something specific that always occurs in the data.

Your patterns are the two dates, each of the form \d\d\d\d-\d\d-\d\d; the rest is fixed text, which should be used as static anchors that can be skipped over.
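
To illustrate the difference, here is a minimal sketch run against the third record from the question, flattened the same way the question's code does it (whitespace removed, - replaced with /). The GreedyDemo class and the FirstGroup helper are made-up names for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Returns the text captured by group 1, or null when the pattern does not match.
    public static string FirstGroup(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Third record from the question after the Replace(...) calls.
        string record = "StartDate:2014/09/08EndDate:2017/09/07Status:ActiveWarrantyUpgradeCode:SVC_PRIORITY";

        // The question's pattern needs a W between the end date and "Status:",
        // which this record does not have, so nothing matches at all.
        string loose = FirstGroup(record, @"StartDate:(.*)EndDate:(.*)W.*Status:(.*)");
        Console.WriteLine(loose ?? "(no match)");   // prints "(no match)"

        // A pattern that spells out the date shape does not care what follows.
        string strict = FirstGroup(record, @"StartDate:(\d{4}/\d{2}/\d{2})EndDate:");
        Console.WriteLine(strict ?? "(no match)");  // prints "2014/09/08"
    }
}
```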

Here is an example that looks for the date patterns. Once they are found, the regex puts them into named capture groups, (?<GroupNameHere>...), and LINQ projects each match into an anonymous type and parses the dates.

Data

Note the first date is reversed as per your example

var data = @"Start Date:

 2014-09-08

End Date:

 2017-09-07

Status:

 Active

Start Date:

 2014-09-09

End Date:

 2017-09-10

Status:

 In-Active
 ";

Pattern

string pattern = @"
^Start\sDate:\s+                     # An anchor of start date that always starts at the BOL
(?<Start>\d\d\d\d-\d\d-\d\d)         # actual start date pattern
\s+                                  # a lot of space including \r\n
^End\sDate:\s+                       # End date anchor and space
(?<End>\d\d\d\d-\d\d-\d\d)           # pattern of the end date.
\s+                                  # Same pattern as above for Status
^Status:\s+
(?<Status>[^\s]+)
 ";

Processing

// ExplicitCapture tells the parser to capture only the named groups and ignore any unnamed parentheses (...).
// Multiline makes ^ and $ match at the beginning and end of each line, not only of the whole buffer.
// IgnorePatternWhitespace lets us lay out and comment the pattern; literal whitespace in the pattern is ignored, hence the \s.
var results = Regex.Matches(data, pattern, RegexOptions.ExplicitCapture |
                                           RegexOptions.Multiline       |
                                           RegexOptions.IgnorePatternWhitespace)
     .OfType<Match>()
     .Select(mt => new
            {
                Status    = mt.Groups["Status"].Value,
                StartDate = DateTime.Parse(mt.Groups["Start"].Value),
                EndDate   = DateTime.Parse(mt.Groups["End"].Value)
            })
     .ToList();


Upvotes: 0

Quinn

Reputation: 4504

EDIT Please try this function, which parses the string using a regex:

using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Linq;
using System.Windows.Forms;

private static List<string[]> parseString(string input)
{
    var pattern = @"Start\s+Date:\s+([0-9-]+)\s+End\s+Date:\s+([0-9-]+)\s+(?:Warranty\s+Type:\s+\w+\s+)?Status:\s+(\w+)\s*";
    return Regex.Matches(input, pattern).Cast<Match>().ToList().ConvertAll(m => new string[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value });

}

// To show the result string
var result1 = parseString(str1);
string result_string = string.Join("\n", result1.ConvertAll(r => string.Format("Start Date: {0}\nEnd Date: {1}\nStatus: {2}", r)).ToArray());
MessageBox.Show(result_string);


EDIT2 For OP's situation, you could call the function from inside the foreach loop like this:

foreach (HtmlElement el in webBrowser1.Document.GetElementsByTagName("div"))
{
    if (el.GetAttribute("className") == "fluid-row Borderfluid")
    {
        string record = el.InnerText;
        //if record is the string to parse
        var result = parseString(record);
        var result_string = string.Join("\n", result.ConvertAll(r => string.Format("Start Date: {0}\nEnd Date: {1}\nStatus: {2}", r)).ToArray());
        MessageBox.Show(result_string);
    }
}

Upvotes: 1

Wiktor Stribiżew

Reputation: 626826

You may replace your code with the following one (see IDEONE demo):

var s = @"Start Date: xxxxx
End Date: xxxx
Warranty Type: xxxx
Status: xxxx";
var res = Regex.Replace(s, @":\s+", ": ")            // Remove excessive whitespace
        .Split(new[] { "\r", "\n" }, StringSplitOptions.RemoveEmptyEntries) // Split into lines
        .Select(line => line.Split(new[] { ": " }, 2, StringSplitOptions.None)) // Split each line at `:`+space
        .ToDictionary(n => n[0], n => n[1]);              // Create a dictionary
string strStartDate = string.Empty;
string strEndDate = string.Empty;
string Status = string.Empty;
string Warranty = string.Empty;
// Demo & variable assignment
if (res.ContainsKey("Start Date")) {
    Console.WriteLine(res["Start Date"]);
    strStartDate = res["Start Date"];
}
if (res.ContainsKey("Warranty Type")) {
    Console.WriteLine(res["Warranty Type"]);
    Warranty = res["Warranty Type"];
}
if (res.ContainsKey("End Date")) {
    Console.WriteLine(res["End Date"]);
    strEndDate = res["End Date"];
}
if (res.ContainsKey("Status")) {
    Console.WriteLine(res["Status"]);
    Status = res["Status"];
}

Note that the best approach is to declare your own class with the fields like WarrantyType, StartDate, etc. and initialize that right in the LINQ code.
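
A rough sketch of what that could look like, assuming one record per input string and the "Label: value" layout shown above. WarrantyRecord and WarrantyParser are hypothetical names chosen for this example, and the dates are kept as strings for simplicity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical container for the fields of one record.
public class WarrantyRecord
{
    public string StartDate { get; set; }
    public string EndDate { get; set; }
    public string WarrantyType { get; set; }  // stays null when the record has no such line
    public string Status { get; set; }
}

public static class WarrantyParser
{
    // Assumes a single record in which each label appears at most once;
    // ToDictionary throws on duplicate keys.
    public static WarrantyRecord Parse(string text)
    {
        Dictionary<string, string> fields = Regex.Replace(text, @":\s+", ": ")
            .Split(new[] { "\r", "\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Split(new[] { ": " }, 2, StringSplitOptions.None))
            .Where(parts => parts.Length == 2)   // skip lines that are not "Label: value"
            .ToDictionary(parts => parts[0].Trim(), parts => parts[1].Trim());

        // Look up a label, returning null when it is missing.
        string Get(string label) => fields.TryGetValue(label, out var v) ? v : null;

        return new WarrantyRecord
        {
            StartDate    = Get("Start Date"),
            EndDate      = Get("End Date"),
            WarrantyType = Get("Warranty Type"),
            Status       = Get("Status")
        };
    }
}

public static class Program
{
    public static void Main()
    {
        var record = WarrantyParser.Parse(
            "Start Date: \n 2014-09-08 \nEnd Date: \n 2017-09-07 \nStatus: \n Active \n");

        Console.WriteLine(record.StartDate);             // 2014-09-08
        Console.WriteLine(record.EndDate);               // 2017-09-07
        Console.WriteLine(record.Status);                // Active
        Console.WriteLine(record.WarrantyType == null);  // True
    }
}
```

If the input holds several records back to back, it would need to be split into individual records first, since repeated labels would otherwise collide in the dictionary.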

Upvotes: 0
