Reputation: 2452
The goal of my C# app is to extract two decimal values (latitude and longitude) from a text document. I tried to use a regular expression to pick out those numbers. It is an older app targeting .NET Framework 3.5.
using System.Text.RegularExpressions;
String BB = "<span style=\"font-family:&quot;Times&quot;,&quot;serif&quot;\">\r\n<i>Lat</i>: 29.48434, <i>Long</i>: -81.562445 <o:p></o:p></span></p>\r\n</td>\r\n</tr>\r\n</tbody>\r\n</table>\r\n<p class=\"MsoNormal\"><span style=\"font-family:&quot;Times&quot;,&quot;serif&quot;\"><o:p> </o:p></span></p>\r\n<table class=\"MsoNormalTable\" border=\"0\" cellpadding=\"0\">\r\n<tbody>\r\n<tr>\r\n<td style=\"padding:.75pt .75pt .75pt .75pt\">\r\n<p class=\"MsoNormal\"><b><span style=\"font-family:&quot;Times&quot;,&quot;serif&quot;\">Coordinates:</span></b><span style=\"font-family:&quot;Times&quot;,&quot;serif&quot;\">\r\n<i>Lat</i>: 29.48434, <i>Long</i>: -81.562445 <o:p></o:p></span></p>\r\n</td>";
string p2 = @".*Lat\D+(-*[0-9]+\.[0-9]+)\D+Lon\D+(-*[0-9]+\.[0-9]+)";
Console.WriteLine(p2);
foreach (Match match in Regex.Matches(BB, p2)) {
    // Groups[0] is the whole match; Groups[1] and Groups[2] are the captures.
    foreach (Group gp in match.Groups) {
        Console.WriteLine("Match group {0}", gp.Value);
    }
}
I expected Groups[2] to include the '-' sign before 81.562445, but the sign is dropped even though the text matches the sub-pattern "(-*[0-9]+\.[0-9]+)". Is there anything I can do to make the group include the '-' sign?
Upvotes: 0
Views: 59
Reputation: 920
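Your pattern uses a greedy \D+ to skip the non-digit characters before the latitude and longitude values. Because - is not a digit, \D+ consumes it before the capture group gets a chance to match it, and since -* can match zero characters, the group simply starts at the first digit. To make the non-digit match non-greedy (lazy), add a ? after the quantifier (\D+?), making your final pattern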
string p2 = @".*Lat\D+?(-?[0-9]+\.[0-9]+)\D+Lon\D+?(-?[0-9]+\.[0-9]+)";
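To see the difference side by side, here is a minimal sketch that runs a greedy and a lazy variant against a shortened copy of your input. The class name LazyQuantifierDemo, the Extract helper, and the shortened sample string are made up for illustration, and I dropped the leading .* since it is not needed to find the values. It sticks to APIs that exist in .NET 3.5.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class LazyQuantifierDemo
{
    // Greedy \D+ swallows the '-' because '-' is a non-digit character.
    public const string Greedy = @"Lat\D+(-?[0-9]+\.[0-9]+)\D+Lon\D+(-?[0-9]+\.[0-9]+)";

    // Lazy \D+? gives up characters as early as possible, leaving '-' for the group.
    public const string Lazy = @"Lat\D+?(-?[0-9]+\.[0-9]+)\D+Lon\D+?(-?[0-9]+\.[0-9]+)";

    // Returns { latitude, longitude } from the first match, or null if nothing matches.
    public static double[] Extract(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success) return null;

        // InvariantCulture so '.' is always read as the decimal separator.
        return new double[] {
            double.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture),
            double.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture)
        };
    }

    static void Main()
    {
        // Shortened sample (assumption: same shape as the fragment in the question).
        string sample = "<i>Lat</i>: 29.48434, <i>Long</i>: -81.562445 <o:p></o:p>";

        double[] greedy = Extract(sample, Greedy);
        double[] lazy = Extract(sample, Lazy);

        Console.WriteLine("greedy longitude is negative: {0}", greedy[1] < 0);
        Console.WriteLine("lazy longitude is negative:   {0}", lazy[1] < 0);
    }
}
```

With the greedy pattern the longitude comes back positive because the sign was eaten by \D+; with the lazy pattern the sign ends up inside the group.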
As for the comment about parsing the HTML nodes instead of matching with a regex: this is generally better, but in this case it doesn't gain you much, as the inner text of the relevant elements turns out to be
"\r\nLat: 29.48434, Long: -81.562445 "
and
"\r\n\r\n\r\n\r\nCoordinates:\r\nLat: 29.48434, Long: -81.562445 \r\n"
both of which require similar amounts of massaging to tease out the required data, likely with a regex anyway, unless an exact match can be expected with the remaining content.
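If you do go the inner-text route, the regex can at least be tighter, since the tags are gone and the labels can be matched literally. A rough sketch, assuming the inner text always has the "Lat: ..., Long: ..." shape shown above; the InnerTextDemo class, the TryExtract helper, and the pattern itself are my own illustration, not something taken from your code:

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class InnerTextDemo
{
    // Assumed shape: "Lat: <number>, Long: <number>" with optional whitespace.
    static readonly Regex Coordinates =
        new Regex(@"Lat:\s*(-?[0-9]+\.[0-9]+),\s*Long:\s*(-?[0-9]+\.[0-9]+)");

    // Returns true and fills lat/lon when the inner text contains a coordinate pair.
    public static bool TryExtract(string innerText, out double lat, out double lon)
    {
        lat = 0;
        lon = 0;

        Match m = Coordinates.Match(innerText);
        if (!m.Success) return false;

        lat = double.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
        lon = double.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture);
        return true;
    }

    static void Main()
    {
        // The second inner-text string quoted above.
        string text = "\r\n\r\n\r\n\r\nCoordinates:\r\nLat: 29.48434, Long: -81.562445 \r\n";

        double lat, lon;
        if (TryExtract(text, out lat, out lon))
        {
            Console.WriteLine("lat < 0: {0}, lon < 0: {1}", lat < 0, lon < 0);
        }
    }
}
```

Because nothing like \D+ sits in front of the groups here, there is no quantifier that can eat the sign in the first place.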
Upvotes: 2