Reputation: 1498
I am using the following Regex
JOINTS.*\s*(?:(\d*\s*\S*\s*\S*\s*\S*)\r\n\s*)*
on the following type of data:
JOINTS DISPL.-X DISPL.-Y ROTATION
1 0.000000E+00 0.975415E+01 0.616921E+01
2 0.000000E+00 0.000000E+00 0.000000E+00
The idea is to extract two groups, each containing a line (starting with the joint number, 1, 2, etc.). The C# code is as follows:
string jointPattern = @"JOINTS.*\s*(?:(\d*\s*\S*\s*\S*\s*\S*)\r\n\s*)*";
MatchCollection mc = Regex.Matches(outFileSection, jointPattern );
foreach (Capture c in mc[0].Captures)
{
JointOutput j = new JointOutput();
string[] vals = c.Value.Split();
j.Joint = int.Parse(vals[0]) - 1;
j.XDisplacement = float.Parse(vals[1]);
j.YDisplacement = float.Parse(vals[2]);
j.Rotation = float.Parse(vals[3]);
joints.Add(j);
}
However, this does not work: rather than returning two captures (one for each repetition of the inner group), it returns a single one containing the entire block, including the column headers. Why does this happen? Does C# deal with non-capturing groups differently?
Finally, are RegExes the best way to do this? (I really do feel like I have two problems now.)
Upvotes: 16
Views: 20409
Reputation: 75252
mc[0].Captures is equivalent to mc[0].Groups[0].Captures. Groups[0] always refers to the whole match, so there will only ever be the one Capture associated with it. The part you're looking for is captured in group #1, so you should be using mc[0].Groups[1].Captures.
But your regex is designed to match the whole input in one attempt, so the Matches() method will always return a MatchCollection with only one Match in it (assuming the match is successful). You might as well use Match() instead:
Match m = Regex.Match(source, jointPattern);
if (m.Success)
{
foreach (Capture c in m.Groups[1].Captures)
{
Console.WriteLine(c.Value);
}
}
output:
1 0.000000E+00 0.975415E+01 0.616921E+01
2 0.000000E+00 0.000000E+00 0.000000E+00
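Combined with the parsing loop from the question, this might look like the sketch below. It rests on a few assumptions: the input uses \r\n line endings and ends with a line break (the pattern requires \r\n after every row), the numbers are in invariant-culture format, and a tuple stands in for the question's JointOutput class, which is not shown. It is written with C# 9 top-level statements for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Sample input; assumes CRLF line endings and a trailing line break.
string outFileSection =
    "JOINTS DISPL.-X DISPL.-Y ROTATION\r\n" +
    "1 0.000000E+00 0.975415E+01 0.616921E+01\r\n" +
    "2 0.000000E+00 0.000000E+00 0.000000E+00\r\n";

// Pattern from the question, unchanged.
string jointPattern = @"JOINTS.*\s*(?:(\d*\s*\S*\s*\S*\s*\S*)\r\n\s*)*";

// Tuple stand-in for the question's JointOutput class (an assumption).
var joints = new List<(int Joint, float X, float Y, float Rotation)>();

Match m = Regex.Match(outFileSection, jointPattern);
if (m.Success)
{
    // Group 1 holds one Capture per repetition of the inner group.
    foreach (Capture c in m.Groups[1].Captures)
    {
        // Split on spaces/tabs and drop empty entries caused by repeated spaces.
        string[] vals = c.Value.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        joints.Add((
            int.Parse(vals[0], CultureInfo.InvariantCulture) - 1, // zero-based, as in the question
            float.Parse(vals[1], NumberStyles.Float, CultureInfo.InvariantCulture),
            float.Parse(vals[2], NumberStyles.Float, CultureInfo.InvariantCulture),
            float.Parse(vals[3], NumberStyles.Float, CultureInfo.InvariantCulture)));
    }
}

foreach (var j in joints)
    Console.WriteLine($"{j.Joint}: {j.X} {j.Y} {j.Rotation}");
```

Passing CultureInfo.InvariantCulture keeps the parse independent of the machine's decimal separator, which matters for values such as 0.975415E+01.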
Upvotes: 14
Reputation: 98886
There are two problems: the repeating part (?:...) is not matching properly, and the .* is greedy and consumes all the input, so the repeating part never matches even if it could.
Use this instead:
JOINTS.*?[\r\n]+(?:\s*(\d+\s*\S*\s*\S*\s*\S*)[\r\n\s]*)*
This has a non-greedy leading part, ensures that the line-matching part starts on a new line (not in the middle of a title), and uses [\r\n\s]* in case the newlines are not exactly as you expect.
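As a rough illustration, the sketch below runs that pattern over sample data with bare \n line endings and no trailing line break. The input layout is an assumption chosen to exercise the more tolerant newline handling, and the code uses C# 9 top-level statements for brevity.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input with LF-only line endings and no final line break (assumed for illustration).
string source =
    "JOINTS DISPL.-X DISPL.-Y ROTATION\n" +
    "1 0.000000E+00 0.975415E+01 0.616921E+01\n" +
    "2 0.000000E+00 0.000000E+00 0.000000E+00";

string pattern = @"JOINTS.*?[\r\n]+(?:\s*(\d+\s*\S*\s*\S*\s*\S*)[\r\n\s]*)*";

Match m = Regex.Match(source, pattern);

// Each repetition of the inner group adds one Capture to group 1.
string[] rows = m.Groups[1].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToArray();

foreach (string row in rows)
    Console.WriteLine(row);
```

Because the row separator is [\r\n\s]* rather than a literal \r\n, the last row is still captured when the input does not end with a line break.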
Personally, I would use regexes for this, but I like regexes :-) If you happen to know that the structure of the string will always be [title][newline][newline][lines] then perhaps it's more straightforward (if less flexible) to just split on newlines and process accordingly.
Finally, you can use regex101.com or one of the many other regex testing sites to help debug your regular expressions.
Upvotes: 1
Reputation: 31721
Why not just capture the values and ignore the rest? Here is a regex that gets the values.
string data = @"JOINTS DISPL.-X DISPL.-Y ROTATION
1 0.000000E+00 0.975415E+01 0.616921E+01
2 0.000000E+00 0.000000E+00 0.000000E+00";
string pattern = @"^
\s*    # leading indentation is optional
(?<Joint>\d+)
\s+
(?<ValX>[^\s]+)
\s+
(?<ValY>[^\s]+)
\s+
(?<Rotation>[^\s]+)";
var result = Regex.Matches(data, pattern, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture)
.OfType<Match>()
.Select (mt => new
{
Joint = mt.Groups["Joint"].Value,
ValX = mt.Groups["ValX"].Value,
ValY = mt.Groups["ValY"].Value,
Rotation = mt.Groups["Rotation"].Value,
});
/* result is
IEnumerable<> (2 items)
Joint ValX ValY Rotation
1 0.000000E+00 0.975415E+01 0.616921E+01
2 0.000000E+00 0.000000E+00 0.000000E+00
*/
Upvotes: 3
Reputation: 21275
I would not use Regex for the heavy lifting; I would just split the text into lines and tokens and parse those.
var data = @" JOINTS DISPL.-X DISPL.-Y ROTATION
1 0.000000E+00 0.975415E+01 0.616921E+01
2 0.000000E+00 0.000000E+00 0.000000E+00";
var lines = data.Split('\r', '\n').Where(s => !string.IsNullOrWhiteSpace(s));
var regex = new Regex(@"(\S+)");
var dataItems = lines.Select(s => regex.Matches(s)).Select(m => m.Cast<Match>().Select(c => c.Value));
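Taking the same idea one step further, a regex-free sketch could split each line on whitespace and convert the fields directly. It assumes every data row has exactly four whitespace-separated fields, that the first field is an integer (which is how the header row is skipped), and that the numbers are in invariant-culture format. The tuple field names are illustrative, and the code uses C# 9 top-level statements for brevity.

```csharp
using System;
using System.Globalization;
using System.Linq;

string data =
    " JOINTS DISPL.-X DISPL.-Y ROTATION\n" +
    " 1 0.000000E+00 0.975415E+01 0.616921E+01\n" +
    " 2 0.000000E+00 0.000000E+00 0.000000E+00";

var parsed = data
    // One entry per non-empty line, whatever the line-ending style.
    .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
    // One entry per whitespace-separated field.
    .Select(line => line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
    // Keep rows with four fields whose first field is an integer; this drops the header.
    .Where(f => f.Length == 4 &&
                int.TryParse(f[0], NumberStyles.Integer, CultureInfo.InvariantCulture, out _))
    .Select(f => (
        Joint: int.Parse(f[0], CultureInfo.InvariantCulture),
        X: double.Parse(f[1], NumberStyles.Float, CultureInfo.InvariantCulture),
        Y: double.Parse(f[2], NumberStyles.Float, CultureInfo.InvariantCulture),
        Rotation: double.Parse(f[3], NumberStyles.Float, CultureInfo.InvariantCulture)))
    .ToList();

foreach (var p in parsed)
    Console.WriteLine($"{p.Joint}: {p.X} {p.Y} {p.Rotation}");
```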
Upvotes: 3