Regex that matches differently formatted lines in C#

Question

Format of file

POS ID         PosScore NegScore    SynsetTerms                          Gloss
a   00001740    0.125   0           able#1                              "able to swim"; "she was able to program her computer";
a   00002098    0       0.75        unable#1                            "unable to get to town without a car"; 
a   00002312    0       0           dorsal#2 abaxial#1                  "the abaxial surface of a leaf is the underside or side facing away from the stem"
a   00002843    0       0           basiscopic#1                         facing or on the side toward the base
a   00002956    0       0.23        abducting#1 abducent#1               especially of muscles; drawing away from the midline of the body or from an adjacent part
a   00003131    0       0           adductive#1 adducting#1 adducent#1   especially of muscles;

From this file I want to extract the ID, PosScore, NegScore and SynsetTerms fields. Extracting ID, PosScore and NegScore is easy; I use the following code for those fields:

Regex expression = new Regex(@"(\t(\d+)|(\w+)\t)");

var results = expression.Matches(input);
foreach (Match match in results)
{

    Console.WriteLine(match);
}
Console.ReadLine();

It gives the correct result, but the SynsetTerms field is a problem: some lines contain two or more words, and I don't know how to separate the words and pair each one with the line's PosScore and NegScore.

For example, the fifth line contains two words, abducting#1 and abducent#1, and both have the same scores.

So what regex would extract each word together with its scores from such a line, like this:

  Word                PosScore          NegScore 
  abducting#1         0                 0.23
  abducent#1          0                 0.23

Austin Salonen · Accepted Answer

The non-regex, string-splitting version might be easier:

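// Requires: using System; using System.Globalization; using System.Linq;
var data =
   lines.Split(new[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
        .Skip(1)                            // skip the header row
        .Select(line => line.Split('\t'))   // columns are tab-separated
        // one record per term in SynsetTerms (parts[4]), each sharing the line's scores
        .SelectMany(parts => parts[4].Split().Select(word => new
            {
                ID = parts[1],
                Word = word,
                // invariant culture so "0.23" parses the same way on every machine
                PosScore = decimal.Parse(parts[2], CultureInfo.InvariantCulture),
                NegScore = decimal.Parse(parts[3], CultureInfo.InvariantCulture)
            }));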
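
If a regex is still preferred, the same idea can be sketched as follows: match the tab-separated columns of each line with named groups, then split the captured SynsetTerms on whitespace. This is only a sketch under assumptions: the class name SynsetParser, the method Parse and the group names are made up for illustration, the file is assumed to be tab-separated as in the question, and scores are assumed to be non-negative decimals written with a dot.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class SynsetParser   // hypothetical name, not from the question
{
    // Assumed layout: POS <tab> ID <tab> PosScore <tab> NegScore <tab> SynsetTerms <tab> Gloss
    // The header row does not match because its second column ("ID") is not numeric.
    private static readonly Regex LinePattern = new Regex(
        @"^\w+\t(?<id>\d+)\t(?<pos>\d+(?:\.\d+)?)\t(?<neg>\d+(?:\.\d+)?)\t(?<terms>[^\t\r\n]+)",
        RegexOptions.Multiline);

    public static List<(string Id, string Word, decimal PosScore, decimal NegScore)> Parse(string input)
    {
        var rows = new List<(string Id, string Word, decimal PosScore, decimal NegScore)>();

        foreach (Match m in LinePattern.Matches(input))
        {
            // Parse with the invariant culture so "0.23" is read the same everywhere.
            decimal pos = decimal.Parse(m.Groups["pos"].Value, CultureInfo.InvariantCulture);
            decimal neg = decimal.Parse(m.Groups["neg"].Value, CultureInfo.InvariantCulture);

            // A null separator array makes Split break on any whitespace.
            string[] words = m.Groups["terms"].Value
                .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

            // Every word on the line gets that line's scores.
            foreach (string word in words)
            {
                rows.Add((m.Groups["id"].Value, word, pos, neg));
            }
        }

        return rows;
    }

    public static void Main()
    {
        string input =
            "POS\tID\tPosScore\tNegScore\tSynsetTerms\tGloss\n" +
            "a\t00002956\t0\t0.23\tabducting#1 abducent#1\tespecially of muscles\n";

        foreach (var row in Parse(input))
        {
            Console.WriteLine("{0}\t{1}\t{2}", row.Word, row.PosScore, row.NegScore);
        }
    }
}
```

For the sample line, Parse returns one row for abducting#1 and one for abducent#1, both carrying the scores 0 and 0.23. The tuple syntax needs C# 7 or later; on older compilers a small class with the same four properties would do the same job.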
