Reputation: 235
Format of the file:
POS ID PosScore NegScore SynsetTerms Gloss
a 00001740 0.125 0 able#1 "able to swim"; "she was able to program her computer";
a 00002098 0 0.75 unable#1 "unable to get to town without a car";
a 00002312 0 0 dorsal#2 abaxial#1 "the abaxial surface of a leaf is the underside or side facing away from the stem"
a 00002843 0 0 basiscopic#1 facing or on the side toward the base
a 00002956 0 0.23 abducting#1 abducent#1 especially of muscles; drawing away from the midline of the body or from an adjacent part
a 00003131 0 0 adductive#1 adducting#1 adducent#1 especially of muscles;
From this file I want to extract the ID, PosScore, NegScore and SynsetTerms fields. Extracting ID, PosScore and NegScore is easy; I use the following code for those fields:
Regex expression = new Regex(@"(\t(\d+)|(\w+)\t)");
var results = expression.Matches(input);
foreach (Match match in results)
{
    Console.WriteLine(match);
}
Console.ReadLine();
This gives the correct result, but the SynsetTerms field is a problem: some lines contain two or more words, and I don't know how to separate the words and pair each one with its PosScore and NegScore.
For example, the fifth line has two words, abducting#1 and abducent#1, and both share the same scores.
What regex would extract each word together with its scores from such a line, like this:
Word PosScore NegScore
abducting#1 0 0.23
abducent#1 0 0.23
Upvotes: 2
Views: 227
Reputation: 32787
You can use this regex
^(?<pos>\w+)\s+(?<id>\d+)\s+(?<pscore>\d+(?:\.\d+)?)\s+(?<nscore>\d+(?:\.\d+)?)\s+(?<terms>(?:.*?#[^\s]*)+)\s+(?<gloss>.*)$
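You can then build a list like this (regex holds the pattern above; RegexOptions.Multiline is needed so that ^ and $ match on every line when input contains the whole file):
var lst = Regex.Matches(input, regex, RegexOptions.Multiline)
    .Cast<Match>()
    .Select(x =>
        new
        {
            pos = x.Groups["pos"].Value,
            pscore = x.Groups["pscore"].Value,
            nscore = x.Groups["nscore"].Value,
            // several terms can share one line, so split them on whitespace
            terms = Regex.Split(x.Groups["terms"].Value, @"\s+"),
            gloss = x.Groups["gloss"].Value
        }
    );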
and now you can iterate over it:
foreach (var temp in lst)
{
    Console.WriteLine(temp.pos);
    // you can now iterate over the terms of this line
    foreach (var t in temp.terms)
    {
        Console.WriteLine(t);
    }
}
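To get the Word / PosScore / NegScore rows the question asks for, the pieces above can be combined roughly as follows. This is only a sketch: the class name SynsetRegexDemo, the method ExtractRows and the sample input are made up for illustration, and it assumes the fields are separated by tabs and that decimals use a dot.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

static class SynsetRegexDemo
{
    // Same pattern as in the answer; Multiline makes ^ and $ match at every line.
    static readonly Regex LinePattern = new Regex(
        @"^(?<pos>\w+)\s+(?<id>\d+)\s+(?<pscore>\d+(?:\.\d+)?)\s+(?<nscore>\d+(?:\.\d+)?)\s+(?<terms>(?:.*?#[^\s]*)+)\s+(?<gloss>.*)$",
        RegexOptions.Multiline);

    // Hypothetical helper: one (Word, PosScore, NegScore) row per term of each matching line.
    public static List<(string Word, decimal PosScore, decimal NegScore)> ExtractRows(string input)
    {
        return LinePattern.Matches(input)
            .Cast<Match>()
            .SelectMany(m => Regex.Split(m.Groups["terms"].Value.Trim(), @"\s+")
                .Select(word => (
                    Word: word,
                    // Invariant culture so "0.23" parses the same on every locale.
                    PosScore: decimal.Parse(m.Groups["pscore"].Value, CultureInfo.InvariantCulture),
                    NegScore: decimal.Parse(m.Groups["nscore"].Value, CultureInfo.InvariantCulture))))
            .ToList();
    }

    static void Main()
    {
        // Made-up sample input, assuming tab-separated fields.
        string input =
            "a\t00002098\t0\t0.75\tunable#1\t\"unable to get to town without a car\";\n" +
            "a\t00002956\t0\t0.23\tabducting#1 abducent#1\tespecially of muscles; drawing away from the midline of the body\n";

        Console.WriteLine("Word PosScore NegScore");
        foreach (var row in ExtractRows(input))
        {
            Console.WriteLine($"{row.Word} {row.PosScore} {row.NegScore}");
        }
    }
}
```

One caveat with this pattern: the terms group is greedy, so a gloss that itself contains a # character may be partly swallowed into terms. If that can happen in your data, splitting on tabs is likely more robust.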
Upvotes: 1
Reputation: 50215
The non-regex, string-splitting version might be easier:
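var data =
    lines.Split(new[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
        .Skip(1) // skip the header line
        .Select(line => line.Split('\t')) // fields are tab-separated
        // one result per word in SynsetTerms, each carrying the line's scores
        .SelectMany(parts => parts[4].Split().Select(word => new
        {
            ID = parts[1],
            Word = word,
            // invariant culture (using System.Globalization) so "0.125" parses the same on every locale
            PosScore = decimal.Parse(parts[2], CultureInfo.InvariantCulture),
            NegScore = decimal.Parse(parts[3], CultureInfo.InvariantCulture)
        }));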
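For completeness, here is a small self-contained sketch of the splitting approach run against sample data. The names SynsetSplitDemo and ExtractRows and the sample text are hypothetical; the sketch assumes a tab-separated file with one header line, accepts both \r\n and \n line endings, and skips lines with fewer than five fields.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class SynsetSplitDemo
{
    // Hypothetical helper: one row per word in the SynsetTerms column.
    public static List<(string Id, string Word, decimal PosScore, decimal NegScore)> ExtractRows(string lines)
    {
        return lines
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Skip(1)                                  // drop the header line
            .Select(line => line.Split('\t'))         // assumed tab-separated fields
            .Where(parts => parts.Length >= 5)        // ignore malformed lines
            .SelectMany(parts => parts[4]
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(word => (
                    Id: parts[1],
                    Word: word,
                    PosScore: decimal.Parse(parts[2], CultureInfo.InvariantCulture),
                    NegScore: decimal.Parse(parts[3], CultureInfo.InvariantCulture))))
            .ToList();
    }

    static void Main()
    {
        // Made-up sample input: a header followed by two data lines.
        string lines =
            "POS\tID\tPosScore\tNegScore\tSynsetTerms\tGloss\n" +
            "a\t00002098\t0\t0.75\tunable#1\t\"unable to get to town without a car\";\n" +
            "a\t00002956\t0\t0.23\tabducting#1 abducent#1\tespecially of muscles\n";

        Console.WriteLine("Word PosScore NegScore");
        foreach (var row in ExtractRows(lines))
        {
            Console.WriteLine($"{row.Word} {row.PosScore} {row.NegScore}");
        }
    }
}
```

Each word on a multi-word line ends up as its own row with the scores of that line, which is the shape asked for in the question.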
Upvotes: 5