Reputation: 47
What regex pattern would pick out the words of a tagged sentence whose tags follow one of these orders:
(B-NP)-(B-VP)-(B-NP)-(I-NP) or (B-NP)-(I-NP)-(B-VP)-(B-NP)-(I-NP)?
Example sentence:
(B-SBAR)After(B-SBAR) (B-NP)Chuck(B-NP) (I-NP)and(I-NP) (I-NP)David(I-NP) (B-VP)leave(B-VP) (B-NP)the(B-NP) (I-NP)gang(I-NP) (O),(O) (B-NP)the(B-NP) (I-NP)remaining(I-NP) (I-NP)group(I-NP) (B-ADVP)also(B-ADVP) (B-VP)split(B-VP) (B-PRT)up(B-PRT) (B-NP)into(B-NP) (I-NP)2(I-NP) (I-NP)groups(I-NP) (B-PP)of(B-PP) (B-NP)2(B-NP) (O)and(O) (B-VP)get(B-VP) (I-VP)to(I-VP) (I-VP)know(I-VP) (B-NP)each(B-NP) (I-NP)other(I-NP) (I-NP)a(I-NP) (I-NP)little(I-NP) (I-NP)better(I-NP) (O).(O)
It should be split into:
Upvotes: 0
Views: 56
Reputation: 627327
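You need named capturing groups here (fortunately, the .NET regex engine allows several groups to share the same name, so both alternatives can reuse the names). In the pattern below, each `\k<...>` back-reference matches the closing tag, and each `.*?` lazily skips whatever lies between two tokens.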
using System.Linq;
using System.Text.RegularExpressions;
var str = "(B-SBAR)After(B-SBAR) (B-NP)Chuck(B-NP) (I-NP)and(I-NP) (I-NP)David(I-NP) (B-VP)leave(B-VP) (B-NP)the(B-NP) (I-NP)gang(I-NP) (O),(O) (B-NP)the(B-NP) (I-NP)remaining(I-NP) (I-NP)group(I-NP) (B-ADVP)also(B-ADVP) (B-VP)split(B-VP) (B-PRT)up(B-PRT) (B-NP)into(B-NP) (I-NP)2(I-NP) (I-NP)groups(I-NP) (B-PP)of(B-PP) (B-NP)2(B-NP) (O)and(O) (B-VP)get(B-VP) (I-VP)to(I-VP) (I-VP)know(I-VP) (B-NP)each(B-NP) (I-NP)other(I-NP) (I-NP)a(I-NP) (I-NP)little(I-NP) (I-NP)better(I-NP) (O).(O)";
var rx = new Regex(@"(?<FstTag>\(B-NP\))(?<FstWrd>\w+)\k<FstTag>.*?(?<SndTag>\(B-VP\))(?<SndWrd>\w+)\k<SndTag>.*?(?<TrdTag>\(B-NP\))(?<TrdWrd>\w+)\k<TrdTag>.*?(?<FthTag>\(I-NP\))(?<FthWrd>\w+)\k<FthTag>|(?<FstTag>\(B-NP\))(?<FstWrd>\w+)\k<FstTag>.*?(?<SndTag>\(I-NP\))(?<SndWrd>\w+)\k<SndTag>.*?(?<TrdTag>\(B-VP\))(?<TrdWrd>\w+)\k<TrdTag>.*?(?<FthTag>\(B-NP\))(?<FthWrd>\w+)\k<FthTag>.*?(?<FfhTag>\(I-NP\))(?<FfhWrd>\w+)\k<FfhTag>");
// Join only the groups that captured something: FfhWrd is empty when the four-tag alternative matches.
var ms = rx.Matches(str).Cast<Match>().Select(p => string.Join(" ", new[] { "FstWrd", "SndWrd", "TrdWrd", "FthWrd", "FfhWrd" }.Select(n => p.Groups[n].Value).Where(w => w.Length > 0))).ToList();
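The shared-name idea can be pushed a little further. Here is a minimal sketch, not a drop-in replacement: every token reuses one group name and the words are read back from that group's `Captures` collection. The helper `Token`, the group name `Wrd`, and the sample input are hypothetical. Unlike the pattern above, this sketch assumes the tokens of a sequence are adjacent and separated by exactly one space, so it will not skip intervening tokens.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample input (not from the question), in the same tagged format.
var input = "(B-NP)Dogs(B-NP) (B-VP)chase(B-VP) (B-NP)the(B-NP) (I-NP)cat(I-NP) (O).(O) " +
            "(B-NP)The(B-NP) (I-NP)cat(I-NP) (B-VP)climbs(B-VP) (B-NP)a(B-NP) (I-NP)tree(I-NP)";

// The two tag orders from the question.
var orders = new[]
{
    new[] { "B-NP", "B-VP", "B-NP", "I-NP" },
    new[] { "B-NP", "I-NP", "B-VP", "B-NP", "I-NP" },
};

// Sub-pattern for one token such as (B-NP)word(B-NP).
// Every token reuses the group name "Wrd", which .NET permits.
static string Token(string tag)
{
    var t = Regex.Escape("(" + tag + ")");
    return t + @"(?<Wrd>\w+)" + t;
}

// Assumption: tokens in a sequence are separated by exactly one space.
var pattern = string.Join("|",
    orders.Select(o => "(?:" + string.Join(" ", o.Select(tag => Token(tag))) + ")"));
var rx = new Regex(pattern);

// Each match keeps every "Wrd" capture, in order, in Groups["Wrd"].Captures.
var phrases = rx.Matches(input)
    .Cast<Match>()
    .Select(m => string.Join(" ", m.Groups["Wrd"].Captures.Cast<Capture>().Select(c => c.Value)))
    .ToList();

foreach (var phrase in phrases)
    Console.WriteLine(phrase);
```

On the hypothetical input this should print `Dogs chase the cat` and `The cat climbs a tree`. Because the pattern is built from the `orders` array, adding another tag order only means adding another array entry.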
Upvotes: 1