Sturm

Reputation: 4125

Mapping Regex matches to original string

The following sentence

I have a red car

can be transformed to this string:

Pronoun Verb Determiner Adjective Noun

What I want is to find the parts of the original sentence that are noun phrases (NP). A simple pattern for an NP is (Determiner)*(Adjective)*(Noun), where * means that the group may appear zero or more times. The actual regex is:

public static string Regex = "((?:Determiner.?)*(?:Adjective.?)*(?:Noun.?))";

Using the following code it is possible to extract all NPs:

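        // structure is the tag string, e.g. "Pronoun Verb Determiner Adjective Noun"
        MatchCollection NPmatches = Regex.Matches(structure, NounPhrase.Regex);
        foreach (Match match in NPmatches)
        {
            // For a Match, Captures holds the match itself
            foreach (Capture NPcapture in match.Captures)
            {
                Console.WriteLine(NPcapture.Value);
            }
        }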

Output would be:

Determiner Adjective Noun

What I really need is the part of the original sentence corresponding to that structure (NP). In this case it would be:

a red car

I could work out where the regex match is located in the tag string and count words from there, but that is messy and error-prone. It would be great if this could be done with a LINQ expression combined with the regex, so that the source of the transformation stays in scope. Any thoughts?

PS. A sentence is transformed to types using this code:

RawSentence.Split(new char[] {' '}, StringSplitOptions.RemoveEmptyEntries).Select(i=>i.Type.ToString()).Aggregate((x,y) => x + " " + y);

Upvotes: 1

Views: 589

Answers (1)

Josh Heitz

Reputation: 68

I think you will need more than just a mapping from your original sentence to the words "Pronoun", "Verb", "Determiner", "Adjective", and "Noun". You did indicate that some parts of speech (i.e. your determiners, adjectives, and nouns) may occur zero or more times. If they appear more than once, then even if you did have a mapping from the original sentence down to your parts of speech, you wouldn't be able to get back to the original text because you would then have a one-to-many relationship. You would instead need to label your determiners, adjectives, and nouns uniquely, such as determiner1, determiner2, adjective1, noun1, noun2, noun3, etc. Once you have your unique mappings, you can go either direction with ease.
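
As a rough illustration of the unique-label idea, here is a minimal sketch. It assumes the words and their tags are already available as two parallel arrays and that tag names contain no digits. `NounPhraseMapper` and `Find` are hypothetical names, not something from the question. Each tag gets its word index appended, so the digits inside a regex match point straight back at the original words:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class NounPhraseMapper
{
    // Same NP shape as in the question, but every tag carries its word index,
    // e.g. "Determiner2 Adjective3 Noun4". Assumes tag names contain no digits.
    private static readonly Regex NounPhrase =
        new Regex(@"(?:\bDeterminer\d+ )*(?:\bAdjective\d+ )*\bNoun\d+\b");

    // words[i] is the i-th word of the sentence, tags[i] its part of speech.
    public static List<string> Find(string[] words, string[] tags)
    {
        // Label every tag uniquely by appending its position in the sentence.
        string structure = string.Join(" ", tags.Select((tag, i) => tag + i));

        var phrases = new List<string>();
        foreach (Match match in NounPhrase.Matches(structure))
        {
            // The numbers inside the match are the indices of the original words.
            IEnumerable<int> indices = Regex.Matches(match.Value, @"\d+")
                                            .Cast<Match>()
                                            .Select(m => int.Parse(m.Value));
            phrases.Add(string.Join(" ", indices.Select(i => words[i])));
        }
        return phrases;
    }
}
```

With the words of "I have a red car" and the tags Pronoun, Verb, Determiner, Adjective, Noun, `Find` should return a single entry, "a red car". Repeated determiners or adjectives are not a problem, because every label is distinct.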

Upvotes: 1
