kashif4u

Reputation: 175

Splitting string on conjoining words

I need to split a few strings into arrays on connecting words such as on, in, from, etc.

string sampleString = "what was total sales for pencils from Japan in 1999";

Desired result:

what was total sales

for pencils

from Japan

in 1999

I know how to split a string on a single word, but not on several words at the same time:

string[] stringArray = sampleString.Split(new string[] {"of"}, StringSplitOptions.None);

Any suggestions?

Upvotes: 1

Views: 77

Answers (1)

Lasse V. Karlsen

Reputation: 391306

For this particular scenario you can use a regular expression.

You will have to use something called a lookahead pattern, because otherwise the words you're splitting on would be removed from the results.

Here's a small LINQPad program that demonstrates:

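void Main()
{
    string sampleString = "what was total sales for pencils from Japan in 1999";
    // The trailing \b sits inside the lookahead so only whole words match
    // (e.g. "for" but not "fortune").
    Regex.Split(sampleString, @"\b(?=(?:of|for|in|from)\b)").Dump();
}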

Output:

what was total sales  
for pencils  
from Japan  
in 1999 
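To see why the lookahead matters, here is a minimal sketch as a plain console program (no LINQPad `.Dump()`), contrasting a pattern that consumes the keyword with one that only looks ahead. The class and method names are made up for illustration, and the pattern keeps the trailing `\b` inside the lookahead so that only whole words count as split points.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Consuming pattern: the matched keyword is part of the separator,
    // so it is removed from the results.
    public static string[] SplitConsuming(string input) =>
        Regex.Split(input, @"\b(?:of|for|in|from)\b");

    // Lookahead pattern: the match is zero-width, so each keyword
    // stays at the start of the piece that follows the split point.
    public static string[] SplitKeeping(string input) =>
        Regex.Split(input, @"\b(?=(?:of|for|in|from)\b)");

    public static void Main()
    {
        string sampleString = "what was total sales for pencils from Japan in 1999";

        Console.WriteLine("Consuming:");
        foreach (string part in SplitConsuming(sampleString))
            Console.WriteLine("[" + part + "]");

        Console.WriteLine("Lookahead:");
        foreach (string part in SplitKeeping(sampleString))
            Console.WriteLine("[" + part + "]");
    }
}
```

With the consuming pattern the pieces lose "for", "from" and "in"; with the lookahead they keep them. Either way the pieces retain the surrounding spaces, so you may want to call `Trim()` on each one.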

But, as I said in the comments, it's going to be tripped up by things like place names that contain any of the words you split on, so:

string sampleString = "what was total sales for pencils from the Isle of Islay in 1999";
Regex.Split(sampleString, @"\b(?=(?:of|for|in|from)\b)").Dump();

Output:

what was total sales  
for pencils  
from the Isle  
of Islay  
in 1999 

The regular expression can be rewritten like this to be more expressive for future maintenance:

Regex.Split(sampleString, @"
    \b              # Must be a word boundary here, so the split point is the start of a word
    (?=             # lookahead group, will match, but not be consumed/zero length
        (?:         # non-capturing group (Regex.Split would add captured text to the results)
            of      # List of words, separated by the OR operator, |
            |for
            |in
            |from
        )
        \b          # Also a word boundary, right after the split word
                    # makes sure we don't match words that start with a split word, like 'fortune'
    )", RegexOptions.IgnorePatternWhitespace).Dump();

You might also want to add RegexOptions.IgnoreCase to the options, to match "Of" and "OF", etc.
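As a rough sketch of how the case-insensitive option could be put together outside LINQPad, the console program below wraps the split in a helper. `KeywordSplitter` and `SplitOnKeywords` are hypothetical names, and trimming the pieces and dropping empty ones are my own additions rather than part of the answer above. The pattern keeps the trailing `\b` inside the lookahead, which matters once case is ignored: otherwise "In" would also match at the start of "India".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordSplitter
{
    // Zero-width split point in front of any whole keyword, ignoring case.
    private static readonly Regex SplitPoint = new Regex(
        @"\b(?=(?:of|for|in|from)\b)",
        RegexOptions.IgnoreCase);

    // Splits before each keyword, trims the pieces, and drops empty ones
    // (an empty piece appears when the input itself starts with a keyword).
    public static string[] SplitOnKeywords(string input) =>
        SplitPoint.Split(input)
                  .Select(part => part.Trim())
                  .Where(part => part.Length > 0)
                  .ToArray();

    public static void Main()
    {
        string sampleString = "Total sales FOR pencils From India In 1999";
        foreach (string part in SplitOnKeywords(sampleString))
            Console.WriteLine(part);
    }
}
```

This still splits inside place names such as "Isle of Islay"; the regular expression alone cannot tell a keyword from part of a name.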

Upvotes: 5
