BlazingFrog

Reputation: 2415

Regex to extract multiple sentences while discarding a specific one

In: preferences = 'Hello my name is paul. I hate puzzles.'
I want to extract Hello my name is paul.

In: preferences = 'Salutations my name is richard. I love pizza. I hate rain.'
I want to extract Salutations my name is richard. I love pizza.

In: preferences = 'Hi my name is bob. I enjoy ice cream.'
I want to extract Hi my name is bob. I enjoy ice cream.

In other words, I would like to extract every sentence, but discard the last one if it contains the word hate.

My problem is that my regex stops at the first . and doesn't extract the subsequent sentences.

Thanks.

Upvotes: 2

Views: 484

Answers (4)

user557597

Reputation:

One of these might work -

Results in Match[1] buffer

preferences\s*=\s*'([^']*?)(?:(?<=[.'])[^.']*hate[^.']*\.\s*)?'

or

Results in Match[1] buffer

preferences\s*=\s*'([^']*?)(?=(?<=[.'])[^.']*hate[^.']*\.\s*'|')

or

(.Net only)
Results in Match[0] buffer

(?<=preferences\s*=\s*')[^']*?(?=(?<=[.'])[^.']*hate[^.']*\.\s*'|')

edit: This doesn't use \b around 'hate', nor the begin/end constructs ^ and $; feel free to add them if that's what you need. As a side note, it's puzzling how apostrophe and period are used to delimit a string variable that holds free-form text.
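For anyone who wants to try the first pattern from C#, here is a minimal sketch. The class name HateFilterDemo and the helper Extract are made up for illustration; only the pattern itself comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HateFilterDemo
{
    // First pattern from the answer; the wanted text lands in group 1.
    private static readonly Regex Pattern = new Regex(
        @"preferences\s*=\s*'([^']*?)(?:(?<=[.'])[^.']*hate[^.']*\.\s*)?'");

    // Hypothetical helper: returns the captured text, or null when the input does not match.
    public static string Extract(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // The three inputs from the question.
        Console.WriteLine(Extract("preferences = 'Hello my name is paul. I hate puzzles.'"));
        Console.WriteLine(Extract("preferences = 'Salutations my name is richard. I love pizza. I hate rain.'"));
        Console.WriteLine(Extract("preferences = 'Hi my name is bob. I enjoy ice cream.'"));
    }
}
```

The lazy group grows one character at a time, and the optional non-capturing group can only start right after a period or the opening quote, so a trailing "hate" sentence is consumed outside group 1.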

Upvotes: 0

Cheeso

Reputation: 192487

I did it with 2 regexes. The first strips the preferences = '...' wrapper, and the second eliminates any sentence containing the word "hate". The 2nd regex uses a positive lookbehind to match sentences that contain the keyword, which are then replaced with the empty string.

using System;
using System.Text.RegularExpressions;

String[] tests = {
    "preferences = 'Hello my name is Paul. I hate puzzles.'",
    "preferences = 'Salutations my name is Richard. I love pizza. I hate rain.'",
    "preferences = 'Hi my name is Bob. Regex turns me on.'"};
// re1 captures the text between the quotes
var re1 = new Regex("preferences = '(.*)'");
// re2 matches one sentence containing "hate"; the lookbehind stops at periods,
// so it only sees the current sentence, and the closing period is escaped
var re2 = new Regex("([^\\.]+(?<=\\bhate\\b[^\\.]*))\\.\\s*");

for (int i=0; i < tests.Length; i++)
{
    Console.WriteLine("{0}: {1}", i, tests[i]);
    var m = re1.Match(tests[i]);
    if (m.Success)
    {
        var s = m.Groups[1].ToString();
        s = re2.Replace(s,"");
        Console.WriteLine("   {0}", s);
    }
    Console.WriteLine();
}

This may not be exactly what you want, since you asked to eliminate only the last sentence if it contains the flag word. It's easy to adjust, though: to strip only the last sentence when it contains the word, append a $ to the end of re2.
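A rough sketch of the $ variant follows. Note one change that is my own assumption rather than part of the answer: the lookbehind is limited to [^.]* so that it cannot see "hate" in an earlier sentence. The names LastSentenceDemo and StripLast are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastSentenceDemo
{
    // Anchored with $ so only the final sentence can match.
    // Assumption: the lookbehind uses [^.]* so it only inspects the current sentence.
    private static readonly Regex LastHateSentence =
        new Regex(@"[^.]+(?<=\bhate\b[^.]*)\.\s*$");

    // Hypothetical helper: removes the last sentence if it contains "hate".
    public static string StripLast(string text)
    {
        return LastHateSentence.Replace(text, "");
    }

    public static void Main()
    {
        Console.WriteLine(StripLast("Hello my name is Paul. I hate puzzles."));
        // "hate" is not in the last sentence here, so the text is left alone.
        Console.WriteLine(StripLast("I hate rain. I love pizza."));
    }
}
```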

Upvotes: 1

Kobi

Reputation: 138017

You can achieve what you want using a regular expression:

^preferences\s*=\s*'(.*?\.)(?:[^.]*\bhate\b[^.]*\.)?'$

That one isn't too tricky:

  • (.*?\.) - Match your expected output, that will be captured in group $1. The pattern matches "sentences" (as you've defined), but lazily (*?), as few as it must.
  • (?:[^.]*\bhate\b[^.]*\.)? - optionally match the last sentence, but only if it contains "hate". If it can match, and it is the last sentence, the matching engine will not backtrack, and the last sentence will not be included in the captured group.

Here's a working example in Rubular: http://www.rubular.com/r/qTuMmB3ySj
(I've added \r\n in a few places, to avoid [^.] matching new lines)
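If you are calling this from .NET, the pattern can be used as in the sketch below. KobiPatternDemo and Extract are hypothetical names; the pattern is unchanged from the one above, and it assumes each input is a single line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KobiPatternDemo
{
    // Pattern from the answer: group 1 holds every sentence except a trailing "hate" sentence.
    private static readonly Regex Pattern = new Regex(
        @"^preferences\s*=\s*'(.*?\.)(?:[^.]*\bhate\b[^.]*\.)?'$");

    // Hypothetical helper: returns group 1, or null when the line does not match.
    public static string Extract(string line)
    {
        Match m = Pattern.Match(line);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Extract("preferences = 'Hello my name is paul. I hate puzzles.'"));
        Console.WriteLine(Extract("preferences = 'Salutations my name is richard. I love pizza. I hate rain.'"));
        Console.WriteLine(Extract("preferences = 'Hi my name is bob. I enjoy ice cream.'"));
    }
}
```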

Honestly though, you can probably do better than a single regular expression here, if you can avoid using one.

Upvotes: 2

John Bartels

Reputation: 2763

While this is not using RegEx, it will achieve what you are aiming for:

// Assumes: using System; using System.Collections.Generic; using System.Linq;
// and that preferences is a List<string> holding the raw "preferences = '...'" strings.
List<string> resultsList = new List<string>();

for (int i = 0; i < preferences.Count; i++)
{
    //creating the substring eliminates the "preferences = '" as well as the "'" at end of string
    //this statement also splits each string from the preferences string list into tempList
    List<string> tempList = preferences[i]
        .Substring(15, preferences[i].Length - 15 - 1)
        .Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries)
        .ToList();

    string buildFinalString = "";

    //traverse tempList and only add string to buildFinalString if it does not contain "hate" (ignoring case)
    foreach (string x in tempList)
    {
        if (x.IndexOf("hate", StringComparison.OrdinalIgnoreCase) < 0)
        {
            //Split removed the periods, so put one back after each kept sentence
            buildFinalString = buildFinalString + x + ".";
        }
    }
    resultsList.Add(buildFinalString);
}

Or, if you only wanted to check the last string in "tempList" for the word hate, that would also be possible...
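One possible way to do the last-sentence-only check without a regex is sketched here. It is not the answer's own code: the helper DropLastIfHate is hypothetical, and it assumes the input is the text already taken from between the quotes, with sentences ending in periods.

```csharp
using System;

public static class LastSentenceCheck
{
    // Hypothetical helper: drops the final sentence only when it contains "hate" (ignoring case).
    public static string DropLastIfHate(string text)
    {
        string trimmed = text.TrimEnd();

        // "end" is the index of the final period, or the string length if there is none.
        int end = trimmed.EndsWith(".", StringComparison.Ordinal) ? trimmed.Length - 1 : trimmed.Length;

        // "start" is just past the period that closes the previous sentence (0 if there is none).
        int start = end > 0 ? trimmed.LastIndexOf('.', end - 1) + 1 : 0;

        string last = trimmed.Substring(start, end - start);
        bool flagged = last.IndexOf("hate", StringComparison.OrdinalIgnoreCase) >= 0;

        // Keep everything before the last sentence when it is flagged, otherwise keep it all.
        return flagged ? trimmed.Substring(0, start) : trimmed;
    }

    public static void Main()
    {
        Console.WriteLine(DropLastIfHate("Hello my name is paul. I hate puzzles."));
        Console.WriteLine(DropLastIfHate("Hi my name is bob. I enjoy ice cream."));
    }
}
```

Unlike the loop above, this leaves a "hate" sentence in place when it is not the last one.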

Upvotes: 1
