Mudassir Hasan

Reputation: 28771

Extract matching text using regular expression

I have to extract a summary from a newspaper article. The summary is extracted based on a given keyword, according to the rules below.

  1. The summary should be 200 characters long.

  2. Start printing from the sentence in the article in which the keyword first appears, and print up to 200 characters.

  3. If the matching sentence occurs so close to the end of the article that the summary would be shorter than 200 characters, move back from the matching sentence into the previous sentences until 200 characters containing the matching sentence are printed.

What I have done until now is:

var regex = new Regex(keyword+@"(.{0,200})");

foreach (Match match in regex.Matches(input))
{
    var result = match.Groups[1].Value;
    Console.WriteLine(result);

    // work with the result
}

The above code successfully finds the first matching sentence, but it starts printing AFTER the keyword, for up to 200 characters, rather than from the beginning of the matching sentence.

Also, there is no backtracking if the end of the article is reached before 200 characters are printed.

Please guide me on how I should proceed. Even if you don't know the complete solution, PLEASE do help me out with parts of the question.

Upvotes: 2

Views: 242

Answers (3)

Zac Charles

Reputation: 1267

Is using regex a requirement? Here's a rough alternative if it's not:

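var found = input.IndexOf(keyword);
if (found >= 0)
{
    // Start right after the keyword; back up if fewer than 200 characters remain.
    var index = found + keyword.Length;
    index = Math.Max(0, Math.Min(index, input.Length - 200));
    Console.WriteLine(input.Substring(index, Math.Min(200, input.Length - index)));
}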

Upvotes: 0

YoryeNathan

Reputation: 14542

var nextIndex = input.IndexOf(keyword);

while (nextIndex != -1)
{
    var index = nextIndex;
    // To start the 200chars result from right after the keyword, do instead:
    // var index = nextIndex + keyword.Length;

    // If you want to stop after you reached the end of the text once:
    // var end = false;

    if (index + 200 >= input.Length)
    {
        index = input.Length - 200;

        // If you want to stop after you reached the end of the text once:
        // end = true;
    }

    var result = index < 0 ? input : input.Substring(index, 200);

    Console.WriteLine(result);

    // If you want to stop after you reached the end of the text once:
    // if (end) { break; }

    nextIndex = input.IndexOf(keyword, nextIndex + 1);
}

And if you want the search to be case-insensitive, just pass StringComparison.OrdinalIgnoreCase as an extra argument to both IndexOf calls.
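Building on the IndexOf approach above, here is a minimal sketch that also backs up to the start of the sentence containing the keyword before clamping the window to the end of the text. The class and method names (SummaryExtractor, Extract) are hypothetical, and the sketch assumes that a sentence ends with '.', '!' or '?', which will misfire on abbreviations and decimal numbers.

```csharp
using System;

public static class SummaryExtractor
{
    // Hypothetical helper, not part of the thread: returns up to `length`
    // characters starting at the sentence that contains `keyword`,
    // or an empty string when the keyword is not found.
    public static string Extract(string input, string keyword, int length = 200)
    {
        int hit = input.IndexOf(keyword, StringComparison.OrdinalIgnoreCase);
        if (hit < 0) return string.Empty;

        // Rule 2: walk back to the start of the sentence containing the keyword.
        // Assumption: sentences end with '.', '!' or '?'.
        int start = hit == 0
            ? 0
            : input.LastIndexOfAny(new[] { '.', '!', '?' }, hit - 1) + 1;
        while (start < hit && char.IsWhiteSpace(input[start])) start++;

        // Rule 3: if fewer than `length` characters remain, shift the window back.
        start = Math.Max(0, Math.Min(start, input.Length - length));

        // Rule 1: take at most `length` characters.
        return input.Substring(start, Math.Min(length, input.Length - start));
    }
}
```

Note that when the window is shifted back it moves by characters, so the summary may begin in the middle of an earlier sentence. If rule 3 is meant to move back by whole sentences, the clamping step would need to snap to a sentence boundary instead.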

Upvotes: 1

Murtuza Kabul

Reputation: 6524

Use this instead:

var regex = new Regex(@"(" + keyword + @".{0,200})");

This makes sure that the keyword is included in the captured group. Alternatively, you can keep your original pattern and read the whole match instead of the group:

var result = match.Value;

Also, you have specified {0,200}, which matches anywhere between 0 and 200 characters, so the match will be shorter than 200 characters if the end of the line or the end of the article is reached first (by default, . does not match a newline). Let me know exactly what you want to achieve in this regard.

If you want the expression to return the result from the start of the sentence, try this:

var regex = new Regex(@"\.(.+?" + keyword + @".*)");

But in this case you will have to remove the excess characters manually, because this regular expression tends to fetch more characters than you expect. It fetches everything from the beginning of the sentence containing the keyword to the end of the paragraph.
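As a rough illustration of that trimming step, here is a sketch of a regex variant. It differs from the pattern above in two ways, both of which are my assumptions rather than part of this answer: it uses [^.!?]* so that a keyword in the very first sentence is also found, and it wraps the keyword in Regex.Escape so that characters such as '.' or '+' in the keyword are matched literally. The names RegexSummary and FromSentenceStart are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSummary
{
    // Hypothetical helper: returns up to `length` characters starting at the
    // sentence that contains `keyword`, or an empty string when nothing matches.
    public static string FromSentenceStart(string input, string keyword, int length = 200)
    {
        // [^.!?]* consumes the part of the sentence before the keyword;
        // Singleline lets the trailing .* run across line breaks.
        var pattern = @"[^.!?]*" + Regex.Escape(keyword) + @".*";
        var match = Regex.Match(input, pattern, RegexOptions.Singleline);
        if (!match.Success) return string.Empty;

        // Drop the whitespace left over from the previous sentence,
        // then cut the excess characters manually.
        var text = match.Value.TrimStart();
        return text.Length <= length ? text : text.Substring(0, length);
    }
}
```

This sketch does not handle the case where the match lies too close to the end of the article; the result is simply shorter than 200 characters there.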

Upvotes: 0
