Reputation: 28771
I have to extract a summary from a newspaper article. The summary is extracted based on a given keyword, according to the rules below:
The summary should be 200 characters long.
Start printing from the sentence in which the keyword first appears, and print up to 200 characters.
If the matching sentence occurs so close to the end of the article that the summary would be shorter than 200 characters, move back from the matching sentence into the previous sentences until 200 characters containing the matching sentence are printed.
What I have done so far is:
var regex = new Regex(keyword + @"(.{0,200})");
foreach (Match match in regex.Matches(input))
{
    var result = match.Groups[1].Value;
    Console.WriteLine(result);
    // work with the result
}
The above code successfully finds the first matching sentence, but it prints up to 200 characters starting AFTER the keyword rather than from the beginning of the matching sentence.
Also, there is no backtracking if the end of the article is reached before 200 characters have been printed.
Please guide me on how I should proceed. Even if you don't know the complete solution, help with any part of the question would be appreciated.
Upvotes: 2
Views: 242
Reputation: 1267
Is using regex a requirement? Here's a rough alternative if it's not:
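var found = input.IndexOf(keyword);
if (found >= 0) // IndexOf returns -1 when the keyword is missing
{
    var index = found + keyword.Length; // starts right after the keyword; use "found" to include it
    var length = Math.Min(200, input.Length);
    index = Math.Min(index, input.Length - length); // near the end, shift the start back
    Console.WriteLine(input.Substring(index, length));
}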
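The question also asks for the summary to begin at the start of the matching sentence. A minimal sketch of that, assuming a sentence ends at a '.' character (abbreviations, '?' and '!' are ignored) and that "moving back" means moving back character by character rather than by whole sentences. SummaryExtractor, Extract and maxLength are hypothetical names used only for illustration:

```csharp
using System;

static class SummaryExtractor
{
    // Hypothetical helper, not part of the original answer.
    public static string Extract(string input, string keyword, int maxLength = 200)
    {
        int hit = input.IndexOf(keyword, StringComparison.Ordinal);
        if (hit < 0) return null; // keyword not found

        // Assumption: the sentence containing the keyword starts right after
        // the previous '.', or at the start of the text if there is none.
        int start = hit == 0 ? 0 : input.LastIndexOf('.', hit - 1) + 1;
        while (start < hit && char.IsWhiteSpace(input[start])) start++;

        int length = Math.Min(maxLength, input.Length);
        // Near the end of the article, move the window back so it stays full.
        start = Math.Min(start, input.Length - length);
        return input.Substring(start, length);
    }
}
```

Calling SummaryExtractor.Extract(input, keyword) returns null when the keyword is absent, and the whole text when the article is shorter than maxLength.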
Upvotes: 0
Reputation: 14542
var nextIndex = input.IndexOf(keyword);
while (nextIndex != -1)
{
    var index = nextIndex;
    // To start the 200-character result right after the keyword, use instead:
    // var index = nextIndex + keyword.Length;
    // If you want to stop once the end of the text has been reached:
    // var end = false;
    if (index + 200 >= input.Length)
    {
        // Too close to the end: move the start back so that 200 characters are available.
        index = input.Length - 200;
        // If you want to stop once the end of the text has been reached:
        // end = true;
    }
    // A negative index means the whole text is shorter than 200 characters.
    var result = index < 0 ? input : input.Substring(index, 200);
    Console.WriteLine(result);
    // If you want to stop once the end of the text has been reached:
    // if (end) { break; }
    nextIndex = input.IndexOf(keyword, nextIndex + 1);
}
And if you want the search to be case-insensitive, just pass StringComparison.OrdinalIgnoreCase as an additional argument to both IndexOf calls.
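As a rough illustration of the case-insensitive search, here is a small sketch that only collects the position of every occurrence. KeywordSearch and FindAll are hypothetical names; the IndexOf overloads taking a StringComparison are the standard .NET ones:

```csharp
using System;
using System.Collections.Generic;

static class KeywordSearch
{
    // Hypothetical helper: yields the start index of every occurrence
    // of keyword in input, ignoring case.
    public static IEnumerable<int> FindAll(string input, string keyword)
    {
        if (string.IsNullOrEmpty(keyword)) yield break; // nothing to search for

        int index = input.IndexOf(keyword, StringComparison.OrdinalIgnoreCase);
        while (index != -1)
        {
            yield return index;
            // Resume one character after the previous hit.
            index = input.IndexOf(keyword, index + 1, StringComparison.OrdinalIgnoreCase);
        }
    }
}
```

Each yielded index can then be fed into the same 200-character window logic as in the loop above.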
Upvotes: 1
Reputation: 6524
Use this instead:
var regex = new Regex(@"(" + keyword + @".{0,200})");
This makes sure that the keyword is also included in the captured group. Alternatively, keep your original pattern and read the whole match instead of the group:
var result = match.Value;
Also note that {0,200} matches anywhere between 0 and 200 characters, so if the end of the article (or a line break, which . does not match by default) comes first, the match simply stops there and is shorter than 200 characters. Let me know exactly what you want to achieve in that case.
If you want the expression to return the result from the start of the sentence, try this:
var regex = new Regex(@"([^.]*" + Regex.Escape(keyword) + @".*)");
But in this case you will have to remove the excess characters manually, as this regular expression tends to fetch more characters than you expect. It fetches everything from the beginning of the sentence containing the keyword (that is, from just after the previous full stop) to the end of the paragraph (the next line break).
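Removing the excess characters could look roughly like the sketch below. It is not the pattern from this answer verbatim: it uses a [^.]* prefix to back up to the previous full stop, escapes the keyword with Regex.Escape, and then cuts a fixed-size window with Substring. RegexSummary, Extract and maxLength are hypothetical names, and treating '.' as the only sentence terminator is an assumption:

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexSummary
{
    // Hypothetical helper, shown only to illustrate the trimming step.
    public static string Extract(string input, string keyword, int maxLength = 200)
    {
        // [^.]* backs up to the previous full stop; Regex.Escape keeps
        // characters such as '.' or '+' in the keyword literal.
        var regex = new Regex(@"[^.]*" + Regex.Escape(keyword));
        Match match = regex.Match(input);
        if (!match.Success) return null; // keyword not found

        int length = Math.Min(maxLength, input.Length);
        // Clamp the start so that a full window fits before the end of the text.
        int start = Math.Min(match.Index, input.Length - length);
        return input.Substring(start, length);
    }
}
```

Note that the window starts immediately after the previous full stop, so it usually begins with the space that follows it; call TrimStart on the result if that matters.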
Upvotes: 0