Reputation: 1018
I am using preg_match_all to search for a specified keyword in a string and, if it is found, pick a few words before and after that keyword. This is the call I am using:
preg_match_all('~\b(?:[^ ]+ ){0,'.$prev.'}'.trim($keyword).'(?: [^ ]+){0,'.$next.'}\b~i',$text,$output);
Here $keyword is the keyword, $prev and $next are the numbers of words to pick before and after it, $text is the main string, and $output is the resulting array. Suppose my string is:
PROFIT & LOSS NOFORMING P 152 22. ADDITIONAL INFORMATION: A) AUDITORS REMUNERATION (EXCLUDING SERVICE TAX) (` in crores) ParticularsCurrent yearPrevious year As audit fees (including limited review) 3.45 2.42
Here the keyword is "Audit Fee", and I get the desired output:
EXCLUDING SERVICE TAX) (` in crores) ParticularsCurrent yearPrevious year As audit fees (including limited review) 3.45 2.42
But in the string below, where there is no space between my keyword and the next word, it returns only the few words before the keyword and not the words after it.
PROFIT & LOSS NOFORMING P 152 22. ADDITIONAL INFORMATION: A) AUDITORS REMUNERATION (EXCLUDING SERVICE TAX) (` in crores) ParticularsCurrent yearPrevious year As audit fees(including limited review) 3.45 2.42
It just returns
EXCLUDING SERVICE TAX) (` in crores) ParticularsCurrent yearPrevious year As audit fees
How can I also get the following words when there is no space between my keyword and its next word?
Upvotes: 1
Views: 39
Reputation: 626861
If you are only worried about the words after the keyword, you need to match the whitespace (or non-word characters) after the keyword optionally, that is, zero or more times:
'~\b(?:\S+\s+){0,10}Audit Fees(?:\s*\S+){0,5}\b~i'
This makes the whitespace between the non-whitespace chunks after the keyword optional (\s* matches zero or more whitespace characters).
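To make the difference concrete, here is a minimal sketch that compares the two tails on a shortened sample. The sample string and the variable names $strict and $loose are illustrative assumptions, not part of the original post, and the leading context and word boundaries are left out to keep the comparison small.

```php
<?php
// Shortened sample (assumption): the keyword is glued to the next word.
$text = 'As audit fees(including limited review) 3.45';

// Tail from the question: each following word must be preceded by one literal space.
$strict = '~audit fee(?: [^ ]+){0,2}~i';

// Tail from the answer: the whitespace before each following chunk is optional.
$loose = '~audit fee(?:\s*\S+){0,2}~i';

preg_match($strict, $text, $a);
preg_match($loose, $text, $b);

echo $a[0], PHP_EOL; // stops right after the keyword
echo $b[0], PHP_EOL; // continues into the glued chunk and the next word
```

With this sample the first pattern matches only "audit fee", while the second matches "audit fees(including limited".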
Pattern details:
\b - leading word boundary
(?:\S+\s+){0,10} - zero to ten sequences of 1+ non-whitespace symbols followed with 1+ whitespaces
Audit Fees - literal keyword
(?:\s*\S+){0,5} - zero to five sequences of 0+ whitespace symbols followed with 1+ non-whitespace symbols
\b - trailing word boundary
$prev = 10;
$keyword = "Audit Fee";
$next = 5;
$text= "PROFIT & LOSS NOFORMING P 152 22. ADDITIONAL INFORMATION: A) AUDITORS REMUNERATION (EXCLUDING SERVICE TAX) (` in crores) ParticularsCurrent yearPrevious year As audit fees(including limited review) 3.45 2.42";
$re = '~\b(?:\S+\s+){0,'.$prev.'}'.trim($keyword).'(?:\s*\S+){0,'.$next.'}\b~i';
preg_match_all($re,$text,$output);
print_r($output);
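If the keyword comes from user input, it may contain characters that are special in a regex, such as parentheses or a dot. The sketch below wraps the same pattern in a helper and escapes the keyword with preg_quote. The function name keywordContext and the sample text are hypothetical and only serve the illustration.

```php
<?php
/**
 * Hypothetical helper (not from the original post): returns each occurrence
 * of $keyword with up to $prev words before it and up to $next chunks after it.
 */
function keywordContext(string $text, string $keyword, int $prev, int $next): array
{
    // preg_quote() escapes regex metacharacters in the keyword; the second
    // argument additionally escapes the "~" delimiter used below.
    $quoted = preg_quote(trim($keyword), '~');

    $re = '~\b(?:\S+\s+){0,' . $prev . '}' . $quoted
        . '(?:\s*\S+){0,' . $next . '}\b~i';

    preg_match_all($re, $text, $matches);

    // Index 0 holds the full matches; the pattern has no capturing groups.
    return $matches[0];
}

// Sample text (assumption) with a glued keyword and a keyword in parentheses.
$text = 'Total (net) 3.45 paid as audit fees(including limited review) 2.42';

print_r(keywordContext($text, 'Audit Fee', 2, 2));
print_r(keywordContext($text, '(net)', 1, 1));
```

Without the escaping, a keyword such as "(net)" would be read as a capturing group rather than as literal parentheses.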
Upvotes: 1