meghana

Reputation: 907

Regex Fragment for getting highlight

I want Solr highlighting in a specific format.

Below is the string format for which I need to provide highlighting:

130s: LISTEN! LISTEN! 138s: [THUMP] 143s: WHAT IS THAT? 144s: HEAR THAT?
152s: EVERYBODY, SHH. SHH. 156s: STAY UP THERE. 163s: [BOAT CREAKING] 165s:
WHAT IS THAT? 167s: [SCREAMING] 191s: COME ON! 192s: OH, GOD! 193s: AAH!
249s: OK. WE'VE HAD SOME PROBLEMS 253s: AT THE FACILITY. 253s: WHAT WE'RE
ATTEMPTING TO ACHIEVE 256s: HERE HAS NEVER BEEN DONE. 256s: WE'RE THIS CLOSE
259s: TO THE REACTIVATION 259s: OF A HUMAN BRAIN CELL. 260s: DOCTOR, THE 200
MILLION 264s: I'VE SUNK INTO THIS COMPANY 264s: IS DUE IN GREAT PART 266s:
TO YOUR RESEARCH.

After the user searches, I want to return a fragment in the following format:

Previous Line of Highlight + Line containing Highlight + Next Line of
Highlight

For example, if the user searched for the term hear, a typical highlight fragment should look like this:

<str>143s: WHAT IS THAT? 144s: <em>HEAR</em> THAT? 152s: EVERYBODY, SHH.
SHH.</str>

That is my ultimate plan, but right now I am just trying to get a fragment that starts with ns:, where n is a number between 0 and 9999.

I use hl.regex.slop=0.6 and hl.fragsize=120, and below is the regex I use:

\b(?=\s*\d{1,4}s:){50,200} 

With the above regular expression, my fragments do not always start with ns:.

Please suggest how I can achieve the ultimate plan.

Thanks

Upvotes: 2

Views: 583

Answers (1)

DWright

Reputation: 9500

You might be able to greatly simplify your approach (a much less complicated regex would be required) by temporarily splitting the text you are searching into lines at every ns: marker.

Example

130s: LISTEN! LISTEN!
138s: [THUMP]
143s: WHAT IS THAT?
144s: HEAR THAT?
152s: EVERYBODY, SHH. SHH.
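A minimal sketch of that split, in Python rather than inside Solr (an assumption, since the question does not say where post-processing could run). The function name `split_captions` is hypothetical, and the code assumes every caption starts with one to four digits followed by `s:`:

```python
import re

def split_captions(text):
    """Split a transcript into one string per 'ns:' caption."""
    # Collapse the hard line wraps first so no caption is cut in half.
    flat = " ".join(text.split())
    # Split on whitespace that is immediately followed by a timestamp.
    # The lookahead keeps the timestamp at the start of the next caption.
    return re.split(r"\s+(?=\d{1,4}s:)", flat)

sample = (
    "130s: LISTEN! LISTEN! 138s: [THUMP] 143s: WHAT IS THAT? 144s: HEAR THAT?\n"
    "152s: EVERYBODY, SHH. SHH."
)
captions = split_captions(sample)
```

A number such as the 200 in "THE 200 MILLION" is left alone because it is not followed by `s:`.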

Then do the regex search, which gets simpler:

(^\d{1,4})(s: .*?)(SEARCHPATTERN)(.*)

Then grab the preceding line and the following line (in this case SEARCHPATTERN is HEAR). To make finding the preceding and following line quicker (without having to backtrack and search forward), you could populate a hashmap with all the \d{1,4} line beginnings keyed to their line numbers.

hashmap with line numbers (my notation is conceptual only)

"130" => 1
"138" => 2
"143" => 3
"144" => 4
"152" => 5

Your regex tells you that the search word is on the line beginning with 144 (group 1 in regex), which your hashmap tells you is line 4, so you know that you have to get lines 3 and 5 in addition to the groups matched by the regex.

Result = <str>line3 + \1 + \2 + <em>\3</em>\4 + line5</str>

Note: I'm not a Solr user, so my regular expression syntax and the example result string should be taken as the general idea only. I don't know whether Solr has its own notation.
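The whole idea can be sketched in Python as follows. This is an illustration under assumptions, not Solr configuration: the function name `highlight_fragments` is hypothetical, matching is whole-word and case-insensitive by choice, and the neighbours are looked up by list position instead of a timestamp-keyed hashmap, because the sample text repeats some timestamps (for example 253s: and 256s:) and those keys would collide.

```python
import re

def highlight_fragments(text, term):
    """Return one '<str>previous + hit + next</str>' fragment per matching caption."""
    # Undo the hard line wraps, then split before every 'ns:' timestamp.
    flat = " ".join(text.split())
    captions = re.split(r"\s+(?=\d{1,4}s:)", flat)

    # Assumption: whole-word, case-insensitive matching of the search term.
    word = re.compile(r"\b(" + re.escape(term) + r")\b", re.IGNORECASE)

    fragments = []
    for i, caption in enumerate(captions):
        if not word.search(caption):
            continue
        marked = word.sub(r"<em>\1</em>", caption)
        # Slices handle the first and last caption, which lack one neighbour.
        window = captions[max(i - 1, 0):i] + [marked] + captions[i + 1:i + 2]
        fragments.append("<str>" + " ".join(window) + "</str>")
    return fragments

sample = (
    "130s: LISTEN! LISTEN! 138s: [THUMP] 143s: WHAT IS THAT? 144s: HEAR THAT?\n"
    "152s: EVERYBODY, SHH. SHH. 156s: STAY UP THERE."
)
print(highlight_fragments(sample, "hear")[0])
# <str>143s: WHAT IS THAT? 144s: <em>HEAR</em> THAT? 152s: EVERYBODY, SHH. SHH.</str>
```

Working from list positions also removes the need to backtrack through the text, which was the purpose of the hashmap.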

Upvotes: 1
