Reputation: 29
Is there a way to extract the text surrounding anchor tags in an HTML page? I am working in Java, and my research requires extracting the text in and around these tags. I have searched, but all I've found are regular expressions that extract only the anchor text, not the words around it.
Upvotes: 1
Views: 374
Reputation:
Regex is not the right tool for parsing HTML, but if you have to have a regex, here is a quick and dirty one:
"([^<>]*)<a>([^<>]*)</a>([^<>]*)"
( [^<>]* ) # (1)
<a>
( [^<>]* ) # (2)
</a>
( [^<>]* ) # (3)
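Since the question is about Java, here is a minimal sketch of how the three groups could be read with java.util.regex. The class name AnchorContext, the method extract, and the sample HTML are made up for illustration. The pattern is kept exactly as above, so it only matches a bare <a> with no attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class AnchorContext {
    // Same three-group pattern as above: (1) text before, (2) anchor text, (3) text after.
    private static final Pattern ANCHOR =
            Pattern.compile("([^<>]*)<a>([^<>]*)</a>([^<>]*)");

    /** Returns one {before, anchorText, after} triple per bare <a>...</a> found. */
    static List<String[]> extract(String html) {
        List<String[]> results = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            results.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return results;
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        String html = "<p>Read the <a>manual</a> before asking.</p>";
        for (String[] r : extract(html)) {
            System.out.println("before=[" + r[0] + "] anchor=[" + r[1] + "] after=[" + r[2] + "]");
        }
        // prints: before=[Read the ] anchor=[manual] after=[ before asking.]
    }
}
```

Two caveats. Real anchors usually carry an href, so the opening tag would probably need to be something like <a\b[^>]*> instead of <a> (that variant is my assumption, not tested against your pages). Also, when two anchors sit in the same run of text, the "after" text of the first match is consumed, so it will not show up again as the "before" text of the second.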
"is there a way to provide the number of characters before and after the anchor text
"?
Sure. You can supply either a min/max range {m,n}, or an exact count {n}, or a mixture of the two.
Example:
Before = 5, after = 5 to 10
"([^<>]{5})<a>([^<>]*)</a>([^<>]{5,10})"
Before = 1 or more (no upper limit), after = 0 to 10
"([^<>]{1,})<a>([^<>]*)</a>([^<>]{0,10})"
And there are many other possible variations, including mixing literals in as well.
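If the window sizes need to vary, the pattern can be assembled from the bounds at run time. This is only a sketch: the class BoundedAnchorContext, the method build, and the sample HTML are hypothetical names chosen for the example, and it still assumes a bare <a> tag as in the patterns above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class BoundedAnchorContext {
    /**
     * Builds the three-group pattern with {min,max} bounds on the number of
     * non-tag characters captured before and after the anchor.
     */
    static Pattern build(int beforeMin, int beforeMax, int afterMin, int afterMax) {
        String before = "([^<>]{" + beforeMin + "," + beforeMax + "})";
        String after = "([^<>]{" + afterMin + "," + afterMax + "})";
        return Pattern.compile(before + "<a>([^<>]*)</a>" + after);
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        String html = "<p>Read the <a>manual</a> before asking.</p>";
        // Before = exactly 5, after = 5 to 10, as in the first example above.
        Matcher m = build(5, 5, 5, 10).matcher(html);
        while (m.find()) {
            System.out.println("before=[" + m.group(1) + "] anchor=[" + m.group(2)
                    + "] after=[" + m.group(3) + "]");
        }
        // prints: before=[ the ] anchor=[manual] after=[ before as]
    }
}
```

Keep in mind that the minimums are hard requirements: if fewer than the minimum number of characters are available on either side of the anchor, that anchor is not matched at all. Use a minimum of 0 if you want "up to n characters" instead.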
Upvotes: 1