Reputation: 29

Regular expression to extract text surrounding an anchor tag from an HTML page

Is there a way to extract the text surrounding anchor tags in an HTML page? I am working in Java, and my research requires extracting the text inside and around those tags. I have searched, but all I have found are regular expressions that extract only the anchor text, not the words around it.

Upvotes: 1

Answers (1)

user557597

Regex is not the right tool for parsing HTML, but if you have to have a regex, here is a quick and dirty one:

"([^<>]*)<a>([^<>]*)</a>([^<>]*)"

 ( [^<>]* )         # (1)
 <a>
 ( [^<>]* )         # (2)
 </a>
 ( [^<>]* )         # (3)
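
For the Java side, here is a minimal sketch of how the three groups could be read out with java.util.regex. The class name AnchorContext, the method extract and the sample HTML string are made up for illustration; only the pattern comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class AnchorContext {
    // Same three-group pattern as above. It only matches a bare <a> with no attributes.
    private static final Pattern ANCHOR =
            Pattern.compile("([^<>]*)<a>([^<>]*)</a>([^<>]*)");

    /** Returns one {before, anchorText, after} triple per match, in document order. */
    static List<String[]> extract(String html) {
        List<String[]> hits = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            hits.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<p>Read the <a>manual</a> before use.</p>";
        for (String[] hit : extract(html)) {
            // prints: before=[Read the ] text=[manual] after=[ before use.]
            System.out.println("before=[" + hit[0] + "] text=[" + hit[1] + "] after=[" + hit[2] + "]");
        }
    }
}
```

Two caveats with this sketch. Real anchors usually carry attributes such as href, so the literal <a> would probably need to become something like <a\b[^>]*> (written "<a\\b[^>]*>" in a Java string). And when two anchors sit in the same run of text, the "after" group of the first match consumes the text that would have been the "before" group of the second, so the second match reports an empty group 1.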

"Is there a way to provide the number of characters before and after the anchor text?"

Sure. Use a quantifier on the surrounding groups: {m,n} for a minimum and maximum, {n} for an exact count, or a mixture of the two.
Examples:

Before = 5, after = 5 to 10
"([^<>]{5})<a>([^<>]*)</a>([^<>]{5,10})"

Before = 1 to no-limit, after = 0 to 10
"([^<>]{1,})<a>([^<>]*)</a>([^<>]{0,10})"

And there are many other possible variations, including mixing literals in as well.
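
If the character counts need to vary at runtime, the quantifiers can be filled in when the pattern is built. This is a rough sketch only: BoundedAnchorContext, withWindow and first are hypothetical names, and the code assumes the caller passes sensible values (0 <= afterMin <= afterMax, before >= 0), since Pattern.compile rejects an invalid quantifier.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class BoundedAnchorContext {
    /** Builds the "exactly `before` chars, then afterMin..afterMax chars" variant of the pattern. */
    static Pattern withWindow(int before, int afterMin, int afterMax) {
        return Pattern.compile(
                "([^<>]{" + before + "})<a>([^<>]*)</a>([^<>]{" + afterMin + "," + afterMax + "})");
    }

    /** Returns {before, anchorText, after} for the first match, or empty if nothing matches. */
    static Optional<String[]> first(String html, Pattern pattern) {
        Matcher m = pattern.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2), m.group(3) });
    }

    public static void main(String[] args) {
        Pattern p = withWindow(5, 5, 10); // before = 5, after = 5 to 10
        first("<p>Read the <a>manual</a> before use.</p>", p).ifPresent(hit ->
                // prints: before=[ the ] text=[manual] after=[ before us]
                System.out.println("before=[" + hit[0] + "] text=[" + hit[1] + "] after=[" + hit[2] + "]"));
    }
}
```

Note that an exact count is a hard requirement: if fewer than 5 non-tag characters precede the anchor, the {5} variant does not match at all. A range such as {0,5} is the safer choice when short contexts should still be reported.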

Upvotes: 1
