Fran Rodriguez

Reputation: 1

Get text between HTML tags

Possible duplicate: RegEx matching HTML tags and extracting text

I need to get the text between HTML tags such as <p></p>, or any other tag. My current pattern is:

Pattern pText = Pattern.compile(">([^>|^<]*?)<");

Does anyone know a better pattern? This one is not very useful. I need it to index the content of a web page.

Thanks

Upvotes: 0

Views: 2561

Answers (3)

Guffa

Reputation: 700222

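It looks like you are trying to use the | operator inside a negated character class, which neither works nor is needed. Inside [...] the | is just a literal character, and so is the second ^, so your class also excludes those two characters. Just list the characters that you don't want to match: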

Pattern pText = Pattern.compile(">([^<>]*?)<");
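
As a rough sketch of how that pattern could be applied, the loop below walks over every match and keeps the non-blank captures. The class name TagTextExtractor, the method extract and the sample input are made up for illustration and are not from the question. Note that this still treats script and style contents as text, and it can be thrown off by a > inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class TagTextExtractor {
    // Same pattern as above: the text between a '>' and the next '<'.
    private static final Pattern P_TEXT = Pattern.compile(">([^<>]*?)<");

    // Collects every non-blank run of text found between two tags.
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = P_TEXT.matcher(html);
        while (m.find()) {
            String text = m.group(1).trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Hello, big, world]
        System.out.println(extract("<p>Hello</p><div>big <b>world</b></div>"));
    }
}
```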

Upvotes: 3

Welbog

Reputation: 60398

Don't use regular expressions when parsing HTML.

Use XPath instead (if your HTML is well-formed). You can reference text nodes very easily with the text() function.
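
A minimal sketch of that approach using only the JDK's built-in XML and XPath classes, assuming the page is well-formed XHTML (real-world HTML often is not, and the parser will then reject it). The class name XPathTextExtractor, the method extract and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
public class XPathTextExtractor {
    // Returns the trimmed, non-blank text nodes of a well-formed document.
    static List<String> extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // "//text()" selects every text node anywhere in the document.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//text()", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getNodeValue().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
            return result;
        } catch (Exception e) {
            // Malformed input ends up here; a real indexer would handle it properly.
            throw new IllegalArgumentException("Input is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Hello</p><div>big <b>world</b></div></body></html>";
        // prints [Hello, big, world]
        System.out.println(extract(page));
    }
}
```

To restrict the result to one kind of element, the expression could be narrowed, for example to //p/text().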

Upvotes: 2

danben

Reputation: 83230

SO is about to descend on you. But let me be the first to say, don't use regular expressions to parse HTML. Here is a list of Java HTML Parsers. Look around until you see an API that suits your fancy and use that instead.

Upvotes: 5
