M_K

Reputation: 3455

Regex pattern matching with HTML tags

This is only for a small Android program I am messing with, so I only need to match one or two tags.

I have one HTML tag whose contents, "FC Cologne", I can already extract. I use this code to get it:

Pattern pattern = Pattern.compile("report\">(.*?)</a>",Pattern.MULTILINE);

Here is the HTML tag I can get to work:

<a href="/match-menu/3405570/first-team/fc-cologne=report"> FC Cologne</a>

But I can't match the next tag. I don't know whether that is because of the space after the word "opposition", the quotes inside the tag's contents, or both, since neither appears in the first tag.

This is the one I can't get to work:

<td class="bold opposition "> "Olympiacos" </td>

This is the code I am trying:

Pattern pattern = Pattern.compile("opposition \">(.*?)</td>",Pattern.MULTILINE);

I have tried replacing the spaces with an empty string, and I have tried putting \s where the space is, but I get no match either way.

I would appreciate it if anyone could help me.

Upvotes: 0

Views: 2553

Answers (2)

Tyler Crompton

Reputation: 12662

I believe this is what you're looking for:

<(\w+)\s*(?:\w+(?:=(?:'(?:[^']|(?<=\\)')*'|"(?:[^"]|(?<=\\)")*"))?\s*)*>(.*?)</\1\s*>

Use the second group to get the contents of the tag (the first group is the tag name). Note that this does not work recursively: nested elements are captured as raw text in the second group, so you will need to apply the regex again to the second group of each match until there are no more matches.
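The repeated application described above might look something like the following in Java. This is only a rough sketch: the class name `TagContents` and the helper `innermostText` are made up for illustration, and the loop follows only the first element it finds at each level, so sibling elements are ignored.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagContents {
    // The regex from this answer, escaped as a Java string literal.
    // Group 1 is the tag name, group 2 is everything between the tags.
    private static final Pattern TAG = Pattern.compile(
        "<(\\w+)\\s*(?:\\w+(?:=(?:'(?:[^']|(?<=\\\\)')*'|\"(?:[^\"]|(?<=\\\\)\")*\"))?\\s*)*>(.*?)</\\1\\s*>");

    // Hypothetical helper: re-applies the regex to group 2 until nothing
    // matches, following only the first element found at each level.
    static String innermostText(String html) {
        String current = html;
        Matcher m = TAG.matcher(current);
        while (m.find()) {
            current = m.group(2);      // contents of the matched element
            m = TAG.matcher(current);  // look for a nested element inside it
        }
        return current.trim();
    }

    public static void main(String[] args) {
        String cell = "<td class=\"bold opposition \"> \"Olympiacos\" </td>";
        String nested = "<tr><td class=\"x\">A</td></tr>";
        System.out.println(innermostText(cell));
        System.out.println(innermostText(nested));
    }
}
```

The literal double quotes around Olympiacos are part of the tag's contents, so they remain in the result and would have to be stripped separately.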

Upvotes: 0

QuinnG

Reputation: 6424

Unless you have a typo in one of the two, < /td> has a space after the < and the </td> in your regex doesn't.

Adding a space after the < in the regex caused the match to succeed in RegexBuddy.

Update: It seems the space is not in the tag the OP is working with.

In RegexBuddy I have this pattern (copied as a Java string):

"opposition \">(.*?)</td>"

which matches the html

< td class="bold opposition "> "Olympiacos"       </td>

giving a match of

opposition "> "Olympiacos"       </td>

and Group 1 of

 "Olympiacos"       <--Line ends there.
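To reproduce the RegexBuddy result in plain Java, something along these lines should work. It is a sketch under a couple of assumptions: the class name `OppositionMatch` and the helper `cleaned` are invented here, and `Pattern.DOTALL` is swapped in for the question's `Pattern.MULTILINE`. `MULTILINE` only changes how `^` and `$` behave, whereas `DOTALL` lets `.` cross line breaks, which may matter if the real page puts a newline between the tag and the team name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OppositionMatch {
    // Same pattern string as in the question. DOTALL is an assumption added
    // here so that (.*?) can also span line breaks inside the cell.
    private static final Pattern OPPOSITION =
        Pattern.compile("opposition \">(.*?)</td>", Pattern.DOTALL);

    // Returns the raw contents of group 1, or null when there is no match.
    static String rawGroup(String html) {
        Matcher m = OPPOSITION.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Hypothetical helper: trims whitespace, then removes one leading and
    // one trailing literal double quote if present.
    static String cleaned(String html) {
        String raw = rawGroup(html);
        if (raw == null) {
            return null;
        }
        return raw.trim().replaceAll("^\"|\"$", "");
    }

    public static void main(String[] args) {
        String html = "<td class=\"bold opposition \"> \"Olympiacos\"       </td>";
        System.out.println("[" + rawGroup(html) + "]"); // brackets make the padding visible
        System.out.println(cleaned(html));
    }
}
```

If this still finds nothing against the live page, it may be worth logging the exact HTML string the app receives, since the markup shown in a browser's inspector can differ from the raw source.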

Upvotes: 2
