Daniel

Reputation: 2500

Need help with regex to extract data inside tags

I have been struggling for some time to write a regex that fits the HTML below. I'm using the java.util.regex.* package and, for various reasons, I need to stick with it rather than use a third-party library.

What I want is to extract the text inside the <span> tags. For this particular HTML, that is: 25 / 25, Lindhagen, 0, Spinninghall, 35 and Test Person.

Is it possible to create a regex for this?

<div id="rsv_detail">
  <hr />

  <label>Bokningsstatus</label>
  <span>&nbsp;</span>

  <label>Bokningar</label>

  <span>25 / 25 &nbsp;</span>

  <br />

  <label>Plats</label>
  <span>Lindhagen&nbsp;</span>

  <label>Anlänt</label>
  <span>0&nbsp;</span>

  <br />

  <label>Sal</label>
  <span>Spinninghall&nbsp;</span>

  <label>Max antal</label>
  <span>35&nbsp;</span>
  <br />

  <label>Ledare</label>

  <span>Test Person&nbsp;</span>
  <br /><br />


  <label>Visa mer</label>
  <span>      
    <a href="/index.php?instructors%5B%5D=X129518&amp;func=la&amp;tak=0.36507500+1302460619">Ledare</a>
    <a href="/index.php?locations=LI&amp;func=la&amp;tak=0.36507500+1302460619">Plats</a>
    <a href="/index.php?activities=SP_MEDEL&amp;func=la&amp;tak=0.36507500+1302460619">Aktivitet</a>

  </span>
  <br /><br />

  <br />
  <br />
  <hr />
</div>

Upvotes: 0

Views: 1005

Answers (4)

eyquem

Reputation: 27575

'<span>(.*?)&nbsp;</span>' as a regex will do, won't it?

Upvotes: 0

Alan Moore

Reputation: 75222

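// needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
// text is the String holding the HTML
Pattern p = Pattern.compile("<span>([^<&]+)&nbsp;</span>");
Matcher m = p.matcher(text);
while (m.find())
{
  // group(1) is the span content up to, but not including, &nbsp;
  System.out.println(m.group(1));
}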

output:

25 / 25
Lindhagen
0
Spinninghall
35
Test Person

This assumes the target <span> always ends with &nbsp; and never contains any other entities or elements. The empty first <span> and the one holding the links are skipped for that reason.
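
For reference, here is the same loop wrapped into a small self-contained class. This is only a sketch: the class name SpanExtractor, the method name extract and the trim() call are illustrative assumptions, not part of the snippet above. trim() is there because the capture for "25 / 25 &nbsp;" would otherwise keep its trailing space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this sketch.
class SpanExtractor {
    // Text without '<' or '&', followed directly by &nbsp;</span>.
    private static final Pattern SPAN =
            Pattern.compile("<span>([^<&]+)&nbsp;</span>");

    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            // trim() drops the space that precedes &nbsp; in "25 / 25 &nbsp;"
            values.add(m.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<label>Bokningsstatus</label>\n<span>&nbsp;</span>\n"
                + "<label>Bokningar</label>\n<span>25 / 25 &nbsp;</span>\n"
                + "<label>Plats</label>\n<span>Lindhagen&nbsp;</span>\n";
        // The empty first span does not match because [^<&]+ needs at least one character.
        System.out.println(extract(html)); // prints [25 / 25, Lindhagen]
    }
}
```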

Upvotes: 1

user unknown

Reputation: 36229

If you first filter out every line that doesn't open and close the span tag on the same line, you can use:

filtered.replaceAll("<span>([^<]*)</span>", "$1")
  .replaceAll("&nbsp;", "")

The parentheses form a capturing group. Groups are numbered from left to right by their opening parenthesis; here there is just one, hence $1. After the opening tag, [^<]* reads everything that is not (^) a less-than sign, so it stops at the < that you expect to start the closing tag.
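
A minimal sketch of the filter step plus the two replaceAll calls could look like this. The class name SpanLineFilter, the line splitting and the skipping of empty results are assumptions made for the example; only the two replaceAll calls come from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is an assumption for this sketch.
class SpanLineFilter {
    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        for (String line : html.split("\\r?\\n")) {
            String t = line.trim();
            // Filter: keep only lines that open and close the span on the same line.
            if (!t.matches("<span>[^<]*</span>")) {
                continue;
            }
            String value = t.replaceAll("<span>([^<]*)</span>", "$1")
                            .replaceAll("&nbsp;", "")
                            .trim();
            // A span holding nothing but &nbsp; ends up empty and is skipped.
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }
}
```

With the HTML from the question, the multi-line "Visa mer" span is dropped by the filter, because its opening and closing tags sit on different lines.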

However, in most cases I would agree with stema and Hovercraft Full Of Eels. The pitfalls of using regex on HTML are:

  • Opening and closing tags are hard to match with a regex if they span multiple lines, and even harder if they are nested.
  • Tags inside comments are hard to detect.

There are, however, rare cases where regexes are useful:

  • One-time jobs, where you can inspect all of the input in advance.
  • Generated HTML that always looks the same, for example from routers or Javadoc.
  • HTML that you generate yourself with your program in mind.

Upvotes: 0

Hovercraft Full Of Eels

Reputation: 285405

As far as I know, the best way to extract information from HTML is to use an HTML parser, or to convert the HTML to XHTML and extract the data with standard XML techniques. Why can't you use third-party libraries?
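
If the markup is already well-formed, as the fragment in the question appears to be, the XML route can be sketched with JDK classes only. This is an assumption-laden example, not a general HTML solution: the class name SpanXPath is made up, &nbsp; is swapped for a numeric reference because it is not a predefined XML entity, and real-world HTML would usually need converting to XHTML first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name is an assumption for this sketch.
class SpanXPath {
    static List<String> extract(String xhtml) {
        try {
            // &nbsp; is undefined in plain XML, so use the numeric reference instead.
            String xml = xhtml.replace("&nbsp;", "&#160;");
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // Select span elements that have no child elements (skips the link list).
            NodeList spans = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//span[not(*)]", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < spans.getLength(); i++) {
                // Turn the non-breaking space into a normal space so trim() removes it.
                String text = spans.item(i).getTextContent()
                        .replace('\u00A0', ' ').trim();
                if (!text.isEmpty()) {
                    values.add(text);
                }
            }
            return values;
        } catch (Exception e) {
            // Malformed input ends up here; a real program would handle this more carefully.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Unlike the regex versions, this does not care whether a span is written on one line or spread over several.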

Upvotes: 4
