Reputation: 2500
I have been struggling for some time to write a regex for the HTML below. I'm using the java.util.regex.* package, and for various reasons I need to use this package rather than any third-party library.
What I want is to extract the text inside the <span> tags, so the data I want from this particular HTML is 25 / 25, Lindhagen, 0, Spinninghall, 35 and Test Person.
Is it possible to create a regex for this?
<div id="rsv_detail">
<hr />
<label>Bokningsstatus</label>
<span> </span>
<label>Bokningar</label>
<span>25 / 25 </span>
<br />
<label>Plats</label>
<span>Lindhagen </span>
<label>Anlänt</label>
<span>0 </span>
<br />
<label>Sal</label>
<span>Spinninghall </span>
<label>Max antal</label>
<span>35 </span>
<br />
<label>Ledare</label>
<span>Test Person </span>
<br /><br />
<label>Visa mer</label>
<span>
<a href="/index.php?instructors%5B%5D=X129518&func=la&tak=0.36507500+1302460619">Ledare</a>
<a href="/index.php?locations=LI&func=la&tak=0.36507500+1302460619">Plats</a>
<a href="/index.php?activities=SP_MEDEL&func=la&tak=0.36507500+1302460619">Aktivitet</a>
</span>
<br /><br />
<br />
<br />
<hr />
</div>
Upvotes: 0
Views: 1005
Reputation: 75222
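import java.util.regex.Matcher;
import java.util.regex.Pattern;

// text holds the HTML from the question as a String
Pattern p = Pattern.compile("<span>([^<&]+) </span>");
Matcher m = p.matcher(text);
while (m.find()) {
    // group 1 is the span's text without the trailing character
    System.out.println(m.group(1));
}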
output:
25 / 25
Lindhagen
0
Spinninghall
35
Test Person
This assumes the target <span> always ends with &nbsp; (shown as a trailing space in the listing above), and never contains any other entities or elements.
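If that trailing character cannot be relied on, the assumption can be loosened a little. The following is only a sketch, not part of the original answer: the class name SpanText and the method extract are made up for illustration, and it assumes the trailing character may be the &nbsp; entity, ordinary whitespace, a literal U+00A0, or missing altogether. Spans that contain child elements (such as the block of links) are still skipped, and other entities inside the text are left undecoded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
class SpanText {
    // Group 1: the span text (lazy). After it, any run of "&nbsp;",
    // whitespace or U+00A0 is consumed before </span>, so the trailing
    // character becomes optional.
    private static final Pattern SPAN =
            Pattern.compile("<span>([^<]*?)(?:&nbsp;|[\\s\\u00A0])*</span>");

    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            String value = m.group(1).trim();
            // Assumption: empty spans (e.g. "Bokningsstatus") are not wanted.
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<span>25 / 25&nbsp;</span>\n<span>Lindhagen </span>\n<span> </span>";
        System.out.println(extract(html));
    }
}
```

Because [^<] cannot cross a tag, a <span> that wraps <a> elements never matches, which is the same restriction the original pattern has.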
Upvotes: 1
Reputation: 36229
If you filter out each line which doesn't open and close the span-tag in the same line, you can use:
filtered.replaceAll ("<span>([^<]*)</span>", "$1")
.replaceAll ("&nbsp;", "")
The parentheses form a capturing group, which you can reference later by number, counting opening parentheses from left to right; here there is only one, hence $1. After the opening tag, [^<]* reads everything that is not (^) a less-than sign, so it stops at the less-than sign that is expected to start the closing tag.
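Put together, the filtering step and the replacement might look roughly like the sketch below. It is an illustration under assumptions rather than a definitive implementation: the class name SpanLineFilter and the method extract are hypothetical, the input is assumed to have one tag per line as in the question, and the trailing character is assumed to be either the &nbsp; entity or a space / U+00A0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the "filter lines, then replaceAll" idea.
class SpanLineFilter {
    // A line that opens and closes one <span> itself, with no nested tags.
    private static final Pattern ONE_LINE_SPAN =
            Pattern.compile("\\s*<span>[^<]*</span>\\s*");

    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        for (String line : html.split("\\R")) {
            // Filter step: drop every line that is not a complete span.
            if (!ONE_LINE_SPAN.matcher(line).matches()) {
                continue;
            }
            String value = line
                    .replaceAll("<span>([^<]*)</span>", "$1") // keep group 1 only
                    .replace("&nbsp;", "")      // assumed entity form
                    .replace('\u00A0', ' ')     // assumed literal non-breaking space
                    .trim();
            // Assumption: empty spans are of no interest.
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<label>Plats</label>\n<span>Lindhagen </span>\n"
                + "<span>\n<a href=\"#\">Ledare</a>\n</span>";
        System.out.println(extract(html));
    }
}
```

Only the entity and the non-breaking space are stripped, so the spaces inside values such as "25 / 25" and "Test Person" are preserved.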
However, in most cases I would agree with stema and Hovercraft Full Of Eels. Pitfalls for regex in HTML are:
However, there are rare cases where regexes are useful:
Upvotes: 0
Reputation: 285405
As far as I know, the best way to extract information from HTML is to use an HTML parser, or to convert the HTML to XHTML and extract it via standard XML techniques. Why can't you use third-party libraries?
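For the XML route, the JDK's own javax.xml classes may be enough, so no third-party library is needed for the extraction itself. The sketch below assumes the page has already been converted to well-formed XHTML without a namespace declaration (for example, &nbsp; written as &#160; and the & in the link URLs written as &amp;); the class name XhtmlSpans and the method extract are made up for illustration, and the conversion step itself is not shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; assumes the input is already well-formed XHTML.
class XhtmlSpans {
    static List<String> extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Select <span> elements that have no child elements,
            // which skips the span wrapping the <a> links.
            NodeList spans = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//span[not(*)]", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < spans.getLength(); i++) {
                String text = spans.item(i).getTextContent()
                        .replace('\u00A0', ' ') // non-breaking space to plain space
                        .trim();
                if (!text.isEmpty()) {
                    values.add(text);
                }
            }
            return values;
        } catch (Exception e) {
            // Raw HTML (bare & or &nbsp;) ends up here: it is not well-formed XML.
            throw new IllegalStateException("input is not well-formed XML", e);
        }
    }
}
```

The HTML exactly as posted in the question would be rejected by this parser, because of the unescaped & in the href attributes, which is why the conversion to XHTML has to come first.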
Upvotes: 4