Mike Lowery

Reputation: 2868

Trying to parse links in an HTML directory listing using Java regex

OK, I know everyone is going to tell me not to use regex to parse HTML, but I'm programming on Android and don't have ready access to an HTML parser (that I'm aware of). Besides, this is server-generated HTML, which should be more consistent than user-generated HTML.

The regex looks like this:

Pattern patternMP3 = Pattern.compile(
        "<A HREF=\"[^\"]+.+\\.mp3</A>",
        Pattern.CASE_INSENSITIVE |
        Pattern.UNICODE_CASE);
Matcher matcherMP3 = patternMP3.matcher(HTML);
while (matcherMP3.find()) { ... }

The input HTML is all on one line, which seems to be causing the problem: when each link is on its own line, this pattern works. Any suggestions?

Upvotes: 2

Views: 1205

Answers (3)

Jim Blackler

Reputation: 23179

For your information, on Android you can parse HTML 'properly' with a combination of org.cyberneko.html.parsers.SAXParser, org.xml.sax.* and org.dom4j.*.

http://sourceforge.net/projects/nekohtml

http://www.saxproject.org

http://dom4j.sourceforge.net

Upvotes: 0

Jens

Reputation: 25593

The regex

"<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>"

should match your links and capture the link target and the filename in its groups. Note, though, that the value of href does not necessarily need to be enclosed in quotes in HTML. (Or, if it does, neither browsers nor developers know that =). )
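As a rough sketch of how that pattern could be used on single-line input (the class name Mp3LinkDemo, the method extractLinks, and the sample HTML are made up for illustration; only the pattern and the flags come from this thread):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Mp3LinkDemo {
    // Pattern from the answer above, with the flags used in the question.
    // Group 1 is the href value, group 2 is the link text without ".mp3".
    private static final Pattern MP3_LINK = Pattern.compile(
            "<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Hypothetical helper: returns one {href, name} pair per matched link.
    static List<String[]> extractLinks(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = MP3_LINK.matcher(html);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up directory listing, deliberately all on one line.
        String html = "<A HREF=\"/music/a.mp3\">a.mp3</A><br>"
                + "<A HREF=\"/music/b.mp3\">b.mp3</A><br>";
        for (String[] link : extractLinks(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

Because [^\"]+ and [^<]+? cannot run past the closing quote or the next tag, each link is matched separately even when the whole listing is on one line.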

Upvotes: 1

CWF

Reputation: 2147

You shouldn't be matching '.+' since you've already got [^\"]+ (which is better for your particular situation).

Try:

"<A HREF=\"[^\"]+\\.mp3\"</A>"

Also, don't forget the double-quote after the mp3.
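To illustrate why the '.+' in the question's pattern misbehaves on single-line input, here is a small sketch that counts matches for that pattern with and without a line break between two links (the class name GreedyMatchDemo, the helper countMatches, and the sample links are invented for this example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyMatchDemo {
    // Hypothetical helper: counts how many times the regex matches the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Pattern from the question, unchanged.
        String original = "<A HREF=\"[^\"]+.+\\.mp3</A>";
        // Made-up sample links.
        String link1 = "<A HREF=\"a.mp3\">a.mp3</A>";
        String link2 = "<A HREF=\"b.mp3\">b.mp3</A>";

        // One line: the greedy .+ runs to the last ".mp3</A>", so both
        // links are swallowed by a single match.
        System.out.println(countMatches(original, link1 + link2));        // prints 1
        // Separate lines: '.' does not match '\n' by default, so each
        // line produces its own match.
        System.out.println(countMatches(original, link1 + "\n" + link2)); // prints 2
    }
}
```

This matches the behaviour described in the question: the pattern only appears to work when a newline happens to stop the greedy '.+'.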

Upvotes: 0
