Reputation: 261

Trying to parse links in an HTML directory listing using Java

Can someone please help me parse these links from an HTML page?

http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299
http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154
http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158

I want to select them using the word "handle", which is common to all of these links.

I'm using Pattern pattern = Pattern.compile("<a.+href=\"(.+?)\""); but it matches every href link on the page.

Any suggestions?
Thanks

Upvotes: 1

Answers (2)

Lucas de Oliveira

Reputation: 1632

Your regex is matching more than you intended. Instead of

Pattern pattern = Pattern.compile("<a.+href=\"(.+?)\"");

Try:

Pattern pattern = Pattern.compile("<a\\s+href=\"(.+?)\"");

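The 'a.+' in your first pattern matches any character one or more times, and because '+' is greedy it can run across other attributes or even other tags on the same line. If you meant to match whitespace, use '\s+' instead (written "\\s+" inside a Java string literal).

The following code works: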

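    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299\"/> " +
            "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154\" /> " +
            "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158\"/>";

    // Pattern.MULTILINE is dropped here: it only changes how ^ and $ match,
    // and this pattern uses neither.
    Pattern p = Pattern.compile("<a\\s+href=\"(.+?)\"");
    Matcher m = p.matcher(s);
    while (m.find()) {
        // m.start() is the offset of the match, group(1) is the href value
        System.out.println(m.start() + " : " + m.group(1));
    }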

Output:

0 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299
72 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154
145 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158
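To make the difference between the two patterns concrete, here is a minimal sketch that runs both against the same input. The class name GreedyVsWhitespace, the helper hrefs, and the sample HTML are made up for illustration and do not come from the page in the question. With two anchors on one line, the greedy 'a.+' swallows everything up to the last href on that line, so only the last link is captured, while '\s+' finds each anchor separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class GreedyVsWhitespace {

    // Collects capture group 1 of every match of the given regex in the input.
    static List<String> hrefs(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample: two anchors on a single line.
        String html = "<a href=\"/first\">1</a> <a href=\"/second\">2</a>";

        // Greedy ".+" runs to the last "href=" on the line.
        System.out.println(hrefs("<a.+href=\"(.+?)\"", html));   // prints [/second]

        // "\\s+" only allows whitespace between "<a" and "href=".
        System.out.println(hrefs("<a\\s+href=\"(.+?)\"", html)); // prints [/first, /second]
    }
}
```

Note that neither pattern filters on "handle" yet; both still capture every href they can reach.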

Upvotes: 1

Jared

Reputation: 2474

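Your regular expression matches ALL <a href...> tags. In these links "handle" always appears as part of "/dspace/handle/", so you can use something like this to scrape the URLs you're looking for (this assumes the href values in the page source are relative and begin with /dspace/handle/; for absolute URLs the group would also need to allow the scheme and host in front):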

Pattern pattern = Pattern.compile("<a.+href=\"(/dspace/handle/.+?)\"");
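As a rough sketch of the same idea that accepts both relative and absolute hrefs, the variant below allows any characters before "/dspace/handle/" inside the quotes and keeps the match inside a single tag. The class name HandleLinks, the method extract, and the sample HTML (including the "/dspace/browse" and example.org links) are assumptions for illustration; the real page markup may differ, for example by using single quotes around attribute values, which this pattern does not handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class HandleLinks {

    // [^>]*? lets other attributes precede href without leaving the tag;
    // [^"]* allows an optional scheme and host before /dspace/handle/.
    private static final Pattern HANDLE_HREF = Pattern.compile(
            "<a\\s[^>]*?href=\"([^\"]*/dspace/handle/[^\"]+)\"");

    // Returns the href value of every anchor whose URL contains /dspace/handle/.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HANDLE_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample markup mixing handle links with unrelated ones.
        String html =
                "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299\">A</a>\n"
              + "<a class=\"x\" href=\"/dspace/handle/123456789/3154\">B</a>\n"
              + "<a href=\"/dspace/browse?type=author\">C</a>\n"
              + "<a href=\"http://example.org/about\">D</a>\n";

        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```

Using negated character classes ([^>] and [^"]) instead of '.+' keeps each match inside one tag and one attribute value, which avoids the greedy over-matching of the original pattern.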

Upvotes: 2
