Ermias Asghedom

Reputation: 95

How do I use a regex in Java to extract this from HTML?

I'm trying to pull data from the ESPN box scores, and one of the HTML files contains:

<td style="text-align:left" nowrap><a href="http://espn.go.com/nba/player/_/id/2754/channing-frye">Channing Frye</a>, PF</td>

and I'm only interested in grabbing the name (Channing Frye) and the position (PF).

Right now, I'm using Pattern.quote(start) + "(.*?)" + Pattern.quote(end) to grab the text between start and end. I'm not sure how to match text that starts with http://espn.go.com/nba/player/_/id/, continues with (any integer)/anyfirst-anylast">, then captures the name I need (Channing Frye), is followed by </a>, then captures the position I need (PF), and ends with </td>.

Thanks!

Upvotes: 0

Views: 82

Answers (5)

rvd

Reputation: 337

You can use:

String lString = "<td style=\"text-align:left\" nowrap><a href=\"http://espn.go.com/nba/player/_/id/2754/channing-frye\">Channing Frye</a>, PF</td>";
Pattern lPattern = Pattern.compile("<td.+><a.+id/\\d+/.+\\-.+>(.+)</a>, (.+)</td>");
Matcher lMatcher = lPattern.matcher(lString);
while(lMatcher.find()) {
    System.out.println(lMatcher.group(1));
    System.out.println(lMatcher.group(2));
}

This will give you:

Channing Frye
PF

Upvotes: 0

Sn.

Reputation: 87

Here is one regex:

  • . matches any single character (except line terminators by default); .+ matches one or more characters
  • .* matches zero or more characters
  • \s matches a whitespace character

    String str = "<td style=\"text-align:left\" nowrap><a href=\"http://espn.go.com/nba/player/_/id/2754/channing-frye\">Channing Frye</a>, PF</td>";
    Pattern pattern = Pattern.compile("<td.+>.*<a.+>(.+)</a>[\\s,]+(.+)</td>");
    Matcher matcher = pattern.matcher(str);
    
    while(matcher.find()){
        System.out.println(matcher.group(1));
        System.out.println(matcher.group(2));
    }
    

Upvotes: 1

Amit Joki

Reputation: 59292

Use this regex:

[A-Z\sa-z0-9]+(?=</a>)|\w+(?=</td>)
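
A minimal sketch of how this pattern might be applied in Java. Because the regex is an alternation with lookaheads and no capture groups, the name and the position arrive as two consecutive matches from find(). The class name LookaheadExtract and the helper findAll are hypothetical names chosen for illustration, and the backslashes are doubled for the Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadExtract {
    // Same regex as above, with backslashes doubled for a Java string literal.
    private static final Pattern CELL =
        Pattern.compile("[A-Z\\sa-z0-9]+(?=</a>)|\\w+(?=</td>)");

    // Hypothetical helper: collects every match in document order.
    static List<String> findAll(String html) {
        List<String> matches = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<td style=\"text-align:left\" nowrap><a href=\"http://espn.go.com/nba/player/_/id/2754/channing-frye\">Channing Frye</a>, PF</td>";
        System.out.println(findAll(html)); // [Channing Frye, PF]
    }
}
```

Note that a name containing characters outside the class, such as an apostrophe or a hyphen, would only be matched from the character after it, so the class may need widening for real box scores.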

Upvotes: 1

l'L'l

Reputation: 47282

You could use this pattern:

\\/nba\\/player\\/_\\/.*\\\">(.*)<.+>,\\s(.*)<

This will match any link in the HTML that contains `/nba/player/_/`.

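// The literal double quote must be written as \\\" so the string is not terminated early.
String re = "\\/nba\\/player\\/_\\/.*\\\">(.*)<.+>,\\s(.*)<";
String str = "<td style=\"text-align:left\" nowrap><a href=\"http://espn.go.com/nba/player/_/id/2754/channing-frye\">Channing Frye</a>, PF</td>";

Pattern p = Pattern.compile(re, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(str);
while (m.find()) {
    System.out.println(m.group(1)); // name
    System.out.println(m.group(2)); // position
}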

example: http://regex101.com/r/hA3uV0

Upvotes: 1

Ilya I

Reputation: 1282

Here is the pattern:

http://espn.go.com/nba/player/_/id/(\d+)/([\w-]+)">(.*?)</a>,\s*(\w+)</td>

You can verify regular expressions with this tool: http://www.regexplanet.com/advanced/java/index.html
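
As a rough sketch, the pattern above could be used from Java as follows. The regex is unchanged apart from the escaping a Java string literal requires (\d becomes \\d, " becomes \"). The class name PlayerCell and the helper parse are hypothetical names for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlayerCell {
    // The pattern from the answer, escaped for a Java string literal.
    private static final Pattern CELL = Pattern.compile(
        "http://espn.go.com/nba/player/_/id/(\\d+)/([\\w-]+)\">(.*?)</a>,\\s*(\\w+)</td>");

    // Hypothetical helper: returns {name, position}, or null when nothing matches.
    static String[] parse(String html) {
        Matcher m = CELL.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Groups 1 and 2 hold the numeric id and the URL slug;
        // groups 3 and 4 hold the display name and the position.
        return new String[] { m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String html = "<td style=\"text-align:left\" nowrap><a href=\"http://espn.go.com/nba/player/_/id/2754/channing-frye\">Channing Frye</a>, PF</td>";
        String[] result = parse(html);
        System.out.println(result[0]); // Channing Frye
        System.out.println(result[1]); // PF
    }
}
```

Anchoring on the player URL and on </td> keeps the match inside a single table cell, and the id and slug remain available through groups 1 and 2 if they are needed later.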

Upvotes: 2
