Scott

Reputation: 33

What would make my HTML parsing code more efficient?

This morning I decided to work on a little project that parses the gas prices for all Maverik gas stations into an array. I got most of it working fairly easily; the only part of my code that feels "dirty" is the actual parsing of the HTML into variables. I'm using indexOf and substring to get to the data I want, and I feel there has to be a cleaner way to do it. Anyway, here is my code. It compiles and works great; it's just not as clean as I'd like.

maverik.java contains the main method and the bulk of the code for the project. maverikObj.java contains the getters and setters, constructor and toString methods.

To change the gas station you are getting console data from you can simply change the number in the array println on line 90 of maverik.java. Future revisions will have methods to control what data is displayed based on user requests.

Here is an example HTML with prices:

html4 = "<b>Maverik Store 4</b><br/>5200 Chinden Blvd<br>Boise, ID<br>208-376-0532<br><center><b></b></center><br /><font color=red>Fuel Prices -- Updated every 30 minutes</font><br /><div><div style=\"float: left; width: 70%; text-align:right;\">Adventure Club Card</div><div style=\"float: right; width: 30%; text-align:center;\">Retail</div><br /><div style=\"float: left;width: 30%;\">Unleaded:</div><div style=\"float: left; width: 30%; text-align:center;\"> 3.379</div><div style=\"float: right; width: 30%; text-align:center;\"> 3.399</div><br /><div style=\"float: left;width: 30%;\">Blend 89:</div><div style=\"float: left; width: 30%; text-align:center;\"> 3.469</div><div style=\"float: right; width: 30%; text-align:center;\"> 3.499</div><br /><div style=\"float: left;width: 30%;\">Blend 90:</div><div style=\"float: left; width: 30%; text-align:center;\"> 3.549</div><div style=\"float: right; width: 30%; text-align:center;\"> 3.579</div><br /><div style=\"float: left;width: 30%;\">Premium:</div><div style=\"float: left; width: 30%; text-align:center;\"> 3.599</div><div style=\"float: right; width: 30%; text-align:center;\"> 3.639</div><br /><div style=\"float: left;width: 30%;\">Diesel:</div><div style=\"float: left; width: 30%; text-align:center;\"> 4.039</div><div style=\"float: right; width: 30%; text-align:center;\"> 4.059</div>";

Currently I'm parsing the address, city, state, phone number and all 8 of the gas types possible at each station (Unleaded, Blend 87, 88, 89, 99, Premium, Diesel). It gets a bit trickier, though, because some of the HTML entries do not list all 8; most have only 4 or 5 of the possible fuel types. So to parse this data I used two methods.

Address, City, State, Phone number are parsed using:

if (line.contains(" = \"<b>Maverik Store") && !line.contains("Coming Soon!")) {
    address = splitLine[3].substring(0, splitLine[3].length() - 3).replace(" ", " ");
    city = splitLine[4].substring(0, splitLine[4].length() - 7);
    state = splitLine[4].substring(splitLine[4].length() - 5, splitLine[4].length() - 3);
    phone = splitLine[5].substring(0, splitLine[5].length() - 3);
}

Fuel types are parsed using if/else statements: the if branch records the price when it's present, and the else branch records 0.0, since my constructor requires every fuel type to have some value.

if(line.indexOf("Unleaded:")>0){
    unleaded=Double.parseDouble(line.substring(line.indexOf("Unleaded:")+147, line.indexOf("Unleaded:")+152));
}
else{
    unleaded=0.0;
}

As you can see, I use a lot of substring and indexOf calls to get the data I want. My fear is that this is an extremely static way of getting the data (the offsets are hard-coded), and so I feel it's a really dirty way of doing things. Any tips on how I can clean up my code are appreciated! =)

Upvotes: 0

Views: 339

Answers (3)

michael

Reputation: 9799

Not to put too fine a point on it, but using regular expressions to parse html (or even xml) is the source of all evil in the world today. (Ok, a tiny exaggeration, but only a little.)

There are a number of utilities out there that try to do their best to handle the inherently messy mess that is our modern html. One for Java is "jsoup". For example:

package foo;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class Bar {
  public static void main(String[] args) {
    //Document doc = Jsoup.connect(url).get();
    String html = "<html>...</html>";
    Document doc = Jsoup.parse(html);
    Elements divs = doc.select("div");
    for (Element e : divs) {
       System.out.println(e.text());
    }
  }
}

Then, even given your sample html snippet (a lot is left as an exercise for the reader):

$ java -cp jsoup-1.7.2.jar:.  foo.Bar

Adventure Club Card Retail Unleaded: 3.379 3.399 Blend 89: 3.469 3.499 Blend 90: 3.549 3.579 Premium: 3.599 3.639 Diesel: 4.039 4.059
Adventure Club Card
Retail
Unleaded:
3.379
3.399
Blend 89:
3.469
3.499
Blend 90:
3.549
3.579
Premium:
3.599
3.639
Diesel:
4.039
4.059
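
The flat list of div texts above still has to be paired up into fuel types and prices. Here is a minimal sketch of that step, assuming the texts were collected in document order from `e.text()`; the class name `FuelRows` and the method `group` are made up for illustration and are not part of jsoup. It treats any text ending in `:` as a fuel label and reads the next two entries as the club and retail prices, so a station that lists only some fuel types simply produces a smaller map.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of jsoup: pairs flat div texts into price rows.
class FuelRows {

    /** Maps fuel name -> {club price, retail price}, in document order. */
    static Map<String, double[]> group(List<String> texts) {
        Map<String, double[]> prices = new LinkedHashMap<>();
        for (int i = 0; i + 2 < texts.size(); i++) {
            String label = texts.get(i).trim();
            if (!label.endsWith(":")) {
                continue; // wrapper text, column header, or stray value
            }
            String fuel = label.substring(0, label.length() - 1);
            // Assumes the two entries after a label are numeric prices;
            // parseDouble throws NumberFormatException otherwise.
            double club = Double.parseDouble(texts.get(i + 1).trim());
            double retail = Double.parseDouble(texts.get(i + 2).trim());
            prices.put(fuel, new double[] {club, retail});
            i += 2; // skip the two price entries just consumed
        }
        return prices;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList(
            "Adventure Club Card", "Retail",
            "Unleaded:", "3.379", "3.399",
            "Diesel:", "4.039", "4.059");
        group(texts).forEach((fuel, p) ->
            System.out.println(fuel + ": club " + p[0] + ", retail " + p[1]));
    }
}
```

Missing fuel types can then be handled with `prices.containsKey(...)` instead of a hard-coded 0.0 for each one.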

Upvotes: 1

Martin Bories

Reputation: 1117

Okay, first of all, I use a totally different coding style that is more beautiful (in my opinion). I would recommend that you look at some different coding styles and decide which you like most.

I've had similar issues with an XML file, and it turned out to be quite a mess. The best thing you could do is write your own XML parser; since HTML has much the same structure as XML, you could use it for parsing HTML files as well, at least when the markup is well-formed.

As it is quite a lot of work, I could give you my implementation (tell me if you want it; it's open source, of course). It is designed to get the developer to the data they want quickly. Usage example:

XMLDocument document = new XMLDocument("yourXMLSourceCode");
XMLNode node = document.getNode("html.body.div");
String attribute = document.get("html.body.div?id");
String content = document.get("html.body.div.input");
XMLNode[] mynodes = document.getNode("html.body").getSubNodes("input");

You might find other solutions by searching for "SAX parser" or "XML parser".

I think that, with a few small tweaks, you could use such a parser for HTML as well.

Otherwise, as I did when working with HTML, you could use an HTML parser. I've had very good experiences with Jsoup.
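
One caveat if you go the XML-parser route: a strict XML parser only accepts well-formed markup, and the sample in the question contains unclosed `<br>` tags. The sketch below uses the JDK's built-in `DocumentBuilder` to check a snippet; the class name `StrictXmlCheck` and the method `isWellFormed` are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: reports whether the JDK's XML parser accepts the markup.
class StrictXmlCheck {

    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the markup
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Self-closed <br/> is fine for an XML parser.
        System.out.println(isWellFormed("<p>5200 Chinden Blvd<br/>Boise, ID</p>"));
        // An unclosed <br>, as in the question's sample, is rejected.
        System.out.println(isWellFormed("<p>5200 Chinden Blvd<br>Boise, ID</p>"));
    }
}
```

This is why a lenient HTML parser such as Jsoup tends to be the easier choice for markup you do not control.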

Upvotes: 4

Mikhail Vladimirov

Reputation: 13890

You can use regular expressions like this:

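import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

BufferedReader reader = new BufferedReader (
    new InputStreamReader (
        new URL ("https://www.maverik.com/locations/").
            openStream ()));

// Store header: store number, street address, "City, ST" and phone number.
Pattern linePattern = Pattern.compile ("<b>Maverik Store ([^<]*)</b><br/>([^<]*)<br>([^<]*)<br>([^<]*)<br><center><b></b></center><br /><font color=red>Fuel Prices -- Updated every 30 minutes</font>");
// One price row: fuel label, club price, retail price.
// The pattern expects the attribute quotes to appear backslash-escaped (\") in each line, as in the sample in the question.
// Each row must be followed by <br />, so a final row without one (like Diesel in that sample) is not matched.
Pattern pricePattern = Pattern.compile ("<div style=\\\\\"float: left;width: 30%;\\\\\">([^<]*)</div><div style=\\\\\"float: left; width: 30%; text-align:center;\\\\\">([^<]*)</div><div style=\\\\\"float: right; width: 30%; text-align:center;\\\\\">([^<]*)</div><br />");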

String line;
while ((line = reader.readLine ()) != null)
{
    Matcher lineMatcher = linePattern.matcher (line);
    if (lineMatcher.find ())
    {
        System.out.println ("Store #: " + lineMatcher.group (1));
        System.out.println ("Store Address 1: " + lineMatcher.group (2));
        System.out.println ("Store Address 2: " + lineMatcher.group (3));
        System.out.println ("Store Phone: " + lineMatcher.group (4));

        Matcher priceMatcher = pricePattern.matcher (line);
        while (priceMatcher.find ())
        {
            System.out.println (priceMatcher.group (1) + priceMatcher.group (2) + priceMatcher.group (3));
        }
        System.out.println ();
    }
}

For me it outputs:

Store #: 4
Store Address 1: 5200 Chinden Blvd
Store Address 2: Boise, ID
Store Phone: 208-376-0532
Unleaded: 3.379 3.399
Blend 89: 3.469 3.499
Blend 90: 3.549 3.579
Premium: 3.599 3.639

Store #: 6
Store Address 1: 8561 West State
Store Address 2: Boise, ID
Store Phone: 208-853-1226
Unleaded: 3.379 3.399
Blend 88: 3.849 3.879
Blend 89: 3.469 3.499
Blend 90: 3.549 3.579

Store #: 7
Store Address 1: Highway   310  North
Store Address 2: Bridger, MT
Store Phone: 406-662-3356
Unleaded: 3.249 3.269
Blend 87: 3.499 3.529
Blend 89: 3.499 3.529
Premium: 3.489 3.529

Store #: 130
Store Address 1: 105  South  200  West
Store Address 2: Bountiful, UT
Store Phone: 801-292-6792
Unleaded: 3.269 3.289
Blend 87: 3.359 3.389
Blend 89: 3.439 3.469

Store #: 134
Store Address 1: 105  East Winnemucca
Store Address 2: Winnemucca, NV
Store Phone: 775-623-5948
Unleaded: 3.559 3.579
Blend 87: 3.649 3.679
Blend 89: 3.729 3.759

Store #: 135
Store Address 1: 1571  North  Main
Store Address 2: Sheridan, WY
Store Phone: 307-672-7010
Unleaded: 3.159 3.179

Store #: 136
Store Address 1: 222  South  Main
Store Address 2: Lyman, WY
Store Phone: 307-786-2705
Unleaded: 3.269 3.289
Blend 87: 3.359 3.389
Blend 89: 3.439 3.469
Premium: 3.489 3.529

Store #: 137
Store Address 1: 7th  & Main
Store Address 2: Snowflake, AZ
Store Phone: 928-536-7511
Unleaded: 3.539 3.559
Blend 89: 3.629 3.659
Blend 90: 3.709 3.739

...
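
If you want the prices as numbers rather than printed text, the same idea can be loosened a little. This is a sketch only, with made-up names (`PriceRegex`, `parse`): it skips the style attributes with `[^>]*` instead of spelling them out, does not require a trailing `<br />`, and collects each row into a map, so it still depends on the label/price/price div layout shown in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects "Label: price price" rows into a map.
class PriceRegex {

    // A div whose text ends in ':' followed by two divs holding numbers.
    private static final Pattern ROW = Pattern.compile(
        "<div[^>]*>([^<:]+):</div>"
        + "<div[^>]*>\\s*([0-9.]+)\\s*</div>"
        + "<div[^>]*>\\s*([0-9.]+)\\s*</div>");

    /** Maps fuel name -> {club price, retail price}, in document order. */
    static Map<String, double[]> parse(String html) {
        Map<String, double[]> prices = new LinkedHashMap<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            prices.put(m.group(1).trim(), new double[] {
                Double.parseDouble(m.group(2)),
                Double.parseDouble(m.group(3))});
        }
        return prices;
    }

    public static void main(String[] args) {
        // Shortened version of the sample HTML from the question.
        String html = "<div><div style=\"float: left; width: 70%;\">Adventure Club Card</div>"
            + "<div style=\"float: right; width: 30%;\">Retail</div><br />"
            + "<div style=\"float: left;width: 30%;\">Unleaded:</div>"
            + "<div style=\"float: left; width: 30%;\"> 3.379</div>"
            + "<div style=\"float: right; width: 30%;\"> 3.399</div><br />"
            + "<div style=\"float: left;width: 30%;\">Diesel:</div>"
            + "<div style=\"float: left; width: 30%;\"> 4.039</div>"
            + "<div style=\"float: right; width: 30%;\"> 4.059</div>";
        parse(html).forEach((fuel, p) ->
            System.out.println(fuel + ": club " + p[0] + ", retail " + p[1]));
    }
}
```

Fuel types a station does not sell are simply absent from the map, which avoids a separate if/else per fuel type.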

Upvotes: -1
