Jacob Rhodes

Reputation: 89

HTML Parsing in Android

So here is the problem: I am currently creating an Android app that needs to parse some HTML so I can display it on the app screen.

I don't know how to do that properly and was wondering if you guys could point me in the right direction or show me a good guide.

What I want to do is go through the HTML and take out certain items (specifically the food items, as you will see in a minute). I don't want to just link the person to the website or use a WebView to display the page in the app, because I personally feel that doesn't look good. What I want to do is pull the food items from the HTML and then put just that part in my app as a string or something similar.

-----Here is a bit of the html from the site I am using for reference------


<a href="http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m784&amp;MI=122&amp;RN=CEREAL  HOT  GRITS" OnClick="javascript: NewWindow('http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m784&amp;MI=122&amp;RN=CEREAL  HOT  GRITS', 'RDA_window',  'width=450, height=600, scrollbars=no, toolbar=no,  directories=no, status=no, menubar=no, copyhistory=no');return false" Class="recipeLink">CEREAL  HOT  GRITS</a>

                <br>

              </td>

            </tr>

          </table>

        </div>

      </td>

    </tr>

    <tr>

      <td>

        <div class="menuTxt">

          <table cellpadding="0" cellspacing="0" border="0" bordercolor="green">

            <tr valign="top">

              <td colspan="3">

                <a href="http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m860&amp;MI=122&amp;RN=PANCAKES  BUTTERMILK" OnClick="javascript: NewWindow('http://www.campusdish.com/en-US/CSMA/OldDominion/Locations/rda.aspx?RCN=m860&amp;MI=122&amp;RN=PANCAKES  BUTTERMILK', 'RDA_window',  'width=450, height=600, scrollbars=no, toolbar=no,  directories=no, status=no, menubar=no, copyhistory=no');return false" Class="recipeLink">PANCAKES  BUTTERMILK  </a>

------end html-------

So I want to just extract the words "CEREAL HOT GRITS" and "PANCAKES BUTTERMILK" for example.

Please and thank you for your help!

Upvotes: 0

Views: 2134

Answers (4)

Jimmy

Reputation: 16428

I would recommend JSoup. I've used it on a few Android projects and it's been incredibly reliable; I don't have any complaints about it.

As the example on the JSoup website shows:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

You can use select() to pull out whatever data you need.

Pay particular attention to the selectors. From the above example you appear to want just the food names, so you can get them from the <a> tags using something like this:

Elements resultLinks = doc.select("a");

Another tip: drop in a breakpoint right after you've created the Document, then use the expression evaluator in your IDE to snoop around and work out which elements you need.
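To make the selector advice concrete, here is a minimal sketch that narrows the match to the links carrying the recipeLink class seen in the question's HTML, rather than every <a> on the page. It assumes the jsoup library is on the classpath; the class name MenuParser and the method foodNames are made up for illustration, and the sample string is a shortened stand-in for the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup: collects the text of recipe links.
class MenuParser {

    // Returns the visible text of every <a> element with class "recipeLink".
    static List<String> foodNames(String html) {
        Document doc = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        for (Element link : doc.select("a.recipeLink")) {
            // text() normalizes runs of whitespace and trims the ends
            names.add(link.text());
        }
        return names;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the page in the question (assumed structure)
        String html = "<div class=\"menuTxt\">"
                + "<a href=\"rda.aspx?RN=CEREAL  HOT  GRITS\" Class=\"recipeLink\">CEREAL  HOT  GRITS</a><br>"
                + "<a href=\"/about\">About us</a>"
                + "<a href=\"rda.aspx?RN=PANCAKES  BUTTERMILK\" Class=\"recipeLink\">PANCAKES  BUTTERMILK  </a>"
                + "</div>";
        for (String name : foodNames(html)) {
            System.out.println(name);
        }
    }
}
```

In a real app you would get the Document from Jsoup.connect(url).get() instead of a string. On Android that call has to run off the main thread, since network access on the main thread throws NetworkOnMainThreadException.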

Upvotes: 1

user949300

Reputation: 15729

There are (at least) two reasonable approaches.

1) Use a real HTML parser. (@you786 suggested this) I'm most familiar with Jsoup, but @CommonsWare mentioned a link to some others. You then methodically go through the HTML tree to find what you want. This works best if the HTML is reasonably well formed and structured, and retains that form and structure over time.

2) Just "leap" to what you want. (@Odiefrom suggested this) In your example, search (use String.indexOf()) for "<a href", then search from there for "RN=", then grab all the text up to the next double quote. This works best if the HTML structure is a huge mess or you don't want to bother figuring it out (e.g., they overused tables and what you want is about 22 levels down; yes, I've seen this!), and if the text to search for is very distinctive and unique to your information. You probably want to do a little extra "sanity checking" of the text in this case.
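Approach 2 could be sketched roughly as follows, using only String.indexOf() and no parser. The class name RecipeLinkScanner and the method extractNames are invented for illustration, and the code assumes the name sits in the href after "RN=" as unencoded text, as in the question's sample. If the real page URL-encodes the value, it would need decoding as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: "leaps" to each anchor and reads the RN= value.
class RecipeLinkScanner {

    static List<String> extractNames(String html) {
        List<String> names = new ArrayList<>();
        int pos = 0;
        while (true) {
            int anchor = html.indexOf("<a href", pos);
            if (anchor < 0) break;                 // no more links
            int tagEnd = html.indexOf('>', anchor);
            if (tagEnd < 0) break;                 // unterminated tag, give up
            pos = tagEnd + 1;                      // always advance past this tag

            int rn = html.indexOf("RN=", anchor);
            if (rn < 0) break;                     // no RN= anywhere further on
            if (rn > tagEnd) continue;             // RN= belongs to a later tag

            int start = rn + "RN=".length();
            int quote = html.indexOf('"', start);
            if (quote < 0 || quote > tagEnd) continue; // sanity check: stay inside the tag

            // Collapse the double spaces seen in the sample and trim the ends
            String name = html.substring(start, quote).trim().replaceAll("\\s+", " ");
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<a href=\"rda.aspx?RCN=m784&amp;MI=122&amp;RN=CEREAL  HOT  GRITS\" "
                + "Class=\"recipeLink\">CEREAL  HOT  GRITS</a>";
        System.out.println(extractNames(html));
    }
}
```

The checks against tagEnd are the "sanity checking" mentioned above: they keep a link without "RN=" from picking up a value that belongs to some later tag.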

Upvotes: 0

you786

Reputation: 3550

Simple: You should use the JSoup library.

Upvotes: 0

Odiefrom

Reputation: 40

It might not be the most efficient way, but you could take the HTML source, put it in a string, and then parse through it line by line. Whenever you hit a line with <a href at the beginning, you can check whether it is a food item (I don't know how you'd do that without knowing the rest of the links, but there is probably a different structure, or food items might start after link 7 or so; websites usually have a recognizable pattern). If it is a food item, grab the link (for the image) and the name, or whatever you need.

Upvotes: 0
