coder_For_Life22
coder_For_Life22

Reputation: 26971

How to create JSOUP selector for the following

For example I want to extract the text in this article HTML:

    <div class="description">
            <div style="clear: none;" class="post-fb-like">
              <fb:like class=" fb_edge_widget_with_comment fb_iframe_widget" href="http://mashable.com/2011/08/07/3-handy-mobile-apps/" send="true" width="625" height="61"><span><iframe src="http://www.facebook.com/plugins/like.php?api_key=116628718381794&amp;channel_url=http%3A%2F%2Fstatic.ak.fbcdn.net%2Fconnect%2Fxd_proxy.php%3Fversion%3D3%23cb%3Df138585052991e8%26origin%3Dhttp%253A%252F%252Fmashable.com%252Ff15a8eb75cc2b58%26relation%3Dparent.parent%26transport%3Dpostmessage&amp;href=http%3A%2F%2Fmashable.com%2F2011%2F08%2F07%2F3-handy-mobile-apps%2F&amp;layout=standard&amp;locale=en_US&amp;node_type=link&amp;sdk=joey&amp;send=true&amp;show_faces=true&amp;width=625" class="fb_ltr" title="Like this content on Facebook." style="border: medium none; overflow: hidden; height: 29px; width: 625px;" name="f2d40595a65cf36" id="f24fece5e565ec4" scrolling="no"></iframe></span></fb:like>
            </div>
                        <p><img src="http://ec.mashable.com/wp-content/uploads/2009/01/bizspark2.gif" alt="" align="left"><em>The <a href="http://mashable.com/tag/bizspark">Spark of Genius Series</a> highlights a unique feature of startups and is made possible by <a rel="nofollow" href="http://www.microsoftstartupzone.com/BizSpark/Pages/At_a_Glance.aspx?WT.mc_id=MSZ_Mashable_posts" target="_blank">Microsoft BizSpark</a>. If you would like to have your startup considered for inclusion, please see the details <a href="http://mashable.com/bizspark/">here</a>.</em></p>

<p><img src="http://5.mshcdn.com/wp-content/uploads/2011/08/mobile-devices.jpg" alt="" title="mobile devices" class="alignright" height="141" width="225">Each <a href="http://mashable.com/follow/topics/startup-weekend-roundup">weekend</a>, <em>Mashable</em> hand-picks startups we think are building interesting, unique or niche products. </p>
<p>This week, we’ve rounded up startups making mobile applications that bridge the physical and digital worlds for improved communication and enhanced experiences. </p>
<p>TransFire breaks down global communication barriers with its instant and automatic translation capabilities, while Babbleville facilitates neighbor-to-neighbor communication around events or topics. And, Picdish uses time and place to bring friends together over shared mobile food experiences.</p>
<hr>

And I have another HTML page I want to extract text from too, but its in different format. I want to extract this text from http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2

How would I go about creating a selector to extract the text no matter which article url is given?

Upvotes: 0

Views: 1405

Answers (1)

BalusC
BalusC

Reputation: 1108712

How would I go about creating a selector to extract the text no matter which article url is given?

You can't. All websites have their own HTML structure. Open the page in the webbrowser yourself, rightclick and View Source. Look. You should create a separate selector for each individual website.

For your first example, assuming that it's the whole HTML, the text is thus inside those <p> tags. You can then use

Document html = Jsoup.parse(yourHtmlString);
Elements paragraphs = html.select("p");
String text = paragraphs.text();
// ...

For your CNN site, according the HTML source you'd like to get all <p>s of the <div class="cnn_strycntntlft">, so this selector should do:

Document document = Jsoup.connect("http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2").get();
Elements paragraphs = document.select(".cnn_strycntntlft p");
String text = paragraphs.text();
// ...

By the way, it would be easier to just use their RSS feeds instead of parsing the whole HTML. Lot of news sites provides RSS feeds for exactly this purpose.

Upvotes: 1

Related Questions