Reputation: 26971
For example I want to extract the text in this article HTML:
<div class="description">
<div style="clear: none;" class="post-fb-like">
<fb:like class=" fb_edge_widget_with_comment fb_iframe_widget" href="http://mashable.com/2011/08/07/3-handy-mobile-apps/" send="true" width="625" height="61"><span><iframe src="http://www.facebook.com/plugins/like.php?api_key=116628718381794&channel_url=http%3A%2F%2Fstatic.ak.fbcdn.net%2Fconnect%2Fxd_proxy.php%3Fversion%3D3%23cb%3Df138585052991e8%26origin%3Dhttp%253A%252F%252Fmashable.com%252Ff15a8eb75cc2b58%26relation%3Dparent.parent%26transport%3Dpostmessage&href=http%3A%2F%2Fmashable.com%2F2011%2F08%2F07%2F3-handy-mobile-apps%2F&layout=standard&locale=en_US&node_type=link&sdk=joey&send=true&show_faces=true&width=625" class="fb_ltr" title="Like this content on Facebook." style="border: medium none; overflow: hidden; height: 29px; width: 625px;" name="f2d40595a65cf36" id="f24fece5e565ec4" scrolling="no"></iframe></span></fb:like>
</div>
<p><img src="http://ec.mashable.com/wp-content/uploads/2009/01/bizspark2.gif" alt="" align="left"><em>The <a href="http://mashable.com/tag/bizspark">Spark of Genius Series</a> highlights a unique feature of startups and is made possible by <a rel="nofollow" href="http://www.microsoftstartupzone.com/BizSpark/Pages/At_a_Glance.aspx?WT.mc_id=MSZ_Mashable_posts" target="_blank">Microsoft BizSpark</a>. If you would like to have your startup considered for inclusion, please see the details <a href="http://mashable.com/bizspark/">here</a>.</em></p>
<p><img src="http://5.mshcdn.com/wp-content/uploads/2011/08/mobile-devices.jpg" alt="" title="mobile devices" class="alignright" height="141" width="225">Each <a href="http://mashable.com/follow/topics/startup-weekend-roundup">weekend</a>, <em>Mashable</em> hand-picks startups we think are building interesting, unique or niche products. </p>
<p>This week, we’ve rounded up startups making mobile applications that bridge the physical and digital worlds for improved communication and enhanced experiences. </p>
<p>TransFire breaks down global communication barriers with its instant and automatic translation capabilities, while Babbleville facilitates neighbor-to-neighbor communication around events or topics. And, Picdish uses time and place to bring friends together over shared mobile food experiences.</p>
<hr>
And I have another HTML page I want to extract text from too, but its in different format. I want to extract this text from http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2
How would I go about creating a selector to extract the text no matter which article url is given?
Upvotes: 0
Views: 1405
Reputation: 1108712
How would I go about creating a selector to extract the text no matter which article url is given?
You can't. All websites have their own HTML structure. Open the page in the webbrowser yourself, rightclick and View Source. Look. You should create a separate selector for each individual website.
For your first example, assuming that it's the whole HTML, the text is thus inside those <p>
tags. You can then use
Document html = Jsoup.parse(yourHtmlString);
Elements paragraphs = html.select("p");
String text = paragraphs.text();
// ...
For your CNN site, according the HTML source you'd like to get all <p>
s of the <div class="cnn_strycntntlft">
, so this selector should do:
Document document = Jsoup.connect("http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2").get();
Elements paragraphs = document.select(".cnn_strycntntlft p");
String text = paragraphs.text();
// ...
By the way, it would be easier to just use their RSS feeds instead of parsing the whole HTML. Lot of news sites provides RSS feeds for exactly this purpose.
Upvotes: 1