How to create JSOUP selector for the following

Question

For example, I want to extract the text in this article HTML:

    
            
              
            
The Spark of Genius Series highlights a unique feature of startups and is made possible by Microsoft BizSpark. If you would like to have your startup considered for inclusion, please see the details here.

Each weekend, Mashable hand-picks startups we think are building interesting, unique or niche products. 
This week, we’ve rounded up startups making mobile applications that bridge the physical and digital worlds for improved communication and enhanced experiences. 
TransFire breaks down global communication barriers with its instant and automatic translation capabilities, while Babbleville facilitates neighbor-to-neighbor communication around events or topics. And, Picdish uses time and place to bring friends together over shared mobile food experiences.

I also have another HTML page I want to extract text from, but it's in a different format. I want to extract the article text from http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2

How would I go about creating a selector to extract the text no matter which article url is given?

BalusC · Accepted Answer

How would I go about creating a selector to extract the text no matter which article url is given?

You can't. All websites have their own HTML structure. Open the page in your web browser, right-click, choose View Source, and look at the markup. You need to create a separate selector for each individual website.
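One way to organize "a separate selector for each website" is a small lookup table keyed by host name. The following is only a sketch using the JDK alone: the class name `SiteSelectors` is hypothetical, the CNN selector is the one given further down in this answer, and the `mashable.com` entry is an assumption (that the first example comes from that host and uses plain paragraph tags).

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Maps a site's host name to the CSS selector that matches its article paragraphs. */
final class SiteSelectors {

    // Hypothetical registry: each entry has to be worked out by reading that site's HTML source.
    // The CNN selector is taken from this answer; the Mashable entry is an assumption.
    private static final Map<String, String> SELECTORS = Map.of(
            "cnn.com", ".cnn_strycntntlft p",
            "mashable.com", "p");

    /** Returns the selector registered for the URL's host, or empty if the site is unknown. */
    static Optional<String> selectorFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return Optional.empty();
        }
        // Treat "www.cnn.com" and "cnn.com" as the same site.
        String key = host.toLowerCase(Locale.ROOT);
        if (key.startsWith("www.")) {
            key = key.substring(4);
        }
        return Optional.ofNullable(SELECTORS.get(key));
    }

    public static void main(String[] args) {
        String url = "http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2";
        // With jsoup on the classpath, the selector would be passed to document.select(...).
        System.out.println(selectorFor(url).orElse("no selector registered"));
    }
}
```

An unknown host yields an empty `Optional`, so the caller can decide whether to skip the page or to inspect its source and add a new entry.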

For your first example, assuming that it's the whole HTML, the text is inside those <p> tags. You can then use

Document html = Jsoup.parse(yourHtmlString);
Elements paragraphs = html.select("p");
String text = paragraphs.text();
// ...

For the CNN page, according to the HTML source you'd want all the <p> elements inside the element with class cnn_strycntntlft, so this selector should do:

Document document = Jsoup.connect("http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2").get();
Elements paragraphs = document.select(".cnn_strycntntlft p");
String text = paragraphs.text();
// ...

By the way, it would be easier to just use their RSS feeds instead of parsing the whole HTML. A lot of news sites provide RSS feeds for exactly this purpose.
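As a rough illustration of the RSS route, here is a minimal sketch that reads the title and description of each item from an RSS 2.0 document using only the JDK's built-in XML parser. The class name `RssItems` and the sample feed are made up for the example; a real feed would be downloaded from the site's feed URL first, and many feeds carry only a summary rather than the full article text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class RssItems {

    /** Returns "title: description" for every <item> element in the given RSS XML. */
    static List<String> summaries(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations, since feeds are untrusted input.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document feed = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            NodeList items = feed.getElementsByTagName("item");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                result.add(childText(item, "title") + ": " + childText(item, "description"));
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse the feed as XML", e);
        }
    }

    /** Text of the first descendant element with the given tag, or "" if there is none. */
    private static String childText(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        return matches.getLength() == 0 ? "" : matches.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Made-up sample feed, standing in for one fetched from a news site.
        String sample = "<rss version=\"2.0\"><channel><title>Example feed</title>"
                + "<item><title>First story</title><description>Summary one.</description></item>"
                + "<item><title>Second story</title><description>Summary two.</description></item>"
                + "</channel></rss>";
        summaries(sample).forEach(System.out::println);
    }
}
```

Because RSS has the same element names on every site, this code does not need a per-site selector, which is the main advantage over scraping each page's HTML.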
