rena-c

Reputation: 325

How to extract text between <p> tags

I want to extract the text inside the p and li tags of HTML pages, so that I can tokenize each page and build an inverted index per page for answering search queries.

How can I get the p tags using jsoup?

Elements e = doc.select(""); 

What string should I pass as the parameter?

Upvotes: 10

Views: 31657

Answers (3)

PANKAJ MALI

Reputation: 1

Try this:

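import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ExtractParagraphs {
    public static void main(String[] args) throws IOException {
        File input = new File("/home/s5/Downloads/PDFCopy/PDs.html");
        // The third argument is the base URI used to resolve relative links.
        Document doc = Jsoup.parse(input, "UTF-8", "http://www.cisco.com/c/en/us/products/collateral/wireless/aironet-1815-series-access-points/datasheet-c78-738481.pdf");
        Elements paragraphs = doc.select("p");
        // Combined text of all matched <p> elements.
        String text = paragraphs.text();
        // Split on runs of non-word characters so no empty tokens appear between words.
        String[] words = text.split("\\W+");
        for (String word : words) {
            System.out.println(word);
        }
    }
}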
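Since the goal in the question is an inverted index, here is a rough sketch of how the tokens could be fed into one. It uses only the standard library and assumes the text of each page has already been extracted; the class name InvertedIndexSketch, the method build, and the idea of using the list position as the page id are all made up for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // Maps each lower-cased token to the sorted set of page ids containing it.
    // The page id is simply the position of the page in the input list (an assumption).
    static Map<String, SortedSet<Integer>> build(List<String> pageTexts) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int pageId = 0; pageId < pageTexts.size(); pageId++) {
            String lower = pageTexts.get(pageId).toLowerCase(Locale.ROOT);
            for (String token : lower.split("\\W+")) {
                if (token.isEmpty()) {
                    continue; // split can yield one empty string at the start
                }
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(pageId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> pages = List.of("Some bold text", "Some other text");
        // Print each token followed by the ids of the pages it occurs in.
        build(pages).forEach((token, ids) -> System.out.println(token + " -> " + ids));
    }
}
```

A query for a single word is then a lookup of that word (lower-cased) in the map.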

Upvotes: 0

NomanJaved

Reputation: 1390

String testText1 = d.select("body").text();
System.out.println(testText1);

or

String testText2 = d.select("body p").text();
System.out.println(testText2);

Here d is the parsed Document. The first version returns all text in the body; the second returns only the text inside p tags.

Upvotes: 0

MaVRoSCy

Reputation: 17849

This will do the job:

Elements e = doc.select("p");

Here is a list of all selectors you can use.
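Since the question also mentions li tags, a comma-separated selector can match both in one call. The following is only a minimal sketch: the class name ParagraphAndListText, the method extract, and the sample HTML are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParagraphAndListText {
    // Returns the text of every <p> and <li> element found in the HTML.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        // A comma in a selector means "match either": here, p or li elements.
        for (Element el : doc.select("p, li")) {
            texts.add(el.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<p>first paragraph</p><ul><li>item one</li><li>item two</li></ul>";
        for (String text : extract(html)) {
            System.out.println(text);
        }
    }
}
```

Note that if a p is nested inside an li, its text will show up twice, once for each matched element.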

Suppose you have this html:

String html = "<p>some <strong>bold</strong> text</p>";

To get "some bold text" as the result, use:

Document doc = Jsoup.parse(html);
Element p = doc.select("p").first();
String text = doc.body().text(); // some bold text

or

String text = p.text(); // some bold text

Now suppose you have the following, more complex HTML:

String html = "<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";

To get the values from the two p tags, you have to do something like this:

Document doc = Jsoup.parse(html);
Element content = doc.getElementById("someid");
Elements p = content.getElementsByTag("p");

String pConcatenated = "";
for (Element x : p) {
  pConcatenated += x.text(); // text() trims each paragraph, so no separator is added
}

System.out.println(pConcatenated); // some textanother p tag
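If the concatenated string is going to be tokenized afterwards, it is probably safer to put a separator between the paragraphs so that the last word of one does not run into the first word of the next. A minimal sketch, using a made-up class JoinParagraphs and helper joinParagraphs:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JoinParagraphs {
    // Joins the text of all <p> elements under the element with the given id,
    // separated by single spaces. Returns "" if no such element exists.
    static String joinParagraphs(String html, String containerId) {
        Document doc = Jsoup.parse(html);
        Element content = doc.getElementById(containerId);
        if (content == null) {
            return "";
        }
        // eachText() gives one trimmed string per matched element that has text.
        return String.join(" ", content.getElementsByTag("p").eachText());
    }

    public static void main(String[] args) {
        String html = "<div id=someid><p>some text</p><span>some other text</span><p> another p tag</p></div>";
        System.out.println(joinParagraphs(html, "someid")); // some text another p tag
    }
}
```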

You can find more info here also

Hope this helped

Upvotes: 21
