user3062229

Reputation: 111

How to extract the text after certain tags using jsoup

I am using jsoup to extract tweet text from Twitter. The HTML structure is:

 <p class="js-tweet-text tweet-text">@sexyazzjas There is so much love in the air, Jasmine! Thanks for the shout out. <a href="/search?q=%23ATTLove&amp;src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" ><s>#</s><b>ATTLove</b></a></p>

What I want to get is "There is so much love in the air, Jasmine! Thanks for the shout out.", and I want to extract the text of every tweet on the page. I am new to Java and my code has bugs. Please help me, thank you.

Below is my code:

    package htmlparser;

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class tweettxt {

        public static void main(String[] args) {

            Document doc;
            try {

                // need http protocol
                doc = Jsoup.connect("https://twitter.com/ATT/").get();

                // get page title
                String title = doc.title();
                System.out.println("title : " + title);

                Elements links = doc.select("p class="js-tweet-text tweet-text"");
                for (Element link : links) {
                    System.out.println("\nlink : " + link.attr("p"));
                    System.out.println("text : " + link.text());
                }

            } catch (IOException e) {
                e.printStackTrace();
            }

        }

    }

Upvotes: 0

Views: 1486

Answers (1)

Alkis Kalogeris

Reputation: 17745

Although I agree with Robin Green that you should use the API rather than Jsoup in this case, I will provide a working solution for what you asked, both to close this topic and to help future readers who have trouble with either of the following:

  1. Writing a selector for a tag that has two or more classes.
  2. Getting the direct text of a Jsoup element that contains other elements.

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class TweetText {

        public static void main(String[] args) {
            try {
                // need http protocol
                Document doc = Jsoup.connect("https://twitter.com/ATT/").get();

                // get page title
                String title = doc.title();
                System.out.println("title : " + title);

                // select every <p class="js-tweet-text tweet-text"></p>:
                // chain the class names with dots, without spaces
                Elements tweets = doc.select("p.js-tweet-text.tweet-text");

                for (Element tweet : tweets) {
                    /* use ownText() instead of text() in order to grab the direct text of
                       <p> and not the text that belongs to <p>'s children.
                       (attr("p") was dropped: "p" is a tag name, not an attribute,
                       so it always returned an empty string) */
                    System.out.println("text : " + tweet.ownText());
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
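
To see the difference between `text()` and `ownText()` without hitting the network, here is a minimal sketch that parses the snippet from the question with `Jsoup.parse`. The class name `OwnTextDemo` and the helper `firstTweet()` are made up for illustration, and the `<a>` tag is trimmed to a few of its attributes; only the jsoup calls themselves come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {

    // Sample markup taken from the question (the <a> attributes are shortened).
    static final String HTML =
        "<p class=\"js-tweet-text tweet-text\">@sexyazzjas There is so much love in the air, "
        + "Jasmine! Thanks for the shout out. "
        + "<a href=\"/search?q=%23ATTLove&amp;src=hash\" class=\"twitter-hashtag pretty-link js-nav\">"
        + "<s>#</s><b>ATTLove</b></a></p>";

    // Hypothetical helper: parse the sample and return the first matching <p>.
    static Element firstTweet() {
        Document doc = Jsoup.parse(HTML);
        // "p.js-tweet-text.tweet-text" matches a <p> that carries both classes
        return doc.select("p.js-tweet-text.tweet-text").first();
    }

    public static void main(String[] args) {
        Element tweet = firstTweet();
        // text() also includes the text of child elements (the hashtag link)
        System.out.println("text()    : " + tweet.text());
        // ownText() keeps only the text nodes that sit directly inside <p>
        System.out.println("ownText() : " + tweet.ownText());
    }
}
```

With this input, `ownText()` stops before the hashtag, while `text()` also appends the hashtag from the nested `<a>`. Note that the `@sexyazzjas` mention is plain text inside the `<p>` in this snippet, so it is still part of `ownText()`.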
    

Upvotes: 2
