Reputation: 111
I am using jsoup to extract tweet text. The HTML structure is:
<p class="js-tweet-text tweet-text">@sexyazzjas There is so much love in the air, Jasmine! Thanks for the shout out. <a href="/search?q=%23ATTLove&src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" ><s>#</s><b>ATTLove</b></a></p>
What I want to get is: There is so much love in the air, Jasmine! Thanks for the shout out.
I also want to extract the text of every tweet on the page.
I am new to Java and my code has bugs. Please help, thank you.
Below is my code:
    package htmlparser;

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class tweettxt {
        public static void main(String[] args) {
            Document doc;
            try {
                // need http protocol
                doc = Jsoup.connect("https://twitter.com/ATT/").get();
                // get page title
                String title = doc.title();
                System.out.println("title : " + title);
                Elements links = doc.select("p class="js-tweet-text tweet-text"");
                for (Element link : links) {
                    System.out.println("\nlink : " + link.attr("p"));
                    System.out.println("text : " + link.text());
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
Upvotes: 0
Views: 1486
Reputation: 17745
Although I agree with Robin Green that you should use the API rather than jsoup in this case, here is a working solution to what you asked, both to close this topic and to help future readers who need to get the direct text of a jsoup element that contains other elements. Two things change from your code: the selector is written in CSS syntax (p.js-tweet-text.tweet-text), and ownText() replaces text().
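    package htmlparser;

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // same package and class name as in the question
    public class tweettxt {
        public static void main(String[] args) {
            try {
                // the URL must include the protocol (http:// or https://)
                Document doc = Jsoup.connect("https://twitter.com/ATT/").get();
                // get page title
                System.out.println("title : " + doc.title());
                // select every <p class="js-tweet-text tweet-text"> element
                Elements tweets = doc.select("p.js-tweet-text.tweet-text");
                for (Element tweet : tweets) {
                    // attr("p") was dropped: p is the tag name, not an attribute,
                    // so that call always returned an empty string
                    /* use ownText() instead of text() in order to grab the direct text of
                       <p> and not the text that belongs to <p>'s children */
                    System.out.println("text : " + tweet.ownText());
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }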
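To see the difference between text() and ownText() without hitting the network, here is a minimal sketch that parses a hard-coded string. The sample HTML is a shortened version of the snippet in the question, and the class name OwnTextDemo and the helpers directTexts and fullTexts are made up for illustration; only Jsoup.parse, select, text() and ownText() come from jsoup itself.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class OwnTextDemo {
    // Hypothetical helper: the direct text of every tweet paragraph,
    // i.e. text nodes that are immediate children of the <p>.
    static List<String> directTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element tweet : doc.select("p.js-tweet-text.tweet-text")) {
            texts.add(tweet.ownText());
        }
        return texts;
    }

    // Hypothetical helper: the combined text of each tweet paragraph,
    // including the text of child elements such as the hashtag link.
    static List<String> fullTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element tweet : doc.select("p.js-tweet-text.tweet-text")) {
            texts.add(tweet.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Assumed sample markup, shortened from the question's snippet.
        String html = "<p class=\"js-tweet-text tweet-text\">Thanks for the shout out. "
                + "<a href=\"#\" class=\"twitter-hashtag\"><s>#</s><b>ATTLove</b></a></p>"
                + "<p class=\"other\">not a tweet</p>";

        System.out.println("text()    : " + fullTexts(html));
        System.out.println("ownText() : " + directTexts(html));
    }
}
```

With text() the hashtag from the nested link is part of the result; with ownText() it is not. Note that ownText() only drops text held in child elements, so anything that sits directly inside the <p> (such as the @sexyazzjas mention in the snippet from the question) would still be returned.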
Upvotes: 2