cdaveau

Reputation: 129

JSoup doesn't retrieve JSON data from script tag

I'm trying to read the content of a script tag (JSON data) from a recipe's HTML page using JSoup (1.13.1). I won't post the HTML because the script tag's content is very large.

Whenever I try to print the content, I get an empty string. I tried to select the element in two ways: by ID with doc.select("#__NEXT_DATA__"), and by type with doc.select("script[type='application/json']").

If I iterate through all the script tags, the one I want prints as blank. I also tried printing the content with the text() and toString() methods, but neither works. I even saw someone suggest setting maxBodySize(0), but that doesn't help either.

Here is my code:

String url = "https://www.marmiton.org/recettes/recette_gateau-au-chocolat-fondant-rapide_166352.aspx";
Document doc = Jsoup.connect(url).maxBodySize(0).get();

Elements newsHeadlines = doc.select("#__NEXT_DATA__");
                    
for (Element element : newsHeadlines) {
    System.out.println(element);
}

Upvotes: 0

Views: 884

Answers (3)

user2711811


Treat the script element as data:

Elements newsHeadlines = doc.select("#__NEXT_DATA__");

for (Element element : newsHeadlines) {
    System.out.println(element.data());
}

Note that some consoles may have trouble displaying a single line of 81206 characters (Eclipse did for me), or something in the data may have interfered, so this code prints only the length and the first 100 characters:

for (Element element : newsHeadlines) {
    String data = element.data();
    System.out.println(data.length());

    int printLen = Math.min(100, data.length());
    System.out.println(data.substring(0, printLen));
}

And produces:

81206
{"props":{"pageProps":{"recipeData":{"recipe":{"id":166352,"guid":"7bf48b95-4cd2-4b32-8f41-fb6168510

Note: if you can use a debugger in your environment, it will show that the element held the content all along, as a child node of type DataNode rather than TextNode. That is the first clue that data(), not text(), is the right accessor.
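The DataNode observation can also be checked without a debugger. Below is a minimal sketch, assuming jsoup is on the classpath; the class name, the helper firstDataNodeContent and the tiny SAMPLE_HTML string are hypothetical stand-ins for the real page, which is far larger.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class ScriptDataNodeDemo {
    // Hypothetical, shortened stand-in for the real page's HTML.
    static final String SAMPLE_HTML =
        "<html><body>"
        + "<script id=\"__NEXT_DATA__\" type=\"application/json\">"
        + "{\"props\":{\"id\":166352}}"
        + "</script></body></html>";

    // Returns the content of the first DataNode child of the selected
    // element, or null if the element or a DataNode child is missing.
    static String firstDataNodeContent(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element script = doc.selectFirst(cssQuery);
        if (script == null) {
            return null;
        }
        for (Node child : script.childNodes()) {
            if (child instanceof DataNode) {
                return ((DataNode) child).getWholeData();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstDataNodeContent(SAMPLE_HTML, "#__NEXT_DATA__"));
    }
}
```

In everyday code element.data() is the shorter route; walking childNodes() is only meant to make the node type visible.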

Upvotes: 1

Jonas N

Reputation: 1777

Jsoup's text() returns roughly the text that a browser would render. A script tag is not rendered at all (unless you use CSS tricks), so text() returns an empty string. At least, I think that is what Jsoup's developers had in mind.

Instead, you can use the html() method, which returns the element's raw inner content, in other words the text inside the script element.
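To see the difference between the accessors side by side, here is a small sketch, assuming jsoup is on the classpath. The class name, the selectScript helper and the inline SAMPLE_HTML are made up for illustration and are not taken from the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ScriptAccessorsDemo {
    // Hypothetical, shortened stand-in for the real page's HTML.
    static final String SAMPLE_HTML =
        "<html><body>"
        + "<script id=\"__NEXT_DATA__\" type=\"application/json\">"
        + "{\"props\":{\"id\":166352}}"
        + "</script></body></html>";

    // Parses the HTML string and returns the script element by its ID.
    static Element selectScript(String html) {
        Document doc = Jsoup.parse(html);
        return doc.selectFirst("#__NEXT_DATA__");
    }

    public static void main(String[] args) {
        Element script = selectScript(SAMPLE_HTML);
        // text() only collects text nodes, so it is empty for a script.
        System.out.println("text() empty: " + script.text().isEmpty());
        // html() serializes the element's children, including the script data.
        System.out.println("html(): " + script.html());
        // data() returns the script content directly.
        System.out.println("data(): " + script.data());
    }
}
```

Both html() and data() expose the JSON here; data() is the one documented for script and style content, so it is probably the safer default.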

Upvotes: 0

SatvikVejendla

Reputation: 459

Jsoup doesn't execute the script tags. When it scrapes the website, it receives the HTML source as the server sent it, BEFORE any JavaScript has run. So anything that scripts would add to the page afterwards is not in the parsed document.

For this case, you might want to try another API, such as Selenium.

Upvotes: 0
