Mostafa Shakoori

Reputation: 305

How to get this script from HTML with jsoup in Android programming

I want to get a string value from a script in an HTML page using jsoup, but there are two problems:

  1. There are six scripts on that page, and I want to select the fourth one with jsoup, but I don't know how.
  2. There is a key in that script, and I want to get the value of that key.

Here is the script I want:

<script type="text/javascript">window._sharedData={

  "entry_data": {
    "PostPage": [
      {
        "media": {

          "key": "This is the key and i wanna catch it!!!",

        },      
      }
    ]
  },

};</script>

I have tried many ways, but none of them worked.

I'm looking forward to an answer, so please don't let me down!

Upvotes: 1

Views: 1668

Answers (1)

luksch

Reputation: 11712

jsoup will only help you get the contents of the script tag as a string. It parses HTML, not the JavaScript inside a script element. Since in your case the script content is a simple object in JSON notation, you can run a JSON parser on it once you have the script string and have stripped off the variable assignment. In the code below I use the JSON.simple parser.

import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.JSONValue;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<script></script><script></script><script></script>"
    +"<script type=\"text/javascript\">window._sharedData={"
    +"  \"entry_data\": {"
    +"    \"PostPage\": ["
    +"      {"
    +"        \"media\": {"
    +"          \"key\": \"This is the key and i wanna catch it!!!\","
    +"        },"
    +"      }"
    +"    ]"
    +"  },"
    +"};</script><script></script>";
Document doc = Jsoup.parse(html);
// get the 4th script (the index is zero-based)
Element scriptEl = doc.select("script").get(3);
// data() returns the raw content of a script element
String scriptContentStr = scriptEl.data();
// strip the variable assignment and the trailing semicolon to get the JSON
String jsonStr = scriptContentStr
     .replaceFirst("^.*=\\{", "{") // clean beginning
     .replaceFirst(";\\s*$", ""); // clean end
// walk down the structure: entry_data -> PostPage[0] -> media -> key
JSONObject rootJO = (JSONObject) JSONValue.parse(jsonStr);
JSONArray postPageJA = (JSONArray) ((JSONObject) rootJO.get("entry_data")).get("PostPage");
JSONObject postJO = (JSONObject) postPageJA.get(0);
JSONObject mediaJO = (JSONObject) postJO.get("media");
String keyStr = (String) mediaJO.get("key");

System.out.println("keyStr = "+keyStr);
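Selecting the script by position and cleaning it with a single-line regex are both fragile if the page changes or the script spans several lines. The following is only a sketch of an alternative, using plain Java with no jsoup or JSON library: it picks the script by a marker string and cuts out the object literal by brace position. The class name `SharedDataScript` and both method names are hypothetical, and it assumes the script bodies have already been collected as strings (for example from each script element's `data()`).

```java
import java.util.Arrays;
import java.util.List;

final class SharedDataScript {

    // Returns the first script body containing the marker, or null if none does.
    static String findByMarker(List<String> scriptBodies, String marker) {
        for (String body : scriptBodies) {
            if (body.contains(marker)) {
                return body;
            }
        }
        return null;
    }

    // Returns the text from the first '{' to the last '}' inclusive,
    // or null if the body contains no such pair. Line breaks do not matter.
    static String objectLiteral(String scriptBody) {
        int start = scriptBody.indexOf('{');
        int end = scriptBody.lastIndexOf('}');
        if (start < 0 || end < start) {
            return null;
        }
        return scriptBody.substring(start, end + 1);
    }

    public static void main(String[] args) {
        // Stand-ins for the script contents of a page (assumed input).
        List<String> scripts = Arrays.asList(
                "var a = 1;",
                "window._sharedData={\n  \"entry_data\": {}\n};");
        String script = findByMarker(scripts, "window._sharedData");
        if (script != null) {
            System.out.println(objectLiteral(script));
        }
    }
}
```

The string returned by `objectLiteral` can then be handed to the JSON parser in place of `jsonStr`. This assumes the script contains exactly one top-level object assignment; a script with several statements would need more careful handling.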

This is a bit complicated, and it depends on knowing the structure of the JSON. A simpler way may be to use a regular expression:

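import java.util.regex.Matcher;
import java.util.regex.Pattern;

// match "media", then "key", then capture the quoted value that follows
Pattern p = Pattern.compile(
    "media[\":\\s\\{]+key[\":\\s\\{]+\"([^\"]+)\"",
    Pattern.DOTALL);
Matcher m = p.matcher(html);
if (m.find()){
    String keyFromRE = m.group(1);
    System.out.println("keyStr (via RegEx) = "+keyFromRE);
}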
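The pattern above stops at the first double quote, so a value containing an escaped quote (`\"`) would be cut short. As a rough sketch only, the variant below takes the key name as a parameter and allows backslash escapes inside the value. The class name `JsonKeyRegex` and the method name are hypothetical. Unlike the pattern above, it is not anchored to `media`: it returns the first string value for that key anywhere in the text, and it returns the value exactly as written, without decoding the escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JsonKeyRegex {

    // Returns the raw string value of the first "key": "value" pair found,
    // or null if the key does not occur with a string value.
    static String firstStringValue(String text, String key) {
        // Regex: "KEY" \s* : \s* " ( (?: [^"\\] | \\. )* ) "
        Pattern p = Pattern.compile(
                "\"" + Pattern.quote(key) + "\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String script = "window._sharedData={ \"media\": { \"key\": \"some value\", }, };";
        System.out.println(firstStringValue(script, "key"));
    }
}
```

If the same key name can appear in several places in the script, walking the parsed JSON is the safer choice.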

Upvotes: 4
