Reputation: 1129
I have my HTML page as shown below:
<htm>
<section class="posts">
<script type="application/ld+json">
{
"url": "http://schema.org",
"title": "some Title"
}
</script>
<article class="post">
</html>
I want to extract the data between <script type="application/ld+json"> and </script>. I tried the following code, but it's not working:
Pattern pattern = Pattern.compile("<script type=\"application\\/ld\\+json\">(.*?)</script>");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Am I doing something wrong? Thanks.
Upvotes: 1
Views: 1153
Reputation: 7753
This regex selects the JSON from the HTML above:
<script type="application\/ld\+json">(.*)<\/script>
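In Java, compile it with Pattern.DOTALL so that . also matches the line breaks inside the script block. Without that flag, . stops at line terminators, which is why your attempt finds nothing on the multi-line page: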
String str = "<htm><section class=\"posts\"><script type=\"application/ld+json\">{\"url\": \"http://schema.org\", \"title\": \"some Title\"}</script><article class=\"post\"></html>";
String regex = "<script type=\"application\\/ld\\+json\">(.*)<\\/script>";
Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
prints
{"url": "http://schema.org", "title": "some Title"}
See DEMO for explanation
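One caveat with the greedy (.*): combined with DOTALL it runs to the last </script> in the document, so a page with several script blocks would come back as one large match. A minimal sketch of a variant with a reluctant quantifier follows. The class name LdJsonExtractor, the method extract, and the sample HTML are made up for illustration, and it assumes the opening tag is written exactly as in the question (same attribute order and quoting).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class LdJsonExtractor {
    // DOTALL lets '.' match line terminators; the reluctant (.*?) stops
    // at the first </script> instead of the last one in the document.
    private static final Pattern LD_JSON = Pattern.compile(
            "<script type=\"application/ld\\+json\">(.*?)</script>",
            Pattern.DOTALL);

    // Returns the trimmed body of every matching script block, in document order.
    static List<String> extract(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher matcher = LD_JSON.matcher(html);
        while (matcher.find()) {
            blocks.add(matcher.group(1).trim());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch: two multi-line blocks.
        String html = "<html>\n"
                + "<script type=\"application/ld+json\">\n{\"title\": \"first\"}\n</script>\n"
                + "<p>text</p>\n"
                + "<script type=\"application/ld+json\">\n{\"title\": \"second\"}\n</script>\n"
                + "</html>";
        for (String block : extract(html)) {
            System.out.println(block);
        }
    }
}
```

With the sample input this prints the two JSON objects on separate lines. If the attributes of the script tag can vary, a regex like this becomes brittle and an HTML parser is the safer route.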
Upvotes: 4
Reputation:
Jsoup may be the best solution for you; it allows you to quickly and easily parse HTML. For your particular problem (assuming that you are getting the HTML from a String), the following would work:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
Document doc = Jsoup.parse(str);
Elements scriptElements = doc.select("script[type=\"application/ld+json\"]");
String scriptContent = scriptElements.first().html();
Upvotes: 2