nmvictor

Reputation: 1129

Extract data between HTML tags

I have my HTML page as shown below:

<html>

<section class="posts">

      <script type="application/ld+json">
        {
          "url": "http://schema.org",
          "title": "some Title"
        }
      </script>


    <article class="post">
</html>

I want to extract the data between <script type="application/ld+json"> and </script>. I tried the following code, but it's not working.

Pattern pattern = Pattern.compile("<script type=\"application\\/ld\\+json\">(.*?)</script>");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}

Am I doing something wrong? Thanks.

Upvotes: 1

Views: 1153

Answers (2)

MaxZoom

Reputation: 7753

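By default, . does not match line terminators, so your (.*?) cannot span the multi-line script body; compiling the pattern with Pattern.DOTALL fixes that. The regex to select the JSON from the HTML above: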

<script type="application\/ld\+json">(.*)<\/script>

In Java code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "<html><section class=\"posts\"><script type=\"application/ld+json\">{\"url\": \"http://schema.org\", \"title\": \"some Title\"}</script><article class=\"post\"></html>";
String regex = "<script type=\"application\\/ld\\+json\">(.*)<\\/script>";
// DOTALL lets . match line terminators, so the group can span a multi-line script body
Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
  System.out.println(matcher.group(1));
}

prints

{"url": "http://schema.org", "title": "some Title"}
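
If the page can contain more than one such script block, a greedy (.*) combined with DOTALL would run from the first opening tag to the last closing tag. Below is a minimal sketch of a lazy variant, run on multi-line input like the HTML in the question. The class and method names (JsonLdExtractor, extract) are made up for illustration, and the pattern assumes the opening tag is written exactly as shown, with no extra attributes or whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonLdExtractor {

    // Lazy (.*?) stops at the first closing tag; DOTALL lets '.' match line breaks.
    private static final Pattern JSON_LD = Pattern.compile(
            "<script type=\"application/ld\\+json\">(.*?)</script>",
            Pattern.DOTALL);

    /** Returns the trimmed body of every matching script block, in document order. */
    public static List<String> extract(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher matcher = JSON_LD.matcher(html);
        while (matcher.find()) {
            bodies.add(matcher.group(1).trim());
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Hypothetical multi-line input with two blocks.
        String html = "<html>\n"
                + "<script type=\"application/ld+json\">\n"
                + "{ \"title\": \"first\" }\n"
                + "</script>\n"
                + "<p>text</p>\n"
                + "<script type=\"application/ld+json\">\n"
                + "{ \"title\": \"second\" }\n"
                + "</script>\n"
                + "</html>";
        for (String body : extract(html)) {
            System.out.println(body);
        }
    }
}
```

Each block is returned separately; with the greedy quantifier the same input would yield a single match that also swallows the <p>text</p> markup between the two blocks.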


Upvotes: 4

user4316868

Reputation:

Jsoup may be the best solution for you; it allows you to quickly and easily parse HTML. For your particular problem (assuming that you are getting the HTML from a String), the following would work:

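import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
Document doc = Jsoup.parse(str);
Elements scriptElements = doc.select("script[type=\"application/ld+json\"]");
String scriptContent = scriptElements.first().html(); // first() returns null if nothing matched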

Upvotes: 2
