Abhishek Ramachandran

Reputation: 1170

Extract JSON-LD from HTML using Apache Any23

My aim is to extract structured data from web pages. I'm using the code from this SO question, with the Apache Any23 CLI library as a dependency in my Spring project.

With this setup I can extract the HTML5 Microdata (Schema.org) from web pages, but I can't extract the JSON-LD present in those pages. According to Apache Any23's documentation, the JSON-LD format is supported, but I couldn't find any further documentation on it.

Upvotes: 8

Views: 579

Answers (1)

ircecho

Reputation: 93

Usually, if you create a new Any23 extractor with new Any23(), it should work out of the box. If you use another constructor, such as Any23(String... extractorNames), you have to make sure that the correct extractor for embedded JSON-LD is included; its name is "html-embedded-jsonld".

Now if there are any errors in the extraction process, Any23 drops them silently. (It's great, I know!)

I found that it is possible to set a breakpoint in the notifyIssue method of org.apache.any23.extractor.ExtractionResultImpl. With this you may be able to find a more detailed reason for your problem.
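
Before stepping through Any23 in a debugger, it may be worth ruling out that the HTML you hand to it simply contains no embedded JSON-LD. Below is a minimal sketch using only the JDK. JsonLdProbe and findJsonLdBlocks are hypothetical names made up for this example, not part of Any23, and the regex is a rough check rather than a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Any23): lists embedded JSON-LD blocks in raw HTML.
final class JsonLdProbe {

    // Matches <script ... type="application/ld+json" ...> ... </script>.
    // This is only a rough check; a regex is not a full HTML parser.
    private static final Pattern JSON_LD_SCRIPT = Pattern.compile(
            "<script\\b[^>]*\\btype\\s*=\\s*[\"']application/ld\\+json[\"'][^>]*>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private JsonLdProbe() {}

    // Returns the trimmed body of every JSON-LD script element found in the HTML.
    static List<String> findJsonLdBlocks(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher matcher = JSON_LD_SCRIPT.matcher(html);
        while (matcher.find()) {
            blocks.add(matcher.group(1).trim());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Sample markup for illustration; replace it with the HTML you pass to Any23.
        String html = "<html><head>"
                + "<script type=\"application/ld+json\">{\"@type\": \"Person\"}</script>"
                + "</head><body></body></html>";
        System.out.println(findJsonLdBlocks(html).size() + " JSON-LD block(s) found");
    }
}
```

If this returns an empty list for the exact HTML string you give to Any23, the problem is probably in the input rather than in the extractor. One possible explanation, which I am only assuming here, is that the page adds its JSON-LD with JavaScript after loading.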

Upvotes: 0
