Reputation: 38143
I am working on a project that extracts content from web pages and normalizes this content to a discrete set of types. Right now I am only working with text and images.
For images, I've found https://schema.org/ImageObject, which seems to fit just fine.
For text, however, I am not sure what to use. Apart from the primitive datatype http://schema.org/Text, I'm not finding anything on schema.org that represents generic text. I am new to linked, semantic data, and am not sure whether primitives are intended to be used as full-on types.
Furthermore, I would like to be able to distinguish text fragments by their use on the source webpage. For example, I'd like to be able to specify that one span of text was paragraph text, while another was header text. On schema.org there is https://schema.org/WebPageElement, which also includes https://schema.org/WPHeader, but there is no WPParagraph, or WPTextFragment, or anything like that.
I've looked around other vocabularies, but I'm not sure which might be a good fit. Above all, I am looking to employ something that already exists and that people recognize.
Upvotes: 0
Views: 76
Reputation: 959
Have you taken a look at the Open Annotation ontology, from the W3C? (http://www.openannotation.org/spec/core/core.html#BodyEmbed). Currently it is only a draft, but it could help you annotate pieces of text. It also lets you assert which document the text was extracted from and who owns the annotations (i.e., their provenance). I don't think it includes terms such as "header", but it has selectors for specifying the concrete parts of the web page/document you are annotating: http://www.openannotation.org/spec/core/specific.html#TextPositionSelector.
It also provides the mechanisms to annotate areas of images (http://www.openannotation.org/spec/core/specific.html#SvgSelector). It could be as simple or complex as you want.
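As a rough illustration of the text case, here is a minimal Python sketch that builds one annotation as a JSON-LD-style dictionary: the extracted text is embedded as the body, and the target points back at the source page through a TextPositionSelector. The `oa:` and `cnt:` terms are the ones I believe the Open Annotation draft uses, but check them against the spec. The function name, the example URL, and especially the `ex:role` property are my own assumptions: as said above, the ontology has no "header"/"paragraph" terms, so that part would have to come from another vocabulary.

```python
import json

# Namespaces as I understand them from the Open Annotation draft; verify against the spec.
OA = "http://www.w3.org/ns/oa#"
CNT = "http://www.w3.org/2011/content#"
# Hypothetical namespace for the "header"/"paragraph" distinction (NOT part of Open Annotation).
EX = "http://example.org/vocab#"


def make_text_annotation(source_url, page_text, start, end, role):
    """Build an annotation dict for page_text[start:end] (hypothetical helper)."""
    if not (0 <= start < end <= len(page_text)):
        raise ValueError("start/end must delimit a non-empty span inside page_text")
    return {
        "@context": {"oa": OA, "cnt": CNT, "ex": EX},
        "@type": "oa:Annotation",
        # Embedded textual body: the extracted fragment itself.
        "oa:hasBody": {
            "@type": "cnt:ContentAsText",
            "cnt:chars": page_text[start:end],
            "ex:role": role,  # assumption: custom property, e.g. "header" or "paragraph"
        },
        # Target: the source page, narrowed to a character range.
        "oa:hasTarget": {
            "@type": "oa:SpecificResource",
            "oa:hasSource": {"@id": source_url},
            "oa:hasSelector": {
                "@type": "oa:TextPositionSelector",
                "oa:start": start,
                "oa:end": end,
            },
        },
    }


page = "Title Some paragraph text."
annotation = make_text_annotation("http://example.org/page", page, 0, 5, "header")
print(json.dumps(annotation, indent=2))
```

The character offsets are relative to whatever normalized text you extract from the page, so you would need to keep that normalization stable for the selectors to stay meaningful.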
Upvotes: 2