How can I semantically represent generic, extracted text?

Question

I am working on a project that extracts content from web pages and normalizes this content to a discrete set of types. Right now I am only working with text and images.

For images, I've found https://schema.org/ImageObject, which seems to fit just fine.

For text, however, I am not sure what to use. Apart from the primitive datatype http://schema.org/Text, I can't find anything on schema.org that represents generic text. I am new to linked, semantic data, and I am not sure whether primitive datatypes are intended to be used as full-fledged types.

Furthermore, I would like to distinguish text fragments by their role on the source web page. For example, I'd like to be able to specify that one span of text was paragraph text, while another was header text. On schema.org there is https://schema.org/WebPageElement, whose subtypes include https://schema.org/WPHeader, but there is no WPParagraph, WPTextFragment, or anything like that.
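To make the goal concrete, here is a rough sketch of the kind of output I have in mind, serialized as JSON-LD from Python. It is only an illustration under assumptions: the helper names (`text_fragment`, `image`, `page_document`) and the example.com URLs are made up, and it falls back to a plain WebPageElement for paragraph text precisely because I could not find a more specific type. Whether WPHeader is even appropriate for heading text, and whether `text` and `hasPart` are the right properties here, is part of what I am unsure about.

```python
import json

SCHEMA_CONTEXT = "https://schema.org"


def text_fragment(text, schema_type="WebPageElement"):
    """Hypothetical helper: wrap an extracted span of text as a schema.org node.

    Defaults to the generic WebPageElement because schema.org has no
    paragraph-specific subtype.
    """
    return {"@type": schema_type, "text": text}


def image(url):
    """Hypothetical helper: describe an extracted image as an ImageObject."""
    return {"@type": "ImageObject", "contentUrl": url}


def page_document(url, parts):
    """Hypothetical helper: group the extracted parts under the source WebPage."""
    return {
        "@context": SCHEMA_CONTEXT,
        "@type": "WebPage",
        "url": url,
        "hasPart": parts,
    }


# Placeholder content; a real extractor would supply these values.
doc = page_document(
    "https://example.com/article",
    [
        text_fragment("Example heading", "WPHeader"),
        text_fragment("Example paragraph text."),
        image("https://example.com/figure.png"),
    ],
)

print(json.dumps(doc, indent=2))
```

With this shape, the header and the paragraph differ only in `@type`, which is the distinction I would like to express with an established vocabulary.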

I've looked at other vocabularies, but I am not sure which might be a good fit. Above all, I want to use something that already exists and that people recognize.

How can I semantically represent generic, extracted text?

Answers (1)
