Dmitrii Rassokhin

Reputation: 21

How to extract a representative subset of triples from a triple store

We have an OpenLink Virtuoso-based triple (or rather quad-) store with about 6 billion triples in it. Our collaborators are asking us to give them a small subset of the data so they can test some of their queries and algorithms. Naturally, if we extract a random subset of graph-subject-predicate-object quads from the entire set, most of their SPARQL queries against the subset will find no solutions, because a small random subset of quads will represent an almost entirely disconnected graph. Is there a technique (possibly Virtuoso-specific) that would allow us to extract a subset of quads s from the entire set S such that, for a given SELECT or CONSTRUCT SPARQL query Q, Q executed against s returns the same solutions as Q executed against the entire set S? If this could be done, we could run all the sample queries that the collaborators want to be able to run against our dataset, extract the smallest possible subset, and send it to them (as an N-Quads file) so they can load it into their own triple store.

Upvotes: 1

Views: 489

Answers (3)

Leslie Sikos

Reputation: 549

Querying a subset of quads cannot give the same results as querying the entire dataset, because by considering only a small percentage of the quads, you lose quads that belong in the answer, too.

Upvotes: 0

TallTed

Reputation: 9444

This doesn't have a pre-built Virtuoso-specific solution. (ObDisclaimer: OpenLink Software produces Virtuoso, and employs me.)

As noted in the comments above, it's actually a rather complex question. From many perspectives, the simple answer to "what is the minimal set to deliver the same result to the same query" is "all of it," and this might well work out to be the final answer, due to the effort needed to arrive at any satisfactory smaller subset.

The following are more along the lines of experimental exploration than concrete advice.

  • It sounds like your collaborators want to run some known query, so I'd start by running that query against your full dataset, then do a DESCRIBE over each ?s and ?p and ?o that appears in the result, load all that output as your subset, and test the original query against that. (A rough Python sketch of this loop follows this list.)

    If known, explicitly including all the ontological data from the large set in the small may help.

    If this sequence doesn't deliver the expected result set, you might try a second, third, or more rounds of the DESCRIBE, this time targeting every new ?s and ?p and ?o that appeared in the previous round.

  • The idea of exposing your existing endpoint, with the full data set, to your collaborators is worth considering. You could grant them only READ permissions and/or adjust server configuration to limit the processing time, result set size, and other aspects of their activity. (An example virtuoso.ini excerpt follows this list.)

  • A sideways approach that may help in thinking about this: in the SQL Relational Table world, a useful subset of a single table is easy -- it can be just a few rows that include at least the one(s) you want your query to return (and often at least a few that you want your query to not return).

    With a SQL Relational Schema involving multiple Tables, the useful subset extends to include the rows of each table which are relationally connected to the (un)desired row(s) in any of the others.

    Now, in the RDF Relational Graph world, each "row" of those SQL Relational Tables might be thought of as having been decomposed, with the primary key of each table becoming a ?s, and each column/field becoming a ?p, and each value becoming a ?o.

    The reality is (obviously?) more complex (you might look at the W3C RDB2RDF work, and the Direct Mapping and R2RML results of that work, for more detail), but this gives a conceptual starting point for considering how to find the quads (or triples) from an RDF dataset that comprise the minimum sub-dataset that will satisfy a given query against both dataset and sub-dataset. (A toy illustration of the decomposition follows below.)
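Here is a rough sketch of the iterative DESCRIBE expansion from the first bullet, written in Python with the SPARQLWrapper and rdflib libraries. The endpoint URL, the test query, and the number of expansion rounds are all placeholders to adapt to your setup -- this is exploratory, not a definitive implementation.

    from SPARQLWrapper import SPARQLWrapper, JSON
    from rdflib import Graph, URIRef

    ENDPOINT = "http://your-virtuoso-host:8890/sparql"  # placeholder endpoint
    TEST_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 100"  # the collaborators' query

    def run_select(query):
        sw = SPARQLWrapper(ENDPOINT)
        sw.setQuery(query)
        sw.setReturnFormat(JSON)
        return sw.query().convert()["results"]["bindings"]

    def describe(iri):
        sw = SPARQLWrapper(ENDPOINT)
        sw.setQuery("DESCRIBE <%s>" % iri)
        return sw.queryAndConvert()  # SPARQLWrapper yields an rdflib Graph for DESCRIBE

    subset = Graph()
    seen = set()
    # Seed the frontier with every IRI bound in the test query's results
    # (only IRIs can be DESCRIBEd, so literal bindings are skipped).
    frontier = {binding[var]["value"]
                for binding in run_select(TEST_QUERY)
                for var in binding
                if binding[var]["type"] == "uri"}

    for _ in range(3):  # second, third, ... rounds, per the first bullet above
        next_frontier = set()
        for iri in frontier - seen:
            seen.add(iri)
            described = describe(iri)
            subset += described
            # Every new IRI in the DESCRIBE output is a candidate for the next round.
            for s, p, o in described:
                for term in (s, p, o):
                    if isinstance(term, URIRef):
                        next_frontier.add(str(term))
        frontier = next_frontier

    subset.serialize("subset.nt", format="nt")  # load this file and re-test the query

If the original query still comes back empty against subset.nt, explicitly adding the ontology graphs (per that bullet's second paragraph) before re-testing would be the next thing to try.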
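For the second bullet, Virtuoso's virtuoso.ini has a [SPARQL] section with knobs along these lines; the values below are placeholders, and you should check your Virtuoso version's documentation for the exact parameter set.

    [SPARQL]
    ResultSetMaxRows           = 10000  ; cap on rows any SPARQL query may return
    MaxQueryCostEstimationTime = 400    ; reject queries whose estimated cost is too high (seconds)
    MaxQueryExecutionTime      = 60     ; wall-clock limit on query execution (seconds)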
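And to make the third bullet's decomposition concrete: a SQL row like Person(id=42, name='Ada', age=36) might surface in RDF roughly as follows (the IRIs are made up, loosely in the spirit of the Direct Mapping):

    <http://example.com/Person/42>  a                                 <http://example.com/Person> .
    <http://example.com/Person/42>  <http://example.com/Person#name>  "Ada" .
    <http://example.com/Person/42>  <http://example.com/Person#age>   36 .

The primary key yields the subject, each column a predicate, each cell value an object -- which is why a "useful subset" in RDF must be chased connection by connection rather than row by row.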

Upvotes: 1

Kingsley Uyi Idehen

Reputation: 925

You must have a known count of entity types in your database, right? Assuming that to be true, why don't you simply apply a SPARQL DESCRIBE to a sample of entities of each entity type?

Example:

DESCRIBE ?EntitySample
WHERE {
  { SELECT (SAMPLE(?Entity) AS ?EntitySample)
           (COUNT(?Entity) AS ?EntityCount)
           ?EntityType
    WHERE { ?Entity a ?EntityType }
    GROUP BY ?EntityType
    HAVING (COUNT(?Entity) > 10)
    LIMIT 50
  }
}
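As a hedged sketch of how that output could become the file the question asks for (the endpoint URL is a placeholder; SPARQLWrapper and rdflib are assumed), something along these lines would run the sampling DESCRIBE and serialize the result for the collaborators:

    from SPARQLWrapper import SPARQLWrapper

    SAMPLING_DESCRIBE = """
    DESCRIBE ?EntitySample
    WHERE {
      { SELECT (SAMPLE(?Entity) AS ?EntitySample)
               (COUNT(?Entity) AS ?EntityCount)
               ?EntityType
        WHERE { ?Entity a ?EntityType }
        GROUP BY ?EntityType
        HAVING (COUNT(?Entity) > 10)
        LIMIT 50
      }
    }
    """

    sw = SPARQLWrapper("http://your-virtuoso-host:8890/sparql")  # placeholder
    sw.setQuery(SAMPLING_DESCRIBE)
    graph = sw.queryAndConvert()  # rdflib Graph holding the described triples
    graph.serialize("sample.nt", format="nt")  # file to send to the collaborators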

Upvotes: 1
