Mark Miller

Reputation: 3096

Check for differences between RDF models with non-deterministic (UUID-based) URIs?

A colleague and I are individually instantiating electronic health records into triples. We'd like to compare our sets of 10k to 100k triples to see if they have the same shapes.

As a policy, I create URIs based on UUIDs, so nothing semantic is embedded in them. I'd like to stick with this policy, as my colleague and I are really trying to holistically compare existing workflows.

I know how to compare two RDF files in TopBraid Composer, but I don't think it will be useful if we have the same data patterns but different URIs. I store my triples in Ontotext GraphDB but am glad to use any other tool.

For example, the triples about person ...fe54977c174a and person ...4bcdc1c8abf9 should be considered equivalent, but ...fe54977c174a and ...ae00dc86b3bb should not. Is this feasible?

I would prefer not to spot-check with hand-crafted SPARQL ASK statements.

@prefix ns0: <http://example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/4f79ea05-2358-4f43-a335-fe54977c174a>
  a <http://example.com/Person> ;
  ns0:gender ns0:Male ;
  ns0:participatesIn ns0:5d2dfc7b-994c-4933-b787-f7971dae397c .

ns0:5d2dfc7b-994c-4933-b787-f7971dae397c
  a ns0:HealthCareEncounter ;
  ns0:startDate "2019-05-01"^^xsd:date ;
  ns0:hasOutput ns0:a129ca96-c6d2-4a07-a4eb-4cf9ce23a314 .

ns0:a129ca96-c6d2-4a07-a4eb-4cf9ce23a314
  a ns0:Diagnosis ;
  ns0:mentions ns0:Headache .

has the same shape as this (despite the different URIs):

@prefix ns0: <http://example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/a740d254-084c-4621-b06d-4bcdc1c8abf9>
  a <http://example.com/Person> ;
  ns0:gender ns0:Male ;
  ns0:participatesIn ns0:060d2091-b4f7-406d-ab0d-75b39b400823 .

ns0:060d2091-b4f7-406d-ab0d-75b39b400823
  a ns0:HealthCareEncounter ;
  ns0:startDate "2019-05-01"^^xsd:date ;
  ns0:hasOutput ns0:bc549711-ed9d-4db6-8cf9-d43022903ef7 .

ns0:bc549711-ed9d-4db6-8cf9-d43022903ef7
  a ns0:Diagnosis ;
  ns0:mentions ns0:Headache .

but this is structurally different (due to the different gender and diagnosis mention):

@prefix ns0: <http://example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/aa3a977a-999a-4c5c-9524-ae00dc86b3bb>
  a <http://example.com/Person> ;
  ns0:gender ns0:Female ;
  ns0:participatesIn ns0:b31a62a5-337a-454d-a637-85aefef26684 .

ns0:b31a62a5-337a-454d-a637-85aefef26684
  a ns0:HealthCareEncounter ;
  ns0:startDate "2019-05-01"^^xsd:date ;
  ns0:hasOutput ns0:6566d543-773e-4649-b589-66eb3d0f3165 .

ns0:6566d543-773e-4649-b589-66eb3d0f3165
  a ns0:Diagnosis ;
  ns0:mentions ns0:Nausea .

Upvotes: 0

Views: 111

Answers (1)

Jeen Broekstra

Reputation: 22052

Eclipse RDF4J (bundled with GraphDB) contains a graph isomorphism utility: Models.isomorphic. By default it only maps blank nodes to blank nodes, so you have two options:

  1. Replace each UUID-based instance IRI in your graphs with a (dictionary-mapped) blank node. This should be fairly easy to do with a HashMap and a bit of looping or streaming magic.
  2. Have a look at the code for the Models utility and adapt the part that does blank node mapping so that it maps IRIs instead.
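
To illustrate the idea behind option 1 without pulling in any library, here is a rough, stdlib-only Python sketch. It is not RDF4J and not what Models.isomorphic does internally: it treats every term ending in a UUID as if it were a blank node and compares graphs by iterative neighbourhood hashing (colour refinement). Equal signatures are a necessary condition for isomorphism but not always a sufficient one, so treat a match as "probably the same shape". The names `is_instance`, `shape_signature` and `record`, the plain-tuple triple representation, and the `rounds=3` default are all assumptions made for this example.

```python
import hashlib
import re
from collections import Counter, defaultdict

# Assumption: instance IRIs end in a UUID, as in the question's examples.
UUID_IRI = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def is_instance(term):
    """True for terms that should be treated like blank nodes."""
    return UUID_IRI.search(term) is not None


def _digest(parts):
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def shape_signature(triples, rounds=3):
    """Multiset of triples with UUID terms replaced by structural hashes."""
    triples = list(triples)
    out_edges, in_edges, nodes = defaultdict(list), defaultdict(list), set()
    for s, p, o in triples:
        if is_instance(s):
            nodes.add(s)
            out_edges[s].append((p, o))
        if is_instance(o):
            nodes.add(o)
            in_edges[o].append((p, s))

    color = dict.fromkeys(nodes, "")  # every instance starts out anonymous

    def label(term):
        # Instances are named by their current hash, everything else by itself.
        return "B:" + color[term] if term in color else "T:" + term

    for _ in range(rounds):
        new_color = {}
        for n in nodes:
            edges = sorted(
                [">" + p + " " + label(o) for p, o in out_edges[n]]
                + ["<" + p + " " + label(s) for p, s in in_edges[n]]
            )
            new_color[n] = _digest([color[n]] + edges)
        color = new_color

    return Counter((label(s), p, label(o)) for s, p, o in triples)


def record(person, encounter, diagnosis, gender, finding):
    """Hypothetical helper: one record in the pattern from the question."""
    return [
        (person, "rdf:type", "ns0:Person"),
        (person, "ns0:gender", gender),
        (person, "ns0:participatesIn", encounter),
        (encounter, "rdf:type", "ns0:HealthCareEncounter"),
        (encounter, "ns0:startDate", '"2019-05-01"^^xsd:date'),
        (encounter, "ns0:hasOutput", diagnosis),
        (diagnosis, "rdf:type", "ns0:Diagnosis"),
        (diagnosis, "ns0:mentions", finding),
    ]


a = record("ns0:4f79ea05-2358-4f43-a335-fe54977c174a",
           "ns0:5d2dfc7b-994c-4933-b787-f7971dae397c",
           "ns0:a129ca96-c6d2-4a07-a4eb-4cf9ce23a314",
           "ns0:Male", "ns0:Headache")
b = record("ns0:a740d254-084c-4621-b06d-4bcdc1c8abf9",
           "ns0:060d2091-b4f7-406d-ab0d-75b39b400823",
           "ns0:bc549711-ed9d-4db6-8cf9-d43022903ef7",
           "ns0:Male", "ns0:Headache")
c = record("ns0:aa3a977a-999a-4c5c-9524-ae00dc86b3bb",
           "ns0:b31a62a5-337a-454d-a637-85aefef26684",
           "ns0:6566d543-773e-4649-b589-66eb3d0f3165",
           "ns0:Female", "ns0:Nausea")

print(shape_signature(a) == shape_signature(b))  # True
print(shape_signature(a) != shape_signature(c))  # True
```

Subtracting one Counter from the other also shows which shapes occur in only one of the two datasets, which may be more useful than a bare yes/no. For a definitive answer you would still feed the blank-node-mapped graphs to a real isomorphism check such as Models.isomorphic.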

Upvotes: 1
