SPARQL: how to find similar strings?

I'm using Jena to query data stored in an ontology. Some of the objects are identified by a string, but the exact string is not always available: I am processing scanned documents, so there may be OCR errors. Therefore, I'd like to find the most similar strings. Is there a way to use SPARQL for this purpose? Can I somehow calculate the Levenshtein distance in SPARQL?

If this is not possible, I can still calculate the Levenshtein distance in Java. However, an efficient algorithm would still need SPARQL to filter out irrelevant strings first.

Upvotes: 4

Answers (3)

Vladimir Alexiev

Reputation: 2601

For Sesame there's fr/sparna/rdf/sesame/toolkit/functions/LevenshteinDistanceFunction, but I can't find the source.

Upvotes: 1

Pedro

Reputation: 4160

In case anyone's interested, this is how I implemented it:

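// Assumed package names: Jena 3+ and Commons Lang 3 (older Jena releases used com.hp.hpl.jena.*)
import org.apache.commons.lang3.StringUtils;
import org.apache.jena.sparql.expr.NodeValue;
import org.apache.jena.sparql.function.FunctionBase2;

// ARQ extension function that returns the Levenshtein distance of its two arguments
public class LevenshteinFilter extends FunctionBase2
{
     @Override
     public NodeValue exec(NodeValue value1, NodeValue value2){
         int i = StringUtils.getLevenshteinDistance(value1.asString(), value2.asString());
         return NodeValue.makeInteger(i);
     }
}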

usage:

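 // Register the function under a URI of your choice (example.org is a placeholder)
 String functionUri = "http://www.example.org/LevenshteinFunction";
 FunctionRegistry.get().put(functionUri, LevenshteinFilter.class);
 String s = "...";
 // The ":" prefix stands in for your ontology's namespace; s must not contain unescaped quotes
 String sparql = "PREFIX : <http://www.example.org/> " +
                 "SELECT ?x WHERE { ?x a :Something . " +
                                   "?x :hasString ?str . " +
                                   "FILTER(<" + functionUri + ">(?str, \"" + s + "\") < 5) }";
 QueryExecution qexec = QueryExecutionFactory.create(sparql, model);
 ResultSet rs = qexec.execSelect();
 while(rs.hasNext()){
     ...
 }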
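
The custom function above is evaluated for every candidate string. One possible way to discard obvious mismatches first is a cheap length check: two strings whose lengths differ by more than k cannot be within Levenshtein distance k. Below is a rough sketch of a helper that builds such a clause from the standard SPARQL 1.1 functions STRLEN and ABS. The class and method names are made up for illustration, and the clause would go in the query before the distance FILTER.

```java
// Hypothetical helper, not part of Jena: builds a length-based pre-filter clause.
final class LengthPrefilter {
    private LengthPrefilter() {}

    // var is the SPARQL variable (e.g. "?str"), target the string being searched for,
    // maxDistance the largest Levenshtein distance that should still match.
    static String lengthFilter(String var, String target, int maxDistance) {
        // STRLEN counts characters, so count code points rather than UTF-16 units.
        int len = target.codePointCount(0, target.length());
        return "FILTER(ABS(STRLEN(" + var + ") - " + len + ") <= " + maxDistance + ")";
    }
}
```

With the "< 5" threshold used above, the call would be lengthFilter("?str", s, 4). Whether the engine actually evaluates this clause first depends on its optimizer, so treat it as a hint rather than a guarantee.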

Upvotes: 4

Gregory Williams

Reputation: 164

SPARQL can't do this directly, but you could implement the Levenshtein distance function in Java and use it in a SPARQL FILTER clause. The "Extensions in ARQ" documentation page has details about using extension functions.
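
For the Java side, a minimal sketch of the distance computation could look like the following. It is the standard two-row dynamic programming formulation, uses only the JDK, and the class name is just a placeholder. Wiring it into ARQ as an extension function is a separate step.

```java
// Placeholder class name; plain JDK, no Jena dependency.
final class Levenshtein {
    private Levenshtein() {}

    // Returns the minimum number of single-character insertions, deletions
    // and substitutions needed to turn a into b.
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, Levenshtein.distance("kitten", "sitting") yields 3. The comparison works on UTF-16 code units, which is fine for most OCR output but may overcount for characters outside the Basic Multilingual Plane.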

Upvotes: 6
