Reputation: 4160
I'm using Jena to query data stored in an ontology. Some of the objects are identified by a string, but the exact string is not always available: I am processing scanned documents, so there may be OCR errors. I'd therefore like to find the most similar strings. Is there a way to use SPARQL for this purpose? Can I somehow calculate the Levenshtein distance in SPARQL?
If this is not possible, I can still calculate the Levenshtein distance in Java. However, an efficient algorithm would still need to filter out irrelevant strings using SPARQL first.
Upvotes: 4
Views: 1995
Reputation: 2601
For Sesame there is fr/sparna/rdf/sesame/toolkit/functions/LevenshteinDistanceFunction,
but I can't find the source.
Upvotes: 1
Reputation: 4160
In case anyone's interested, this is how I implemented it:
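// Imports assume Jena 3+ and Commons Lang 3; Jena 2.x uses the com.hp.hpl.jena prefix instead.
import org.apache.commons.lang3.StringUtils;
import org.apache.jena.sparql.expr.NodeValue;
import org.apache.jena.sparql.function.FunctionBase2;

public class LevenshteinFilter extends FunctionBase2 {
    @Override
    public NodeValue exec(NodeValue value1, NodeValue value2) {
        // Edit distance between the two argument strings
        int i = StringUtils.getLevenshteinDistance(value1.asString(), value2.asString());
        return NodeValue.makeInteger(i);
    }
}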
Usage:
String functionUri = "http://www.example.org/LevenshteinFunction";
FunctionRegistry.get().put(functionUri, LevenshteinFilter.class);
String s = "...";
String sparql = "SELECT ?x WHERE { ?x a Something . " +
        "?x hasString ?str . " +
        "FILTER(<" + functionUri + ">(?str, \"" + s + "\") < 5) }";
QueryExecution qexec = QueryExecutionFactory.create(sparql, model);
ResultSet rs = qexec.execSelect();
while (rs.hasNext()) {
    ...
}
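The FILTER above calls the extension function for every ?str binding. One cheap way to discard clearly irrelevant strings first is a length check: two strings within Levenshtein distance k of each other differ in length by at most k, so a standard SPARQL 1.1 STRLEN bound can sit in front of the custom function. The following is only a sketch of that idea. The class and method names are hypothetical, and it assumes ordinary text where Java's character count and SPARQL's STRLEN agree.

```java
// Hypothetical helper: builds a SPARQL FILTER that keeps only strings whose
// length is within maxDistance of the target string's length.
final class LengthPrefilterSketch {
    private LengthPrefilterSketch() {}

    static String lengthFilter(String var, String target, int maxDistance) {
        // STRLEN counts characters (code points), so count code points here too.
        int len = target.codePointCount(0, target.length());
        int min = Math.max(0, len - maxDistance);
        int max = len + maxDistance;
        return "FILTER(STRLEN(" + var + ") >= " + min
                + " && STRLEN(" + var + ") <= " + max + ")";
    }
}
```

With the "< 5" threshold from the query above, the largest accepted distance is 4, so lengthFilter("?str", s, 4) could be placed in the query before the Levenshtein FILTER. Whether the engine actually evaluates it first depends on the query optimizer.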
Upvotes: 4
Reputation: 164
SPARQL can't do this directly, but you could implement the Levenshtein distance function in Java and use it in a SPARQL FILTER clause. The ARQ documentation page "Extensions in ARQ" has details about using extension functions.
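If you'd rather not pull in a library for the distance itself, a minimal sketch of the standard dynamic-programming version might look like the following. The class name is made up for illustration, and it compares Java char values, so it assumes text without supplementary Unicode characters.

```java
// Hypothetical standalone helper, independent of Jena.
final class LevenshteinSketch {
    private LevenshteinSketch() {}

    /** Number of single-character insertions, deletions and substitutions needed to turn a into b. */
    static int distance(String a, String b) {
        // Only two rows of the DP table are kept: the previous and the current one.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

An ARQ extension function could then call LevenshteinSketch.distance on the two argument strings and wrap the result in an integer NodeValue.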
Upvotes: 6