Reputation: 1742
I have a SPARQL performance issue with DBpedia. I'd like to extract ordered information from the DBpedia SPARQL endpoint page by page. My first example query looked like this:
select distinct ?objProperty ?label where {
?x ?objProperty <http://dbpedia.org/resource/United_States>.
?objProperty a owl:ObjectProperty.
OPTIONAL{?objProperty rdfs:label ?label}
}order by ?label limit 10 offset 3
It took about 2s on average for me (if you try it yourself and see timings under a second, increase 'offset', because DBpedia's Virtuoso seems to cache request results).
However, the result is not suitable for pagination, because it mixes rows with labels in different languages. I want English labels, and for precise pagination I want exactly 10 different object properties returned, ordered by label. OK, another try:
select distinct ?objProperty ?label where {
?a ?objProperty <http://dbpedia.org/resource/United_States>.
?objProperty a owl:ObjectProperty.
OPTIONAL{?objProperty rdfs:label ?label}
FILTER ( LANGMATCHES(lang(?label),"EN") || LANG(?label) = "")
}order by ?label limit 10 offset 3
For me this request returned what I expected, but it took about 7 seconds on average! So slow! Without order by and langMatches, the query takes about 1s on average. Without order by but with langMatches, it takes about 6s, so it seems that langMatches costs about 5s on average for this query.
I do not understand (these are questions by the way):
Am I doing something wrong? :)
Why does langMatches slow the query down so much? I hope langMatches is not regex based? If this performance issue is unavoidable with langMatches, is there a faster way to work with languages? If not, I can't imagine how semantic technologies would conquer the world in the near future as people expect :))
Is there a better (faster) way to build pagination based requests than using limit/offset? If no, what is the best way to avoid performance issues like mentioned above with limit/offset?
Upvotes: 0
Views: 853
Reputation: 85883
1. Am I doing something wrong? :)
I think there's a slight issue that could make your query a bit faster. You've got the ?label as optional, but I think that the filter will only succeed when ?label is bound, effectively making ?label non-optional. My reasoning is as follows: in the case where ?label is not bound, the expression lang(?label) will be an error (unless an implementation extends lang()), and both langMatches and = expect non-error values, so we'd have this reduction:
langMatches(lang(?label),"en") || lang(?label) = ""
langMatches(error, "en") || error = ""
error || error
error (which the filter treats the same as false)
I'm basing this on section 17.2 of the SPARQL 1.1 recommendation, which says:
17.2 Filter Evaluation
- Functions invoked with an argument of the wrong type will produce a type error. Effective boolean value arguments (labeled "xsd:boolean (EBV)" in the operator mapping table below), are coerced to xsd:boolean using the EBV rules in section 17.2.2.
- Apart from BOUND, COALESCE, NOT EXISTS and EXISTS, all functions and operators operate on RDF Terms and will produce a type error if any arguments are unbound.
- Any expression other than logical-or (||) or logical-and (&&) that encounters an error will produce that error.
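The reduction above can be mimicked outside a SPARQL engine. The following is only a rough Python sketch of those evaluation rules, not how Virtuoso or any other engine implements them: the ERROR sentinel, the (value, tag) tuple used to model a literal, and all function names are assumptions made for illustration.

```python
# Sketch of SPARQL-style filter evaluation with errors.
# ERROR is a hypothetical sentinel standing in for a SPARQL type error.
ERROR = object()

def lang(term):
    """Language tag of a literal, or ERROR when the variable is unbound."""
    if term is None:            # None models an unbound ?label
        return ERROR
    _value, tag = term          # a literal is modelled as (lexical form, tag)
    return tag

def lang_matches(tag, lang_range):
    """Simplified langMatches: errors propagate, matching ignores case."""
    if tag is ERROR:
        return ERROR
    if lang_range == "*":
        return tag != ""
    tag, lang_range = tag.lower(), lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")

def equals(a, b):
    """Simplified '=': an error on either side produces an error."""
    if a is ERROR or b is ERROR:
        return ERROR
    return a == b

def logical_or(a, b):
    """'||': true wins over an error; otherwise an error stays an error."""
    if a is True or b is True:
        return True
    if a is ERROR or b is ERROR:
        return ERROR
    return False

def filter_keeps(label):
    """A filter keeps a solution only if the expression is exactly true."""
    tag = lang(label)
    return logical_or(lang_matches(tag, "en"), equals(tag, "")) is True
```

With this model, filter_keeps(None) is False, so a solution with an unbound ?label never survives the filter, which is why the OPTIONAL has no visible effect.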
Based on that, I'd rewrite the query as the following. My impression is that it's a little bit faster, but that might just be confirmation bias. It's not much faster, though.
select distinct ?p ?label where {
?x ?p dbpedia:United_States .
?p a owl:ObjectProperty ;
rdfs:label ?label .
filter( langMatches(lang(?label),"en") || lang(?label) = "" )
}
order by ?label
limit 10
offset 3
2. Why does langMatches slow the query down so much? I hope langMatches is not regex based? If this performance issue is unavoidable with langMatches, is there a faster way to work with languages?
The public DBpedia SPARQL endpoint can be a bit slow at times, but that doesn't seem to be the issue here. When I run your original query, or the new one above, it takes six or seven seconds to get the results. Two things to note, though:
- langMatches isn't regular expression based. The docs for langMatches say that it "Returns true if language-tag (first argument) matches language-range (second argument) per the basic filtering scheme defined in RFC4647 section 3.3.1. language-range is a basic language range per Matching of Language Tags RFC4647 section 2.1. A language-range of "*" matches any non-empty language-tag string." The basic filtering is case insensitive, but it's not regex.
- langMatches isn't the only thing that might be causing slower results. Note that to find the first 10 of something (or, in general, the mth through the nth), you have to visit all the elements. You don't have to sort all of them, but you have to visit all of them, which means that there's no way to get just the results from the desired page (unless there's some special indexing going on; keep making this query and maybe it will speed up over time :)). This leads us into the next point, though.
3. Is there a better (faster) way to build pagination based requests than using limit/offset? If no, what is the best way to avoid performance issues like mentioned above with limit/offset?
While the original and updated queries take six or seven seconds to retrieve the 10 results with limit 10, asking for limit 1000 or limit 5000 also takes only about six or seven seconds. Using limit/offset is the correct way to do pagination, but ordering the results can be expensive, since to find the elements in some particular range, you have to look at all the elements (though you don't necessarily have to order all the elements). It probably makes sense, then, to make those pages as big as possible, and to do any presentation paging locally. E.g., instead of running 100 queries for 10 results each (100 queries × 7 seconds = 700 seconds = 11 minutes and 40 seconds), you can run 1 query for 1000 results (1 query × 7 seconds = 7 seconds), and do any important paged presentation locally.
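As a rough illustration of both points, here is a small Python sketch. It assumes the rows of one large result are already available as a list of dicts with a "label" key; that shape, and the function names page_from_all and local_pages, are made up for this example and say nothing about what the endpoint does internally.

```python
import heapq

def page_from_all(rows, limit, offset):
    """Rows offset..offset+limit in label order, without sorting every row.

    heapq.nsmallest still visits each row once, but it only keeps track of
    the offset + limit smallest candidates instead of ordering everything.
    """
    top = heapq.nsmallest(offset + limit, rows, key=lambda row: row["label"])
    return top[offset:]

def local_pages(rows, page_size):
    """Split one large, already ordered result into presentation pages."""
    return [rows[i:i + page_size] for i in range(0, len(rows), page_size)]

# Example: one big ordered result of 25 rows, paged locally 10 at a time.
rows = [{"label": "label %02d" % i} for i in range(25)]
pages = local_pages(rows, 10)
```

Here pages holds three pages (10, 10 and 5 rows), produced from a single query's worth of data rather than three separate round trips.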
Upvotes: 2
Reputation: 1
How a language filter is handled is up to the SPARQL engine. How does it store literals? Can it use indexes or another technique to avoid a full scan when fetching literals for the desired language?
You can store a literal as the string "chat"@en, but selecting all English literals for a given property would then require scanning all of that property's literals for an @en match.
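To make that scan concrete, here is a minimal Python sketch of such a linear pass, using the basic filtering rule from RFC 4647 section 3.3.1 that langMatches is defined in terms of. The (value, tag) tuples and the function names are assumptions for illustration only; a real engine may store and index literals quite differently.

```python
def basic_filter_match(tag, lang_range):
    """RFC 4647 basic filtering, as used by SPARQL's langMatches.

    The range matches if it equals the tag, or is a prefix of the tag that
    is followed by "-". Comparison ignores case. In SPARQL, "*" matches any
    non-empty tag.
    """
    if lang_range == "*":
        return tag != ""
    tag, lang_range = tag.lower(), lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")

def english_or_plain(literals):
    """Linear scan: every literal of the property has to be inspected."""
    return [value for value, tag in literals
            if basic_filter_match(tag, "en") or tag == ""]

# Hypothetical literals of one property, modelled as (value, language tag).
literals = [("chat", "en"), ("chat", "fr"), ("natter", "en-GB"),
            ("Plausch", "de"), ("talk", "")]
selected = english_or_plain(literals)
```

The match itself is a cheap string comparison rather than a regex; the cost comes from having to apply it to every literal unless the engine keeps some index on language tags.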
In some SPARQL engines, you can get the actual execution plan. For example, here is the way to do it in Virtuoso: Virtuoso execution plan. However, you can't use it on the public endpoint.
Query optimization, execution, and query hints are very well documented for RDBMSs; you can easily find out what the database really does to answer your query and how to modify the schema or query to get the best results. IMHO, SPARQL engines are not that mature in this respect.
Upvotes: 0