Is this a good way to implement a search feature in my Rails application that uses dbpedia and SPARQL? Is there a better way to do this?

Question

I'm trying to put together a "movie search" application using Ruby on Rails 3. I'm pulling data from dbpedia using SPARQL (RDF and sparql/client). I want a potential user to be able to search for a movie, view the results, and then click to view a page that I generate on that movie that contains more information (both from dbpedia and my own local database).

This is my first time using a huge data set and SPARQL and I've noticed it's very slow, and I guess that can't be helped. I would still very much like to use it as a data source though.

I have my rails app set up to use MongoDB, so I was thinking that I can utilize that to cache some of the DBPedia data so users don't need to wait for a query every single time. However I'm stuck on the best way to implement something like this. My current thought is something along these lines:

On the first search ever, I store details for each result in my local database (probably basic movie info such as title, overview, year, alternate titles)

When a user does a search, the following occurs:

Run the search query on my local database to get relevant stored movies (searching title and overview only, most likely). If the movie hasn't been updated from dbpedia in the past X days, I don't include it.
Quickly display those relevant local results to the user and make a list of those movies.
While the user views the stored results, dbpedia gets queried. From this query result I create a list of the relevant results from DBpedia.
I remove any movies from the dbpedia query result set that are already in the initial local result set to prevent the user from seeing duplicate results.
I display the remaining dbpedia query results underneath the local results, and save each of the new non-stored results in my local database (including last_updated time, and updating any existing local items as needed).
When a user clicks through to a movie page, the basic information from dbpedia and my extra info I am storing are already stored locally and can be pulled up on the page quickly, but more advanced information (director, language, location, links to relevant sites) is queried from dbpedia at the time of loading. I show loading dialogs etc. on different sections while the new info is retrieved.

I was thinking of doing something like the above so the user can see a few results quickly while the remaining results get loaded from dbpedia, and I am storing some things but not an insane amount.
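The merge step described above (show fresh local results first, then append only the DBpedia results that are not already shown) could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the record shape (a Hash with `:uri`, `:title`, `:last_updated`), the helper names, and the 7-day window are all assumptions made for illustration, and no MongoDB or SPARQL call is involved.

```ruby
require "set"

# Hypothetical record shape: a Hash with :uri, :title and :last_updated (a Time).
# "X days" from the description, expressed in seconds; 7 is an arbitrary choice.
STALE_AFTER = 7 * 24 * 60 * 60

# Keep only cached movies that were refreshed within the staleness window.
def fresh_local_results(local_movies, now: Time.now)
  local_movies.select { |m| now - m[:last_updated] <= STALE_AFTER }
end

# Drop remote results whose URI is already displayed from the local cache,
# so the user never sees the same movie twice.
def new_remote_results(remote_movies, shown_local)
  shown_uris = shown_local.map { |m| m[:uri] }.to_set
  remote_movies.reject { |m| shown_uris.include?(m[:uri]) }
end
```

Keying the comparison on the resource URI rather than the title avoids treating two different films with the same name as duplicates.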

But I wanted to get some help on whether this is realistic and whether it is a good idea. I can imagine that searching my local db first might skew the user's initial results towards things that have been searched before, and if their particular desired movie (if they put in a title for example) hasn't been searched before it might show up further down the list. Would it make more sense to just store a copy of the relevant data set (i.e. all movies) locally and update it as needed? That would be too much, right?

Anyway I would really appreciate some suggestions on a good way to make things as seamless as possible for the user while still dwelling within the boundaries of sanity. Thanks in advance!

Edit: Here is the code for a test search query I am currently using. I thought I was making it super super basic for testing... but it times out a lot.

query = "
    PREFIX owl: 
    PREFIX xsd: 
    PREFIX rdfs: 
    PREFIX rdf: 
    PREFIX foaf: 
    PREFIX dc: 
    PREFIX : 
    PREFIX dbpedia2: 
    PREFIX dbpedia: 
    PREFIX skos: 
    PREFIX dbo: 

    SELECT ?subject ?label ?abstract ?runtime ?date ?name WHERE {
    {?subject rdf:type }
    UNION
    {?subject rdf:type }.
    OPTIONAL {?subject dbo:runtime ?runtime}.
    OPTIONAL {?subject dbo:releaseDate ?date}.
    OPTIONAL {?subject foaf:name ?name}.
    ?subject rdfs:comment ?abstract.
    ?subject rdfs:label ?label.
    FILTER((lang(?abstract) = 'en') && (lang(?label) = 'en') && REGEX(?label, '" + str + "')).

    }
    LIMIT 30
"
 result = {}
 client = SPARQL::Client.new("http://dbpedia.org/sparql")
 result = client.query(query).each_binding  { |name, value| puts value.inspect }
 return result

William Greenly · Accepted Answer

What is the SPARQL query you are using to query DBpedia? It should be possible to optimise it to improve performance. You should also be able to filter using category URIs, and to use the OFFSET and LIMIT modifiers to reduce the number of results. If you are doing full-text searches, you might also consider the Virtuoso-specific 'bif:contains' property, since it is a bit quicker than regex filters, although it has the downside of being non-standard. Additionally, you can use HTTP caching to speed up repeated searches (the SPARQL protocol operates over HTTP, unsurprisingly).
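As a rough illustration of the 'bif:contains' and LIMIT/OFFSET suggestions, a query builder might look like the sketch below. The function name and its parameters are hypothetical, the phrase quoting follows a pattern commonly seen in Virtuoso examples and should be checked against your endpoint, and the film type restriction is left out because the type URIs are not visible in the question. It uses the squiggly heredoc, so it assumes Ruby 2.3 or later.

```ruby
# Build a paged label search using Virtuoso's non-standard bif:contains.
# Assumes the endpoint is Virtuoso and resolves the bif: prefix itself.
def label_search_query(term, limit: 30, offset: 0)
  # Replace characters that could break out of the quoted search phrase.
  safe = term.gsub(/['"\\]/, " ").strip
  raise ArgumentError, "empty search term" if safe.empty?

  <<~SPARQL
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?subject ?label WHERE {
      ?subject rdfs:label ?label .
      ?label bif:contains "'#{safe}'" .
      FILTER(lang(?label) = 'en')
    }
    LIMIT #{Integer(limit)}
    OFFSET #{Integer(offset)}
  SPARQL
end
```

The returned string can be passed to `client.query` in the same way as the original query. Sanitising the term before interpolating it also closes the injection hole that plain string concatenation leaves open.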

Other than that, instead of putting stuff into MongoDB, you might try simply running your own triplestore and loading movies from DBpedia into it each night.

EDITED based on provision of query

OK, simply by trial and error, the following patterns are causing big problems:

    ?subject rdfs:comment ?abstract.
    ?subject rdfs:label ?label.
    FILTER((lang(?abstract) = 'en') && (lang(?label) = 'en') && REGEX(?label, '" + str + "')).

Filters can be slow, but even without the filters the query times out. I would have been more concerned with the OPTIONAL clauses (OPTIONAL can be slow). Try it without. You might need to run a separate query for the abstracts and labels.
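One way to split the work into two queries is to fetch only the subjects first, then ask for labels and abstracts for just those subjects. The sketch below builds the second query with a SPARQL 1.1 VALUES block. The function name is hypothetical, and it assumes the endpoint supports VALUES and that Ruby 2.3 or later is available.

```ruby
# Step 2 of a two-step lookup: fetch labels and abstracts only for
# subjects that a first, cheaper query already returned.
def details_query(subject_uris)
  # Reject anything that could not be safely wrapped in <...>.
  values = subject_uris.map do |uri|
    raise ArgumentError, "unsafe URI: #{uri}" if uri =~ /[<>"{}|\\^`\s]/
    "<#{uri}>"
  end

  <<~SPARQL
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?subject ?label ?abstract WHERE {
      VALUES ?subject { #{values.join(' ')} }
      ?subject rdfs:label ?label ;
               rdfs:comment ?abstract .
      FILTER(lang(?label) = 'en' && lang(?abstract) = 'en')
    }
  SPARQL
end
```

Because the second query is bound to a small, fixed set of subjects, the endpoint no longer has to join labels and comments across the whole data set.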
