Jean-Pierre Coffe

Reputation: 95

Querying against a Wikipedia RDF file (Turtle format) with Apache Jena

I'm looking for a way to query an RDF file written in Turtle syntax. The file contains the whole Wikipedia category hierarchy, provided by Wikidata.

Here is an extract from the file enwiki-categories.ttl, showing the overall structure of the data:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix mediawiki: <https://www.mediawiki.org/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .

<https://en.wikipedia.org/wiki/Category:1148_establishments_in_France> a mediawiki:Category ;
    rdfs:label "1148 establishments in France" ;
    mediawiki:pages "2"^^xsd:integer ;
    mediawiki:subcategories "0"^^xsd:integer .

<https://en.wikipedia.org/wiki/Category:1148_establishments_in_France> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:1140s_establishments_in_France>,
        <https://en.wikipedia.org/wiki/Category:1148_establishments_by_country>,
        <https://en.wikipedia.org/wiki/Category:1148_establishments_in_Europe>,
        <https://en.wikipedia.org/wiki/Category:1148_in_France>,
        <https://en.wikipedia.org/wiki/Category:Establishments_in_France_by_year> .

My final goal is to retrieve all parent categories of a Wikipedia category by querying the RDF Turtle file. Here is a very short Java example showing my issue:

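import org.apache.jena.atlas.logging.LogCtl;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

LogCtl.setCmdLogging();
Model model = ModelFactory.createDefaultModel();
model.read("enwiki-categories.ttl");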

The RDF Turtle file is well over 850 MB, and loading it into a model with the code above causes an out-of-memory error. I need a way to query the RDF file without loading the full RDF database into memory.

--

Is there a way to do this using Apache Jena or another library?

If not, is there a faster way to retrieve all parent categories of a given Wikipedia category using local data?

Upvotes: 0

Views: 629

Answers (2)

Gilles-Antoine Nys

Reputation: 1481

What you intend to do is called "Broader Concept".

It is formalised in SKOS (skos:broader). Here is the link to the documentation: SKOS

SKOS is defined as follows:

Simple Knowledge Organization System (SKOS) is a common data model for sharing and linking knowledge organization systems via the Web.

For instance, the broader concept of Tree is Plant, and Tree is in turn the broader concept of Pine and Oak.
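To make the Tree/Plant example concrete, here is a minimal sketch in plain Java that follows "broader" links repeatedly to collect every ancestor of a concept. It assumes the direct links (skos:broader, or mediawiki:isInCategory in the question's data) have already been extracted into an in-memory map. The class name BroaderWalk and the method allBroader are hypothetical names chosen for illustration, not part of SKOS or Jena.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BroaderWalk {

    // Breadth-first walk over the "broader" links.
    // The 'seen' set also guards against cycles, which category graphs can contain.
    static Set<String> allBroader(Map<String, List<String>> broader, String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(broader.getOrDefault(start, List.of()));
        while (!queue.isEmpty()) {
            String concept = queue.poll();
            if (seen.add(concept)) {
                // First visit: schedule this concept's own broader concepts.
                queue.addAll(broader.getOrDefault(concept, List.of()));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Hypothetical toy data mirroring the example above.
        Map<String, List<String>> broader = Map.of(
                "Pine", List.of("Tree"),
                "Oak", List.of("Tree"),
                "Tree", List.of("Plant"));
        System.out.println(allBroader(broader, "Pine")); // prints [Tree, Plant]
    }
}
```

The map here is tiny. For the full category dump, the same walk would need the links to come from a store or index rather than from memory.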

Upvotes: 1

Henriette Harmse

Reputation: 4787

Yes, you can do the query with Jena. It is exactly what Jena is designed to do. However, I would suggest importing the file into an RDF data store and then using Jena to run a SPARQL query against that store, so the data stays on disk instead of in memory.

You may want to see my answer to a related question on SO where I give some references to RDF data stores.
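As a rough illustration of the store-then-query approach, here is a minimal sketch that uses Jena's TDB2 as the on-disk store. TDB2 is only one possible choice. The directory name "tdb2-categories", the class name ParentCategories and its methods are assumptions made for this example. The SPARQL property path mediawiki:isInCategory+ follows the isInCategory link one or more times, so it returns direct parents as well as all further ancestors.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.system.Txn;
import org.apache.jena.tdb2.TDB2Factory;

public class ParentCategories {

    // Builds a SELECT query returning every ancestor of the given category IRI.
    static String buildQuery(String categoryIri) {
        return "PREFIX mediawiki: <https://www.mediawiki.org/ontology#>\n"
             + "SELECT DISTINCT ?parent WHERE {\n"
             + "  <" + categoryIri + "> mediawiki:isInCategory+ ?parent .\n"
             + "}";
    }

    // Runs the query inside a read transaction and collects the parent IRIs.
    static List<String> parentsOf(Dataset dataset, String categoryIri) {
        return Txn.calculateRead(dataset, () -> {
            List<String> parents = new ArrayList<>();
            try (QueryExecution qe =
                     QueryExecutionFactory.create(buildQuery(categoryIri), dataset)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    parents.add(results.next().getResource("parent").getURI());
                }
            }
            return parents;
        });
    }

    public static void main(String[] args) {
        // Assumed location of the on-disk store; created on first use.
        Dataset dataset = TDB2Factory.connectDataset("tdb2-categories");

        // One-time import: run with the argument "load" the first time only.
        if (args.length > 0 && args[0].equals("load")) {
            Txn.executeWrite(dataset,
                    () -> RDFDataMgr.read(dataset, "enwiki-categories.ttl"));
        }

        String category =
            "https://en.wikipedia.org/wiki/Category:1148_establishments_in_France";
        parentsOf(dataset, category).forEach(System.out::println);
    }
}
```

For a file of this size, the bulk loader shipped with Jena (tdb2.tdbloader on the command line) may be a better fit for the initial import than loading from Java. The query part stays the same either way.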

Upvotes: 1
