pland

Reputation: 858

How should one filter a list of companies by industry using SPARQL?

I'm trying to create a list of companies within a particular industry type (PaaS/SaaS) using DBpedia and SPARQL. I read this post on creating a list of companies with a certain number of employees, and I wanted to FILTER for a particular industry within a SPARQL query such as this one:

https://gist.github.com/szydan/e801fa687587d9eb0f6a

I tried this query (omitting prefixes here):

CONSTRUCT{
    ?iri a dbpedia-owl:Company;
         foaf:name ?companyName;
         dbpedia-owl:abstract ?description;
         owl:sameAs ?sameAs;
         dbpedia:countryCode ?countryCode;
         sindicetech:locationName ?locationName;
         sindicetech:locationCityName ?locationCityName
}WHERE{
  ?iri a dbpedia-owl:Company.
  OPTIONAL{  
       ?iri dbpedia-owl:abstract ?description.
       FILTER( lang(?description) = "en")
       FILTER (regex(?description, '^platform$')) .
  }
  {
    OPTIONAL{  
      ?iri foaf:name ?companyName.
      FILTER( lang(?companyName) = "en")
    }
  }UNION{
    OPTIONAL{     
      ?iri rdfs:label ?companyName .
      FILTER( lang(?companyName) = "en")
    }
  }
  OPTIONAL{     
      ?iri owl:sameAs ?sameAs
  } 
  {
    OPTIONAL{     
      ?iri dbpedia:locationCountry ?country.
      ?country dbpedia:countryCode ?countryCode 
      FILTER( lang(?countryCode) = "en")
    }
  }UNION{  
    OPTIONAL{     
      ?iri dbpedia-owl:locationCountry ?country.
      ?country dbpedia:countryCode ?countryCode 
      FILTER( lang(?countryCode) = "en")
    } 
  }
  OPTIONAL{
      ?iri dbpedia-owl:location ?location.
      ?location dbpedia:name ?locationName
      FILTER( lang(?locationName) = "en")
  }
  OPTIONAL{
      ?iri dbpedia-owl:locationCity ?locationCity.
      ?locationCity rdfs:label ?locationCityName
      FILTER( lang(?locationCityName) = "en")

  }
}
LIMIT 100

to see if I could find platform-as-a-service companies, but I'm getting all kinds of results that don't have that word in the description. Perhaps the FILTER (regex(?description, '^platform$')) regex is wrong? Is there a way I could filter for:

?industrySector dbpedia-owl:industry <http://dbpedia.org/resource/Platform_as_a_service>

Or perhaps I should be trying to narrow it down by filtering ontologically?

http://mappings.dbpedia.org/index.php/OntologyProperty:Industry

I'm using DBpedia's Virtuoso endpoint to test these queries, and ideally I'd like to arrive at an RDF hierarchy of categories with a CONSTRUCT query that gives me all companies within a particular industry, such as PaaS, SaaS, etc. But I'm not married to CONSTRUCT queries, and I'll take any advice!

Upvotes: 1

Views: 759

Answers (1)

Joshua Taylor

Reputation: 85823

Improving the query that you have

First, two notes.

  1. You should compare language tags with langMatches, not with lang(…) = ….
  2. SPARQL 1.1 includes property paths, which support alternation (p|q), as well as values, which lets you specify the permissible values for a variable. That means that instead of:
  {
    OPTIONAL{  
      ?iri foaf:name ?companyName.
      FILTER( lang(?companyName) = "en")
    }
  }UNION{
    OPTIONAL{     
      ?iri rdfs:label ?companyName .
      FILTER( lang(?companyName) = "en")
    }
  }

you can write either

optional { 
  ?iri rdfs:label|foaf:name ?companyName .
  filter langMatches(lang(?companyName),"en")
}

or

values ?nameProperty { rdfs:label foaf:name }
optional { 
  ?iri ?nameProperty ?companyName .
  filter langMatches(lang(?companyName),"en")
}
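
On the first note, here is a minimal Python sketch of why the two comparisons differ. It assumes langMatches behaves like RFC 4647 basic filtering, which is how SPARQL 1.1 defines it. The function name lang_matches and the sample tags are made up for illustration, and this is not how any particular endpoint implements it.

```python
def lang_matches(tag: str, lang_range: str) -> bool:
    """Approximate SPARQL's langMatches(tag, range) via RFC 4647 basic filtering."""
    tag, lang_range = tag.lower(), lang_range.lower()  # matching is case-insensitive
    if lang_range == "*":
        return tag != ""  # "*" matches any literal that carries a language tag
    return tag == lang_range or tag.startswith(lang_range + "-")


# Hypothetical language tags, as lang(?companyName) might return them.
tags = ["en", "en-GB", "EN-us", "fr", ""]

exact = [t for t in tags if t == "en"]               # lang(?x) = "en"
ranged = [t for t in tags if lang_matches(t, "en")]  # langMatches(lang(?x), "en")

print(exact)
print(ranged)
```

The exact comparison keeps only the literal tagged en, while the range match also keeps regional variants such as en-GB.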

Property paths can make some other parts of your query shorter, too. E.g.,

?iri dbpedia-owl:locationCity ?locationCity.
?locationCity rdfs:label ?locationCityName

can be just:

?iri dbpedia-owl:locationCity/rdfs:label ?locationCityName

since you didn't use ?locationCity anywhere.
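
If it helps to see what the two path operators used above do, here is a rough Python sketch over a tiny in-memory set of triples. The data, the ex: names, and the helper functions are all hypothetical. Real SPARQL engines evaluate paths quite differently, so treat this only as a picture of the semantics.

```python
# Hypothetical triples; the prefixed names are plain strings here.
TRIPLES = {
    ("ex:Acme", "foaf:name", "Acme Inc."),
    ("ex:Acme", "rdfs:label", "Acme"),
    ("ex:Acme", "dbpedia-owl:locationCity", "ex:Berlin"),
    ("ex:Berlin", "rdfs:label", "Berlin"),
}


def objects(subject, predicate):
    """All ?o such that (subject, predicate, ?o) is in the data."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def alternative(subject, *predicates):
    """subject p1|p2 ?o : union of the objects reached by any of the predicates."""
    return set().union(*(objects(subject, p) for p in predicates))


def sequence(subject, first, second):
    """subject first/second ?o : follow `first`, then `second`; the middle node is not kept."""
    return {o for mid in objects(subject, first) for o in objects(mid, second)}


names = alternative("ex:Acme", "rdfs:label", "foaf:name")
cities = sequence("ex:Acme", "dbpedia-owl:locationCity", "rdfs:label")
print(sorted(names))
print(sorted(cities))
```

The sequence helper never exposes the intermediate city node, which is why the shortened form only works when you don't need ?locationCity elsewhere.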

Finally, as to

I'm getting all kinds of results that don't have that word in the description. Perhaps the FILTER (regex(?description, '^platform$')) regex is wrong?

The regular expression doesn't quite do what you want it to:

FILTER (regex(?description, '^platform$'))

That will only match when the entire string is exactly "platform", because ^ and $ anchor the pattern to the start and end of the string. It seems more like you'd want to check whether the description contains the word platform, in which case you can use contains, as in contains(?description,"platform"). But even if you update like that, you'll have

optional {
  ?iri dbpedia-owl:abstract ?description.
  filter contains(?description,"platform")
  filter langMatches(lang(?description),"en")
}

and that's still inside an optional block. The whole point of optional is that you can get results even if the optional part doesn't match. If you want to require that there is a description that contains the word platform, don't make it optional.
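
Both points can be illustrated with a small Python sketch. The company IRIs and abstracts are invented, each company is assumed to have at most one abstract, and Python's re module only approximates the XPath-style regular expressions that SPARQL uses. The helper names are hypothetical too.

```python
import re

# Hypothetical data: one English abstract per company; ex:C has none.
ABSTRACTS = {
    "ex:A": "A cloud platform as a service provider.",
    "ex:B": "A chain of bakeries.",
}
COMPANIES = ["ex:A", "ex:B", "ex:C"]


def anchored(text):
    """Like regex(?description, '^platform$'): the whole string must be 'platform'."""
    return re.search(r"^platform$", text) is not None


def contains(text):
    """Like contains(?description, 'platform'): a case-sensitive substring test."""
    return "platform" in text


def optional_rows(keep):
    """Filter inside OPTIONAL: every company stays; description is None when the optional part fails."""
    rows = []
    for iri in COMPANIES:
        text = ABSTRACTS.get(iri)
        rows.append((iri, text if text is not None and keep(text) else None))
    return rows


def required_rows(keep):
    """Same pattern outside OPTIONAL: companies without a matching abstract are dropped."""
    return [(iri, ABSTRACTS[iri]) for iri in COMPANIES
            if iri in ABSTRACTS and keep(ABSTRACTS[iri])]


print(optional_rows(anchored))   # all three companies, no descriptions
print(optional_rows(contains))   # all three companies, one description
print(required_rows(contains))   # only the company whose abstract mentions "platform"
```

With the anchored pattern inside the optional block, every company comes back without a description, which matches the symptom in the question. Only the required, substring-based version narrows the result to matching companies.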

After all that, your query becomes:

prefix sindicetech: <urn:ex:sindicetech:>

construct {
    ?iri a dbpedia-owl:Company ;
         foaf:name ?companyName ;
         dbpedia-owl:abstract ?description ;
         owl:sameAs ?sameAs ;
         dbpedia:countryCode ?countryCode ;
         sindicetech:locationName ?locationName ;
         sindicetech:locationCityName ?locationCityName
}
where {
  ?iri a dbpedia-owl:Company ;
       dbpedia-owl:abstract ?description .
  filter langMatches(lang(?description),"en") .
  filter contains(?description,"platform") .
  optional {
    ?iri foaf:name|rdfs:label ?companyName.
    filter langMatches(lang(?companyName),"en")
  }
  optional {     
    ?iri owl:sameAs ?sameAs
  } 
  optional {
    ?iri (dbpedia:locationCountry|dbpedia-owl:locationCountry)/dbpedia:countryCode ?countryCode .
    filter langMatches(lang(?countryCode),"en")
  }
  optional {
    ?iri dbpedia-owl:location/dbpedia:name ?locationName
    filter langMatches(lang(?locationName),"en")
  }
  optional {
    ?iri dbpedia-owl:locationCity/rdfs:label ?locationCityName
    filter langMatches(lang(?locationCityName),"en")
  }
}
limit 100

SPARQL results

You can see that the results are about companies with "platform" in their descriptions.

Note that none of them have any dbpedia:countryCode properties though. I don't know where you found that property, but it doesn't appear to be used anywhere in DBpedia. The query select (count(*) as ?n) { ?x dbpedia:countryCode ?y } returns 0.

A different approach

Is there a way I could filter for:

?industrySector dbpedia-owl:industry <http://dbpedia.org/resource/Platform_as_a_service>

If you look at http://dbpedia.org/resource/Platform_as_a_service you'll see that it's related to a number of companies (but not all that many) by a few different properties:

(image: DBpedia triples)

You might just ask for anything that's a company that's related to this by any property. E.g.,

select distinct ?company where {
  ?company a dbpedia-owl:Company ;
           ?property dbpedia:Platform_as_a_service .
}

SPARQL results

You can use that approach to construct more detailed information, too. I'd end up with something like this:

prefix sindicetech: <urn:ex:sindicetech:>

construct {
  ?company a dbpedia-owl:Company ;
           foaf:name ?label ;
           dbpedia-owl:abstract ?abstract ;
           owl:sameAs ?_company ;
           sindicetech:location [ sindicetech:city ?city ;
                                  sindicetech:country ?country ] .
}
where {
  ?company a dbpedia-owl:Company ;
           ?property dbpedia:Platform_as_a_service ;
           rdfs:label ?label ;
           dbpedia-owl:abstract ?abstract .
  filter langMatches(lang(?label),"en")
  filter langMatches(lang(?abstract),"en")
  optional {
    ?company owl:sameAs ?_company
  }
  optional { 
    ?company dbpedia-owl:location [ rdfs:label ?city ;
                                    dbpedia-owl:country/rdfs:label ?country ] .
    filter langMatches(lang(?city),"en")
    filter langMatches(lang(?country),"en")
  }
}

SPARQL results

Upvotes: 5
