Wajora

Reputation: 53

Preferred rank in Wikidata not working properly for population in some cases?

So I'm currently working on a project that uses data from Wikidata, and I noticed a lot of duplicate elements in my database. The reason is that I'm receiving population numbers for different points in time.

I've read that Wikidata has rankings for statements with multiple values, and for the population property the preferred one seems to be the most recent value, which is true for about 99.9% of the entries. What I don't understand is why it doesn't work for the other 0.1%.

One example would be: Wikidata query

The same happens for example with the elements

and I have no idea why.

I've already tried the solution from this topic but it didn't change the result.

Any ideas?


Edit based on the filter option from the thread: wikidata query 2

Edit 2: Full query

Upvotes: 3

Views: 619

Answers (1)

Stanislav Kralin

Reputation: 11479

Some Wikidata properties are processed by PreferentialBot (source code).

In short, the bot makes the most recent statements preferred, hence making them truthy.

Sometimes the bot does not process the statements for a property. For example, it skips items where some of the statements lack the corresponding qualifier (for population, that is P585, "point in time").

In your particular case:

SELECT DISTINCT ?city ?cityLabel ?population ?date ?rank WHERE {
  VALUES (?settlement) {(wd:Q515) (wd:Q15284)}
  VALUES (?city) {(wd:Q1658752)}
  ?city wdt:P31/wdt:P279* ?settlement . 
  ?city p:P1082 ?statement .
  ?statement ps:P1082 ?population .
  ?statement wikibase:rank ?rank .
  OPTIONAL { ?statement pq:P585 ?date }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
} ORDER BY ?date

Try it

Results:

+-------------+-----------+------------+----------------------+---------------------+
|    city     | cityLabel | population |        date          |         rank        |
+-------------+-----------+------------+----------------------+---------------------+
| wd:Q1658752 | Kagan     |      86745 |                      | wikibase:NormalRank |
| wd:Q1658752 | Kagan     |      17656 | 1939-01-01T00:00:00Z | wikibase:NormalRank |
| wd:Q1658752 | Kagan     |      21103 | 1959-01-01T00:00:00Z | wikibase:NormalRank |
| wd:Q1658752 | Kagan     |      34117 | 1970-01-01T00:00:00Z | wikibase:NormalRank |
| wd:Q1658752 | Kagan     |      41565 | 1979-01-01T00:00:00Z | wikibase:NormalRank |
| wd:Q1658752 | Kagan     |      48054 | 1989-01-01T00:00:00Z | wikibase:NormalRank |
+-------------+-----------+------------+----------------------+---------------------+

Would you prefer the most recent statement or the "eternal" one?

This is how you can find the most recent population:

SELECT DISTINCT ?city ?cityLabel ?population WHERE {
  VALUES (?settlement) {(wd:Q515) (wd:Q15284)}
  VALUES (?city) {(wd:Q1658752)}
  ?city wdt:P31/wdt:P279* ?settlement . 
  ?city p:P1082 [ ps:P1082 ?population; pq:P585 ?date1 ]  
  FILTER NOT EXISTS {
    ?city p:P1082 [ pq:P585 ?date2 ]
    FILTER (?date2 > ?date1) }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}

Try it

This is how you can find the "eternal" one:

SELECT DISTINCT ?city ?cityLabel ?population WHERE {
  VALUES (?settlement) {(wd:Q515) (wd:Q15284)}
  VALUES (?city) {(wd:Q1658752)}
  ?city wdt:P31/wdt:P279* ?settlement . 
  ?city p:P1082 ?statement .
  ?statement ps:P1082 ?population .
  FILTER NOT EXISTS {?statement pq:P585 []}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }   
}

Try it
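If you would rather deduplicate on the client side after fetching all statements, the "most recent" choice can be sketched in Python as below. This is only a minimal illustration, not part of any Wikidata tooling: the function name `latest_dated` and the `(population, date)` row shape are assumptions, and it relies on the dates being ISO 8601 strings in the same format as in the results table above, so that plain string comparison orders them chronologically.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical row shape: (population, point-in-time as ISO 8601 string or None).
Row = Tuple[int, Optional[str]]


def latest_dated(rows: Sequence[Row]) -> Optional[Row]:
    """Return the row with the latest date, ignoring undated rows.

    Returns None when no row carries a date.
    """
    dated = [row for row in rows if row[1] is not None]
    if not dated:
        return None
    # Assumption: all dates share one ISO 8601 format (positive years),
    # so lexicographic order equals chronological order.
    return max(dated, key=lambda row: row[1])


# The Kagan (Q1658752) rows from the results table above.
kagan_rows = [
    (86745, None),
    (17656, "1939-01-01T00:00:00Z"),
    (21103, "1959-01-01T00:00:00Z"),
    (34117, "1970-01-01T00:00:00Z"),
    (41565, "1979-01-01T00:00:00Z"),
    (48054, "1989-01-01T00:00:00Z"),
]

print(latest_dated(kagan_rows))  # (48054, '1989-01-01T00:00:00Z')
```

Note that the undated 86745 statement is dropped here, just as in the first query; whether that is the value you actually want is the same "most recent or eternal" question as above.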


In fact, almost 70% (not 0.1%) of entries with the P1082 property have no preferred statement for this property. You probably mean entries with the P1082 property that have more than one truthy statement for it. Recall that:

Truthy statements represent statements that have the best non-deprecated rank for given property. Namely, if there is a preferred statement for property P2, then only preferred statements for P2 will be considered truthy. Otherwise, all normal-rank statements for P2 are considered truthy.
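To make the rule concrete, here is a small Python sketch of it. It is an illustration of the definition quoted above, not code taken from Wikibase; the function name `truthy` and the `(value, rank)` pair shape are assumptions, while the rank strings are the ones returned by `wikibase:rank` in the query above.

```python
from typing import Iterable, List, Tuple

PREFERRED = "wikibase:PreferredRank"
NORMAL = "wikibase:NormalRank"
DEPRECATED = "wikibase:DeprecatedRank"

# Hypothetical statement shape: (value, rank) for one property of one item.
Statement = Tuple[int, str]


def truthy(statements: Iterable[Statement]) -> List[Statement]:
    """Return the statements that count as truthy under the quoted rule."""
    statements = list(statements)
    preferred = [s for s in statements if s[1] == PREFERRED]
    if preferred:
        # Any preferred statement hides all normal-rank ones.
        return preferred
    # No preferred statement: every normal-rank statement is truthy.
    return [s for s in statements if s[1] == NORMAL]


# Six normal-rank statements, as for Kagan: all six are truthy.
all_normal = [(v, NORMAL) for v in (86745, 17656, 21103, 34117, 41565, 48054)]
print(len(truthy(all_normal)))  # 6

# Once one statement is preferred, it is the only truthy one.
one_preferred = all_normal[:-1] + [(48054, PREFERRED)]
print(truthy(one_preferred))  # [(48054, 'wikibase:PreferredRank')]
```

This is why an item the bot has not processed returns several `wdt:P1082` values: with no preferred statement, all normal-rank statements come back.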

And yes, about 0.5% of the entries that have P1082 statements have two or more truthy P1082 statements.

Upvotes: 1
