Reputation: 53
I'm working on a project that uses data from Wikidata, and I noticed a lot of duplicate entries in my database. The reason is that I'm receiving population numbers for different points in time.
I've read that Wikidata ranks statements when a property has multiple values, and that for the population property the preferred value seems to be the most recent one, which is true for about 99.9% of the entries. What I don't understand is why it doesn't work for the other 0.1%.
One example would be: Wikidata query
The same happens for example with the elements
and I have no idea why.
I've already tried the solution from this topic but it didn't change the result.
Any ideas?
Edit based on the filter option from the thread: wikidata query 2
Edit 2: Full query
Upvotes: 3
Views: 619
Reputation: 11479
Some Wikidata properties are processed by PreferentialBot (source code).
In short, the bot makes the most recent statements preferred, hence making them truthy.
Sometimes the bot does not process the statements for a property. For example, it skips items that have statements without the respective qualifiers (such as the P585 point-in-time qualifier used in the queries below).
In your particular case:
SELECT DISTINCT ?city ?cityLabel ?population ?date ?rank WHERE {
VALUES (?settlement) {(wd:Q515) (wd:Q15284)}
VALUES (?city) {(wd:Q1658752)}
?city wdt:P31/wdt:P279* ?settlement .
?city p:P1082 ?statement .
?statement ps:P1082 ?population .
?statement wikibase:rank ?rank
OPTIONAL { ?statement pq:P585 ?date }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
} ORDER BY ?date
Results:
+-------------+-----------+------------+----------------------+---------------------+
| city | cityLabel | population | date | rank |
+-------------+-----------+------------+----------------------+---------------------+
| wd:Q1658752 | Kagan | 86745 | | wikibase:NormalRank |
| wd:Q1658752 | Kagan | 17656 | 1939-01-01T00:00:00Z | wikibase:NormalRank |
| wd:Q1658752 | Kagan | 21103 | 1959-01-01T00:00:00Z | wikibase:NormalRank |
| wd:Q1658752 | Kagan | 34117 | 1970-01-01T00:00:00Z | wikibase:NormalRank |
| wd:Q1658752 | Kagan | 41565 | 1979-01-01T00:00:00Z | wikibase:NormalRank |
| wd:Q1658752 | Kagan | 48054 | 1989-01-01T00:00:00Z | wikibase:NormalRank |
+-------------+-----------+------------+----------------------+---------------------+
Would you prefer the most recent dated statement or the "eternal" (undated) one?
This is how you can find the most recent population:
SELECT DISTINCT ?city ?cityLabel ?population WHERE {
VALUES (?settlement) {(wd:Q515) (wd:Q15284)}
VALUES (?city) {(wd:Q1658752)}
?city wdt:P31/wdt:P279* ?settlement .
?city p:P1082 [ ps:P1082 ?population; pq:P585 ?date1 ]
FILTER NOT EXISTS {
?city p:P1082 [ pq:P585 ?date2 ]
FILTER (?date2 > ?date1) }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}
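The same selection can be sketched client-side, for cases where the statements have already been fetched. This is only an illustration under stated assumptions: the list-of-dicts layout and the name `most_recent_population` are hypothetical, and the data is copied from the results table above. Unlike the query, which returns every statement that has no later date, this sketch returns a single value.

```python
from typing import Dict, List, Optional

# Rows from the results table above; the dict layout is an assumption
# made for this sketch, not a Wikidata API format.
KAGAN: List[Dict] = [
    {"population": 86745, "date": None},
    {"population": 17656, "date": "1939-01-01"},
    {"population": 21103, "date": "1959-01-01"},
    {"population": 34117, "date": "1970-01-01"},
    {"population": 41565, "date": "1979-01-01"},
    {"population": 48054, "date": "1989-01-01"},
]


def most_recent_population(statements: List[Dict]) -> Optional[int]:
    """Return the population from the statement with the latest date."""
    # Statements without a point-in-time qualifier cannot be ordered, so skip them.
    dated = [s for s in statements if s["date"] is not None]
    if not dated:
        return None
    # ISO dates in the same YYYY-MM-DD format sort correctly as plain strings.
    return max(dated, key=lambda s: s["date"])["population"]


print(most_recent_population(KAGAN))  # 48054
```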
This is how you can find the "eternal" one:
SELECT DISTINCT ?city ?cityLabel ?population WHERE {
VALUES (?settlement) {(wd:Q515) (wd:Q15284)}
VALUES (?city) {(wd:Q1658752)}
?city wdt:P31/wdt:P279* ?settlement .
?city p:P1082 ?statement .
?statement ps:P1082 ?population .
FILTER NOT EXISTS {?statement pq:P585 []}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}
In fact, almost 70% (not 0.1%) of entries with the P1082 property do not have a preferred statement for this property. You probably mean entries with the P1082 property that have more than one truthy statement for this property. Recall that:
Truthy statements represent statements that have the best non-deprecated rank for given property. Namely, if there is a preferred statement for property P2, then only preferred statements for P2 will be considered truthy. Otherwise, all normal-rank statements for P2 are considered truthy.
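As a rough illustration of that rule, here is a minimal Python sketch. The list-of-dicts layout, the lowercase rank strings and the name `truthy_statements` are assumptions made for this example, not part of any Wikidata API; the data is the Kagan result from above.

```python
from typing import Dict, List

# Statements for one property of one item, copied from the Kagan results above.
KAGAN: List[Dict] = [
    {"population": 86745, "date": None, "rank": "normal"},
    {"population": 17656, "date": "1939-01-01", "rank": "normal"},
    {"population": 21103, "date": "1959-01-01", "rank": "normal"},
    {"population": 34117, "date": "1970-01-01", "rank": "normal"},
    {"population": 41565, "date": "1979-01-01", "rank": "normal"},
    {"population": 48054, "date": "1989-01-01", "rank": "normal"},
]


def truthy_statements(statements: List[Dict]) -> List[Dict]:
    """Return the statements with the best non-deprecated rank."""
    # Deprecated statements are never truthy.
    candidates = [s for s in statements if s["rank"] != "deprecated"]
    preferred = [s for s in candidates if s["rank"] == "preferred"]
    # If any preferred statement exists, only preferred ones are truthy;
    # otherwise every remaining normal-rank statement is truthy.
    return preferred if preferred else candidates


# No statement is preferred here, so all six normal-rank statements are truthy.
print([s["population"] for s in truthy_statements(KAGAN)])
```

Because none of the six statements is preferred, a `wdt:P1082` lookup for this item yields six values, which is where the duplicates come from.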
And yes, about 0.5% of the entries that have P1082 statements have two or more truthy P1082 statements.
Upvotes: 1