Debs
Debs

Reputation: 133

Querying page and Scraping it using Sheets

I wanna use Sheets to query pages from wikidata and scrape a specific section but I couldn't find anything focused specifically on this, and since I'm a beginner in this I don't know really where to start from. So, I have a list of Q identifiers and I'd like to use them to query the page, and then check if there's a specific section there (or scrape the data from it if possible) otherwise return false. I started with what I found here

=ImportXml(concat("https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&
ppprop=wikibase_item&redirects=1&format=xml&titles=",G1),"//@wikibase_item")

but used instead the link for wiki data (it generates a working link, but I'm not sure if this works as the equivalent of an API and if I can further query it to get data) and the wiki Property code that I wanna get (date of death, /wiki/Property:P570), but I get the "Imported content is empty." error, for this link. Ideally, I would've like to get the date of death value (20 November 2014), or at least a TRUE , meaning that the section exists, and the person is dead.

=IMPORTXML(CONCAT("https://www.wikidata.org/wiki/",A2),"/wiki/Property:P570")

So I might have some Q-links that don't have this section/property at all, and for which I should get an error, but I don't know why it's not working for this one either, do I have to make Xpath to div, or can I use the wiki Property?

I hope this makes sense, I'll put the sample sheet here. Thanks

Upvotes: 2

Views: 225

Answers (1)

player0
player0

Reputation: 1

try:

=QUERY(IMPORTXML("https://www.wikidata.org/wiki/"&A1, "//*"), 
 "select Col2 where Col1 = 'date of death' and Col2 is not null")

enter image description here

or:

=QUERY(IMPORTXML("https://www.wikidata.org/wiki/"&A1, "//*"), 
 "select Col2 where Col1 = 'date of death' and Col2 is not null")<>""

and for no match:

=IFERROR(QUERY(IMPORTXML("https://www.wikidata.org/wiki/"&A1, "//*"), 
 "select Col2 where Col1 = 'date of death' and Col2 is not null"), FALSE)

enter image description here


=IFERROR(REGEXEXTRACT(QUERY(IMPORTXML("wikidata.org/wiki/"&A2, "//*"), 
 "select Col2 where Col1 = 'date of birth' and Col2 is not null"), 
 "(.*) \d.*reference.*"), FALSE)

Upvotes: 4

Related Questions