Reputation: 11
I am trying to scrape this webpage: https://cordis.europa.eu/project/id/654008 to obtain in tabular format some of the values in XML: legalName, shortName, country, id, city, vatNumber, ecContribution, etc.
Here is an example with headers of the table I wish to obtain.
I have tried using IMPORTXML in Google Sheets but with no great success so far. My problem is that I cannot identify the correct xpath query, even when using chrome extensions like SelectorGadget and XPath Helper who generally helps you find them.
One note: it seems that the "coordinator" table use different html tags. Do you have any suggestions to correctly scrape this page? Thank you in advance!
Upvotes: 1
Views: 81
Reputation: 589
When we try to fetch this page directly it returns JavaScript is disabled on your browser. But when we analyze requests of the page we find a request than returns all data you need as a JSON. It's
https://cordis.europa.eu/api/details?contenttype=project&rcn=654008&lang=en¶mType=id
We fetch this page via URLFetchApp, and push data we wish to an array, then write them to our sheet.
function fetch() {
var url = 'https://cordis.europa.eu/api/details?contenttype=project&rcn=654008&lang=en¶mType=id';
var response = UrlFetchApp.fetch(url);
var json = JSON.parse(response.getContentText()).payload.organizations.participants;
var array = [];
json.forEach(el => {
array.push([el.name, , el.country.name, el.organizationId, el.address.city, , el.contribution]);
})
var ss = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('main');
var myRange = ss.getRange(2, 1, array.length, 7);
myRange.setValues(array);
}
Upvotes: 2
Reputation: 1
maybe this will help:
=INDEX(REGEXEXTRACT(QUERY(FLATTEN(IMPORTDATA(
"https://cordis.europa.eu/project/id/654008/en?format=xml")),
"where Col1 contains 'country'"), ">(.*)<"))
Upvotes: 1