Cesare
Cesare

Reputation: 1749

How to parse XML in Google Refine to extract data?

I need to parse an XML using Google Refine to extract some data from it.

The XML is something like this one

<wfs:FeatureCollection xsi:schemaLocation="http://mapserver.gis.umn.edu/mapserver http://wms.pcn.minambiente.it/ogc?map=/ms_ogc/wfs/Numeri_Civici_2012.map&SERVICE=WFS&VERSION=1.1.0&REQUEST=DescribeFeatureType&TYPENAME=IN.NUMERICIVICI.2012&OUTPUTFORMAT=text/xml;%20subtype=gml/3.1.1  http://www.opengis.net/wfs http://schemas.opengis.net/wfs/1.1.0/wfs.xsd"><gml:boundedBy><gml:Envelope srsName="EPSG:4326"><gml:lowerCorner>45.808287 9.575815</gml:lowerCorner><gml:upperCorner>45.808287 9.575815</gml:upperCorner></gml:Envelope></gml:boundedBy><gml:featureMember><ms:IN.NUMERICIVICI.2012 gml:id="IN.NUMERICIVICI.2012.1225789"><gml:boundedBy><gml:Envelope srsName="EPSG:4326"><gml:lowerCorner>45.808287 9.575815</gml:lowerCorner><gml:upperCorner>45.808287 9.575815</gml:upperCorner></gml:Envelope></gml:boundedBy><ms:boundary><gml:Point srsName="EPSG:4326"><gml:pos>45.808287 9.575815</gml:pos></gml:Point></ms:boundary><ms:id>13800026062251</ms:id><ms:nome>Via Milano</ms:nome><ms:civico>55</ms:civico><ms:istat>03016023</ms:istat><ms:cap>24030</ms:cap><ms:comune>BERBENNO</ms:comune><ms:nome_ted> </ms:nome_ted><ms:provincia>BERGAMO</ms:provincia><ms:regione>LOMBARDIA</ms:regione></ms:IN.NUMERICIVICI.2012></gml:featureMember></wfs:FeatureCollection>

and I'd like to extract the coordinates about this tag

<gml:Point srsName="EPSG:4326"><gml:pos>45.808287 9.575815</gml:pos></gml:Point>

so the single numbers 45.808287 and 9.575815 in two different fields

I've tried something like

value.parseHtml().select("Point|pos")......

but I don't know how to continue.

Any suggestions?

Upvotes: 3

Views: 1960

Answers (2)

Cesare
Cesare

Reputation: 1749

I've found the solution ... Here you're how ...

toNumber(split(trim(substring(value.parseHtml().select("gml|Point gml|pos")[0].toString(),10, -10)), " ")[0])

and

toNumber(split(trim(substring(value.parseHtml().select("gml|Point gml|pos")[0].toString(),10, -10)), " ")[1])

These links are very useful for this stuff ....

Upvotes: 2

Tom Morris
Tom Morris

Reputation: 10540

Congratulations on finding the solution on your own (and posting it, along with references, to benefit others).

I'd actually suggest doing it slightly differently:

value.parseHtml().select('gml|Point gml|pos')[0].htmlText().trim().split(' ')[0].toNumber()

The main important difference is using the htmlText() to extract the text content rather than converting the element to a string (with tags) and then removing the tags using a fixed substring.

The other, purely stylistic, change I made was to use function chaining so that the order of operations reads left to right rather than inside to outside.

The trim() that I included isn't necessary in your example, but I included it as good hygiene. A minimal version would be:

value.parseHtml().select('gml|Point gml|pos')[0].htmlText().split(' ')[0].toNumber()
value.parseHtml().select('gml|Point gml|pos')[1].htmlText().split(' ')[0].toNumber()

You could also use the forEach operator to get the coordinates in an array if that's more useful for your use case:

forEach(value.parseHtml().select('gml|Point gml|pos')[0].htmlText().split(' '), n, n.toNumber())

Upvotes: 3

Related Questions